Pipeline Architecture: Pipe-and-Filter for Data Transforms

What This Concept Is

Pipeline architecture (historically pipe-and-filter) is one of the oldest patterns in software. It models a system as a sequence of stages where:

Filters do one unit of work -- transform, enrich, split, aggregate.
Pipes connect filters, carrying data one-way from one stage to the next.

[source] ==pipe==> [filter A] ==pipe==> [filter B] ==pipe==> [filter C] ==pipe==> [sink]

Filters are typically:

self-contained (one responsibility, no shared state with siblings)
stateless (nothing is retained between items, so the same input always produces the same output)
independent (each knows its input contract and output contract, nothing else)

There are four canonical filter roles:

Role	Purpose	Example
Producer	Only emits data	read CSV file
Transformer	1 in, 1 out (changed)	parse JSON, apply schema
Tester / Splitter	1 in, multiple outputs (routing)	valid vs invalid rows
Consumer / Sink	Only receives	write to database
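
The four roles can be sketched in Python, with generators standing in for in-memory pipes. This is a minimal illustration, not a reference implementation: the function names, the `user`/`amount` columns, and the sample rows are all hypothetical.

```python
import csv
import io
from typing import Iterable, Iterator

def produce_rows(text: str) -> Iterator[dict]:
    """Producer: only emits data (here, rows parsed from CSV text)."""
    yield from csv.DictReader(io.StringIO(text))

def normalize(rows: Iterable[dict]) -> Iterator[dict]:
    """Transformer: one row in, one changed row out."""
    for row in rows:
        yield {"user": row["user"].strip().lower(), "amount": row["amount"]}

def split_valid(rows: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Tester / splitter: routes each row to one of two outputs."""
    valid, invalid = [], []
    for row in rows:
        (valid if row["amount"].isdigit() else invalid).append(row)
    return valid, invalid

def sink(rows: Iterable[dict], store: list) -> int:
    """Consumer / sink: only receives (a list stands in for a database)."""
    count = 0
    for row in rows:
        store.append(row)
        count += 1
    return count

# Hypothetical sample input: one clean row, one row with a bad amount.
SAMPLE = "user,amount\n Ada ,10\nBob,oops\n"
stored: list[dict] = []
valid, invalid = split_valid(normalize(produce_rows(SAMPLE)))
written = sink(valid, stored)
```

Each function knows only its own input and output shape, which is the "independent" property listed above.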

The classic example is the Unix shell: cat access.log | grep '500' | awk '{print $7}' | sort | uniq -c | sort -rn. Each tool is a filter; | is a pipe.
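
The same shape can be approximated in Python to show that the shell command is filter composition. This is a rough sketch: the helper names and log lines are invented, and tie ordering may differ from what sort -rn produces.

```python
from collections import Counter
from typing import Iterable, Iterator

def grep(lines: Iterable[str], needle: str) -> Iterator[str]:
    """Like grep: pass through only lines containing the substring."""
    return (line for line in lines if needle in line)

def field(lines: Iterable[str], index: int) -> Iterator[str]:
    """Like awk '{print $N}': emit the N-th whitespace-separated field (1-based)."""
    for line in lines:
        parts = line.split()
        yield parts[index - 1] if len(parts) >= index else ""

def count_desc(values: Iterable[str]) -> list[tuple[str, int]]:
    """Like sort | uniq -c | sort -rn: occurrence counts, highest first."""
    return Counter(values).most_common()

# Hypothetical access-log lines; field 7 is the request path.
LOG = [
    '1.2.3.4 - - [01/Jan/2024:00:00:01 +0000] "GET /a HTTP/1.1" 500 12',
    '1.2.3.4 - - [01/Jan/2024:00:00:02 +0000] "GET /b HTTP/1.1" 200 34',
    '5.6.7.8 - - [01/Jan/2024:00:00:03 +0000] "GET /a HTTP/1.1" 500 12',
    '5.6.7.8 - - [01/Jan/2024:00:00:04 +0000] "GET /c HTTP/1.1" 500 56',
]
top = count_desc(field(grep(LOG, "500"), 7))
```

As with grep '500', the substring match is loose: it would also match a line whose byte count or timestamp happens to contain 500.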

Why It Matters Here

Pipeline is the right default when:

the system's main job is transforming data, not responding to users
stages are naturally sequential and independent
the output of the process is the reason the process exists (a report or derived dataset, not a database that the system keeps serving)

This covers a large portion of real engineering work that gets miscategorized as "a service" or "a batch job":

ETL, ELT, data-lake ingestion
log processing, analytics enrichment
image/video transcoding pipelines
build systems (compile, link, package, sign)
ML training data preparation

If you force these into a layered architecture, you end up with either a degenerate single-layer "service" or, worse, a layered app that grows a shadow pipeline inside its business layer. Pipeline is not a "basic" style; it is an explicit style, and calling it out in an ADR prevents it from being accidentally disguised as layered.

Concrete Example

ETL for daily analytics, drawn as a mermaid pipeline:

What makes this pipeline and not just a set of functions:

each stage is independently runnable and testable
adding a new enrichment stage does not change the parser
a failure in one stage is local; failed messages stay in the pipe or go to a dead-letter queue rather than crashing the whole process
the topology is visible and can be composed (swap warehouse sink for lake sink without touching earlier stages)
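
These properties can be sketched in a few lines of Python. The stage names, the `status` field, and the list used as a dead-letter store are assumptions made for illustration; they are not the stages of the diagram.

```python
import json
from typing import Callable, Iterable, Iterator

dead_letter: list[tuple[str, str]] = []  # hypothetical dead-letter store

def parse(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSON lines; a bad line goes to dead_letter instead of raising."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            dead_letter.append((line, str(exc)))

def enrich(events: Iterable[dict]) -> Iterator[dict]:
    """Add a derived field without knowing how the events were parsed."""
    for event in events:
        yield {**event, "is_error": event.get("status", 0) >= 500}

def run(lines: Iterable[str], sink: Callable[[dict], None]) -> None:
    """Wire the stages; the sink is a parameter so it can be swapped."""
    for event in enrich(parse(lines)):
        sink(event)

warehouse: list[dict] = []
lake: list[str] = []
raw = ['{"status": 500}', "not json", '{"status": 200}']

run(raw, warehouse.append)                      # warehouse sink
run(raw, lambda e: lake.append(json.dumps(e)))  # lake sink, earlier stages untouched
```

The malformed line ends up in `dead_letter` on each run while the valid events still reach the sink, and changing the sink requires no change to `parse` or `enrich`.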

Common Confusion / Misconception

"Pipeline = batch." It is not. Pipes can be in-memory iterators, files, queues, Kafka topics, or anything else with a one-way handoff. Unix pipes are streaming. Kafka pipelines are streaming. Airflow/DAG pipelines are batch. The style is defined by topology and composability, not batch vs stream.
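
One way to see that the pipe mechanism is separate from the style: the same filter can read from a list or from a queue. This is a small sketch in which `double`, `drain`, and the sentinel are hypothetical names, and `queue.Queue` stands in for a real message queue.

```python
import queue
from typing import Iterable, Iterator

def double(numbers: Iterable[int]) -> Iterator[int]:
    """A filter that does not care what kind of pipe feeds it."""
    for n in numbers:
        yield n * 2

_DONE = object()  # sentinel marking end of stream on the queue

def drain(q: queue.Queue) -> Iterator[int]:
    """Adapt a queue to an iterator so a filter can read from it."""
    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item

# Pipe 1: an in-memory list.
batch_result = list(double([1, 2, 3]))

# Pipe 2: a queue, filled here up front for simplicity.
q: queue.Queue = queue.Queue()
for n in (1, 2, 3):
    q.put(n)
q.put(_DONE)
stream_result = list(double(drain(q)))
```

Only the adapter around the pipe changes; the filter itself is identical in both runs.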

"If I chain functions in code, I have a pipeline." Only if filters are independent and pipes are explicit. A 500-line Python function with comments that say # Step 1 through # Step 7 is procedural, not pipeline. The test is replaceability: can you swap stage 4 for a different implementation without touching stages 3 or 5?
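
The replaceability test can be made concrete with a sketch like the following, where the pipeline is an explicit list of stages and one entry is swapped. All stage names and the `compose` helper are hypothetical.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator

Stage = Callable[[Iterable[str]], Iterator[str]]

def strip_blank(lines: Iterable[str]) -> Iterator[str]:
    return (line for line in lines if line.strip())

def to_upper(lines: Iterable[str]) -> Iterator[str]:
    return (line.upper() for line in lines)

def to_lower(lines: Iterable[str]) -> Iterator[str]:
    return (line.lower() for line in lines)

def bracket(lines: Iterable[str]) -> Iterator[str]:
    return (f"[{line}]" for line in lines)

def compose(stages: list[Stage]) -> Stage:
    """Chain stages left to right; each stage sees only the stream before it."""
    def pipeline(source: Iterable[str]) -> Iterator[str]:
        return reduce(lambda stream, stage: stage(stream), stages, source)
    return pipeline

data = ["Ada", "", "Bob"]
v1 = compose([strip_blank, to_upper, bracket])
v2 = compose([strip_blank, to_lower, bracket])  # only the middle stage is swapped

result_v1 = list(v1(data))
result_v2 = list(v2(data))
```

Because the stage list is data, swapping the middle stage touches neither `strip_blank` nor `bracket`. A long procedural function offers no equivalent seam.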

"Pipelines cannot fan out." They can, and the splitter/tester filter role exists precisely for that. What pipelines do not handle well is general graphs with cycles and back-pressure. Those belong to event-driven architecture (Cluster 3).

How To Use It

Concretely, when scoping a new pipeline:

Write the input contract (schema of the incoming data) and output contract (schema of the final result).
List stages left-to-right and label each role.
Pick a pipe mechanism per stage. Choices include in-memory iterators (small), files (batch), or queues/topics (streaming, multi-host).
For each stage, ensure it can be run in isolation with test fixtures. If a stage needs the state of another stage to work, it is not a filter.
Build observability per stage: throughput, error rate, latency. A pipeline's throughput is bounded by its slowest stage, so per-stage metrics are what reveal the bottleneck.
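
Per-stage observability can start as a thin wrapper around each filter. The sketch below assumes per-item functions and an in-process stats dictionary; `StageStats` and `metered` are hypothetical names, and a real system would more likely export these numbers to a metrics backend.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

@dataclass
class StageStats:
    name: str
    items: int = 0        # items processed successfully
    errors: int = 0       # items that raised and were skipped
    seconds: float = 0.0  # time spent inside this stage's function

def metered(name: str, fn: Callable, stats: dict) -> Callable[[Iterable], Iterator]:
    """Wrap a per-item function as a filter that records count, errors, and time."""
    stat = stats.setdefault(name, StageStats(name))

    def stage(items: Iterable) -> Iterator:
        for item in items:
            start = time.perf_counter()
            try:
                result = fn(item)
            except Exception:
                stat.errors += 1
                continue
            finally:
                # Runs on success and on error, before the continue takes effect.
                stat.seconds += time.perf_counter() - start
            stat.items += 1
            yield result

    return stage

stats: dict = {}
parse = metered("parse", int, stats)
double = metered("double", lambda n: n * 2, stats)
out = list(double(parse(["1", "x", "3"])))
```

The `yield` sits outside the timed block, so time spent in downstream stages is not charged to the upstream one. Each wrapped stage can also be run alone against fixtures, for example `list(parse(["7"]))`.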

Check Yourself

What is the difference between a pipeline and a sequence of function calls in a layered service?
Why is pipeline rated high on simplicity and low on scalability (single-instance), according to Richards and Ford?
Name one system you maintain that is secretly a pipeline. What would change if you made the pipeline shape explicit in the code?

Mini Drill or Application

Design a pipeline for one of the following in 15 minutes:

An email-processing system: inbox -> extract attachments -> virus scan -> extract text -> classify -> file into folder.
A log-enrichment job: raw log line -> parse -> add geoip -> join user metadata -> write to warehouse.
A photo upload pipeline: upload -> validate format -> resize to three sizes -> watermark -> push to CDN.

For each:

draw the pipeline (mermaid or ASCII)
label each filter as producer/transformer/tester/sink
name the pipe (memory iterator, file, queue)
write one reason a layered architecture would have been worse
