Pipeline Architecture: Pipe-and-Filter for Data Transforms
What This Concept Is
Pipeline architecture (historically pipe-and-filter) is one of the oldest patterns in software. It models a system as a sequence of stages where:
- Filters do one unit of work -- transform, enrich, split, aggregate.
- Pipes connect filters, carrying data one-way from one stage to the next.
[source] ==pipe==> [filter A] ==pipe==> [filter B] ==pipe==> [filter C] ==pipe==> [sink]
Filters are typically:
- self-contained (one responsibility, no shared state with siblings)
- stateless (the same input produces the same output)
- independent (each knows its input contract and output contract, nothing else)
There are four canonical filter roles:
| Role | Purpose | Example |
|---|---|---|
| Producer | Only emits data | read CSV file |
| Transformer | 1 in, 1 out (changed) | parse JSON, apply schema |
| Tester / Splitter | 1 in, multiple outputs (routing) | valid vs invalid rows |
| Consumer / Sink | Only receives | write to database |
The classic example is the Unix shell: cat access.log | grep '500' | awk '{print $7}' | sort | uniq -c | sort -rn. Each tool is a filter; | is a pipe.
Why It Matters Here
Pipeline is the right default when:
- the system's main job is transforming data, not responding to users
- stages are naturally sequential and independent
- the output of the process is the reason the process exists (not a database, a report or derived dataset)
This covers a large portion of real engineering work that gets miscategorized as "a service" or "a batch job":
- ETL, ELT, data-lake ingestion
- log processing, analytics enrichment
- image/video transcoding pipelines
- build systems (compile, link, package, sign)
- ML training data preparation
If you force these into layered, you either end up with a degenerate single-layer "service," or worse, a layered app that grows a shadow pipeline inside its business layer. Pipeline is not a "basic" style; it is an explicit style, and calling it out in an ADR prevents accidental layered disguise.
Concrete Example
ETL for daily analytics, drawn as a mermaid pipeline:
What makes this pipeline and not just a set of functions:
- each stage is independently runnable and testable
- adding a new enrichment stage does not change the parser
- a failure in one stage is local; messages sit in the pipe or dead-letter rather than crashing the whole process
- the topology is visible and can be composed (swap
warehousesink forlakesink without touching earlier stages)
Common Confusion / Misconception
"Pipeline = batch." It is not. Pipes can be in-memory iterators, files, queues, Kafka topics, or anything else with a one-way handoff. Unix pipes are streaming. Kafka pipelines are streaming. Airflow/DAG pipelines are batch. The style is defined by topology and composability, not batch vs stream.
"If I chain functions in code, I have a pipeline." Only if filters are independent and pipes are explicit. A 500-line Python function with comments that say # Step 1 through # Step 7 is procedural, not pipeline. The test is replaceability: can you swap stage 4 for a different implementation without touching stages 3 or 5?
"Pipelines cannot fan out." They can, and the splitter/tester filter role exists precisely for that. What pipelines do not handle well is general graphs with cycles and back-pressure. Those are event-driven architecture (Cluster 3).
How To Use It
Concretely, when scoping a new pipeline:
- Write the input contract (schema of the incoming data) and output contract (schema of the final result).
- List stages left-to-right and label each role.
- Pick a pipe mechanism per stage. Choices include in-memory iterators (small), files (batch), or queues/topics (streaming, multi-host).
- For each stage, ensure it can be run in isolation with test fixtures. If a stage needs the state of another stage to work, it is not a filter.
- Build observability per stage: throughput, error rate, latency. Pipelines fail at the slowest stage.
Check Yourself
- What is the difference between a pipeline and a sequence of function calls in a layered service?
- Why is pipeline rated high on simplicity and low on scalability (single-instance), according to Richards and Ford?
- Name one system you maintain that is secretly a pipeline. What would change if you made the pipeline shape explicit in the code?
Mini Drill or Application
Design a pipeline for one of the following in 15 minutes:
- An email-processing system: inbox -> extract attachments -> virus scan -> extract text -> classify -> file into folder.
- A log-enrichment job: raw log line -> parse -> add geoip -> join user metadata -> write to warehouse.
- A photo upload pipeline: upload -> validate format -> resize to three sizes -> watermark -> push to CDN.
For each:
- draw the pipeline (mermaid or ASCII)
- label each filter as producer/transformer/tester/sink
- name the pipe (memory iterator, file, queue)
- write one reason a layered architecture would have been worse