Disclaimer: Architecture patterns, scalability characteristics, and design trade-offs mentioned in this article are accurate as of February 2025. Architectural decisions depend heavily on data volume, latency requirements, team size, and budget constraints. What works for one organization may not work for another. Always prototype architectures with representative workloads before committing to production.
The startup’s first data pipeline is a Python script running on someone’s laptop. It reads CSVs from email attachments, transforms them in memory, and writes results to a shared Excel file.
Six months later: 100x more data, pipelines running 24/7, multiple data sources, SLA requirements, audit logs, and a team of five engineers maintaining it all.
The laptop script doesn’t scale. Not because Python is slow or Excel is bad. Because the architecture — the fundamental design pattern — wasn’t built for production reality.
This is the architecture gap. Production data pipelines face challenges toy examples ignore: failures (networks, APIs, databases), scale (gigabytes become terabytes), compliance (audit logs, data lineage), cost (every query charges money), and operations (monitoring, alerting, debugging).
Good architecture patterns solve these systematically. They’re not code — they’re structural approaches to common problems. Learn the patterns, recognize when to apply them, and avoid reinventing solutions to solved problems.
What Makes Good Data Architecture
Before diving into patterns, here’s what separates production-ready architecture from prototypes:
Idempotent — Running the same pipeline twice produces the same result. No duplicate rows, no double-counting.
Recoverable — Failures don’t require manual intervention. Pipelines retry automatically and resume from checkpoints.
Observable — You know what’s running, what failed, and why. Logs, metrics, and lineage are built-in.
Scalable — Adding data doesn’t require rewriting code. Architecture handles 10x or 100x growth.
Cost-aware — Every query, every storage byte, every compute minute costs money. Good architecture optimizes for cost.
Testable — You can validate pipeline logic without running it against full production data. Unit tests, integration tests, data quality tests.
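To make the first two properties concrete, here is a minimal sketch of an idempotent, recoverable daily load: each run overwrites exactly one date partition (so reruns replace rather than duplicate), and a checkpoint marker lets a failed backfill resume where it stopped. The paths and helper functions are illustrative, not from any specific library.
import pandas as pd

def load_partition(run_date: str):
    """Rerunning for the same run_date overwrites that partition -- no duplicates."""
    df = extract_for_date(run_date)    # assumed extraction helper
    df = transform(df)                 # assumed transform helper
    # Idempotent: each run_date maps to exactly one output path, overwritten on rerun
    df.to_parquet(f"s3://bucket/processed/date={run_date}/data.parquet")
    # Recoverable: record a checkpoint so an orchestrator can skip completed dates
    mark_complete(run_date)            # e.g. write a _SUCCESS marker

def backfill(dates):
    for d in dates:
        if not is_complete(d):         # resume from the last checkpoint
            load_partition(d)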
The Patterns (15 Architectural Approaches)
Batch Processing Patterns
1. ETL (Extract-Transform-Load)
The classic pattern: Extract from sources, transform, load to warehouse.
Architecture:
Source Systems → Extract  → Transform → Load  → Data Warehouse
     (APIs)      (Python)    (Spark)    (COPY)    (Snowflake)
Implementation:
# Airflow DAG - ETL Pattern
from datetime import datetime

import pandas as pd
import requests

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    """Extract from source APIs"""
    data = []
    for api_url in SOURCE_APIS:  # SOURCE_APIS: list of endpoints defined in config
        response = requests.get(api_url)
        data.extend(response.json())
    # Store raw data
    pd.DataFrame(data).to_parquet('s3://bucket/raw/data.parquet')

def transform():
    """Transform raw data"""
    df = pd.read_parquet('s3://bucket/raw/data.parquet')
    # Clean, validate, enrich
    df['created_date'] = pd.to_datetime(df['timestamp']).dt.date
    df = df[df['status'].isin(['active', 'pending'])]
    # Store transformed
    df.to_parquet('s3://bucket/processed/data.parquet')

def load():
    """Load to warehouse"""
    # snowflake_conn: connection created elsewhere (e.g. snowflake.connector.connect)
    snowflake_conn.cursor().execute("""
        COPY INTO analytics.fact_orders
        FROM 's3://bucket/processed/data.parquet'
        FILE_FORMAT = (TYPE = PARQUET)
    """)

with DAG(
    'etl_pipeline',
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False
) as dag:
    extract_task = PythonOperator(task_id='extract', python_callable=extract)
    transform_task = PythonOperator(task_id='transform', python_callable=transform)
    load_task = PythonOperator(task_id='load', python_callable=load)

    extract_task >> transform_task >> load_task
When to use:
- Batch processing (hourly, daily)
- Full control over transformations
- Transform logic is complex
Trade-offs:
- ✓ Simple, well-understood pattern
- ✓ Full transformation control
- ✗ Not real-time
- ✗ Resource-intensive (loads all data into memory)
2. ELT (Extract-Load-Transform)
Modern pattern: Load raw data first, transform in warehouse.
Architecture:
Source Systems → Extract  →   Load    → Transform →  Analytics
     (SaaS)     (Fivetran)  (Snowflake)   (dbt)     (Dashboards)
Implementation:
# Fivetran: Extract + Load (automated; illustrative connector config)
connector:
  name: salesforce
  destination: snowflake
  sync_frequency: hourly

-- dbt: Transform (in-warehouse)
-- models/staging/stg_salesforce_accounts.sql
SELECT
    Id AS account_id,
    Name AS account_name,
    Industry AS industry,
    AnnualRevenue AS annual_revenue
FROM {{ source('salesforce', 'Account') }}

-- models/marts/fct_revenue.sql
SELECT
    account_id,
    DATE_TRUNC('month', opportunity_close_date) AS month,
    SUM(amount) AS monthly_revenue
FROM {{ ref('stg_salesforce_opportunities') }}
GROUP BY 1, 2
When to use:
- Cloud data warehouses (Snowflake, BigQuery)
- SaaS data sources
- SQL-comfortable team
Trade-offs:
- ✓ Faster time to raw data
- ✓ Warehouse handles scale
- ✓ Version-controlled transformations (dbt)
- ✗ Warehouse compute costs
- ✗ Limited to SQL transformations
3. Micro-Batch Processing
Hybrid pattern: Process small batches frequently instead of one large batch.
Architecture:
Source  →  Queue  →  Micro-Batch Processor           →  Storage
(Kafka)    (Topic)   (Spark Structured Streaming)       (Delta Lake)
Implementation:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro-batch").getOrCreate()

# Read from Kafka in micro-batches
df = (spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load())

# Process each micro-batch
query = (df
         .selectExpr("CAST(value AS STRING)")
         .writeStream
         .format("delta")
         .option("checkpointLocation", "/checkpoints/events")
         .trigger(processingTime='5 minutes')  # Micro-batch every 5 min
         .start("/data/events"))
When to use:
- Near real-time requirements (5–15 min latency acceptable)
- Existing batch code that needs to run more frequently
- Balance between batch and streaming
Trade-offs:
- ✓ Simpler than true streaming
- ✓ Lower latency than daily batches
- ✗ Not truly real-time
- ✗ More complex than simple batch
Streaming Patterns
4. Lambda Architecture
Dual path: Batch layer for accuracy, speed layer for real-time.
Architecture:
                 ┌─ Batch Layer (Spark) ──→ Batch Views ────────┐
Data Sources ────┤                                              ↓
                 └─ Speed Layer (Flink) ──→ Real-time Views ──→ Serving Layer (combines both)
Implementation:
# Batch Layer: Accurate, complete, slow
def batch_layer():
    """Process all historical data daily"""
    df = spark.read.parquet('s3://data-lake/events/')
    daily_stats = df.groupBy('user_id', 'date').agg({
        'event_count': 'sum',
        'revenue': 'sum'
    })
    daily_stats.write.mode('overwrite').parquet('s3://views/batch/')

# Speed Layer: Approximate, recent, fast
def speed_layer():
    """Process recent data in real-time (Flink sketch; source, sink, and aggregator are configured elsewhere)"""
    stream = flink_env.add_source(kafka_source)
    stream \
        .key_by(lambda x: x['user_id']) \
        .window(TumblingEventTimeWindows.of(Time.minutes(5))) \
        .aggregate(EventAggregator()) \
        .add_sink(redis_sink)

# Serving Layer: Merge batch + speed
def get_user_stats(user_id):
    # Get batch view (complete, older data)
    batch = read_from_parquet(f's3://views/batch/user={user_id}')
    # Get speed view (recent data)
    speed = redis.get(f'user:{user_id}:recent')
    # Merge the two views
    return {
        'total_events': batch['event_count'] + speed['event_count'],
        'total_revenue': batch['revenue'] + speed['revenue']
    }
When to use:
- Need both real-time and accurate batch processing
- Can’t achieve accuracy in real-time
- Have resources for two parallel systems
Trade-offs:
- ✓ Best of both worlds (real-time + accuracy)
- ✗ Complex (maintain two systems)
- ✗ Expensive (duplicate infrastructure)
- ✗ Eventual consistency issues
5. Kappa Architecture
Single path: Everything is streaming. Batch is just replaying the stream.
Architecture:
Data Sources → Kafka → Stream Processing → Storage
                 ↓
        (Replay for batch)
Implementation:
# Single stream processor handles both real-time and batch
# (sketch: the deserialization schema, window aggregator, and sink are illustrative placeholders)
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer

env = StreamExecutionEnvironment.get_execution_environment()

# Kafka source (can replay for batch)
kafka_source = FlinkKafkaConsumer(
    topics=['events'],
    deserialization_schema=JsonSchema(),   # placeholder for a JSON deserialization schema
    properties={'bootstrap.servers': 'localhost:9092'}
)

# Stream processing (works for real-time and batch replay)
events = env.add_source(kafka_source)
stats = events \
    .key_by(lambda x: x['user_id']) \
    .window(TumblingEventTimeWindows.of(Time.hours(1))) \
    .aggregate(StatsAggregator())
stats.add_sink(output_sink)
# For batch: seek to beginning, process all, stop
# For real-time: process continuously
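The replay in that last comment comes down to where the consumer starts reading. A minimal sketch using standard Kafka consumer properties (Flink's Kafka sources also expose explicit start-position settings); the topic and group ids are illustrative:
# Same processing job, two start positions
replay_properties = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'events-replay',       # fresh group id, so no committed offsets exist
    'auto.offset.reset': 'earliest'    # read the topic from the beginning = "batch"
}
realtime_properties = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'events-live',
    'auto.offset.reset': 'latest'      # only new records = real-time
}
# e.g. FlinkKafkaConsumer(topics=['events'], deserialization_schema=..., properties=replay_properties)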
When to use:
- Everything can be modeled as stream
- Want to avoid duplicate code
- Kafka is central to architecture
Trade-offs:
- ✓ Simpler than Lambda (one codebase)
- ✓ Kafka replay enables “batch”
- ✗ Everything must be streamable
- ✗ Kafka storage costs for replay
Incremental Processing Patterns
6. Delta Processing (Process Only Changes)
Pattern: Track what changed, process only deltas.
Implementation:
-- dbt incremental model
{{ config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge'
) }}
SELECT
order_id,
user_id,
order_date,
amount,
status,
updated_at
FROM {{ source('raw', 'orders') }}
{% if is_incremental() %}
-- Only process changed/new records
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
With change data capture (CDC):
# Debezium CDC captures database changes
{
    "before": {"order_id": 123, "status": "pending", "amount": 100},
    "after": {"order_id": 123, "status": "completed", "amount": 100},
    "op": "u",  # update
    "ts_ms": 1707580800000
}

# Process only changes
def process_cdc_event(event):
    if event['op'] == 'c':    # create
        insert_to_warehouse(event['after'])
    elif event['op'] == 'u':  # update
        update_warehouse(event['after'])
    elif event['op'] == 'd':  # delete
        delete_from_warehouse(event['before'])
When to use:
- Large datasets where full refresh is expensive
- Source system provides change tracking
- Append-only or update patterns
Trade-offs:
- ✓ Much faster than full refresh
- ✓ Lower cost (less data processed)
- ✗ Requires change tracking
- ✗ More complex than full refresh
7. Slowly Changing Dimensions (SCD Type 2)
Pattern: Track historical changes to dimension data.
Implementation:
-- dbt snapshot (SCD Type 2)
{% snapshot orders_snapshot %}
{{
    config(
      target_schema='snapshots',
      unique_key='order_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}
SELECT * FROM {{ source('raw', 'orders') }}
{% endsnapshot %}

-- Creates table with history tracking:
-- order_id | status  | amount | dbt_valid_from | dbt_valid_to
-- 123      | pending | 100    | 2024-01-01     | 2024-01-05
-- 123      | paid    | 100    | 2024-01-05     | NULL (current)
Manual implementation:
def scd_type2_update(new_data, dimension_table):
    """Update dimension with history tracking (insert/update/today helpers are pseudocode)"""
    for record in new_data:
        existing = dimension_table.get(record['id'])
        if not existing:
            # New record
            insert({
                **record,
                'valid_from': today(),
                'valid_to': None,
                'is_current': True
            })
        elif has_changes(existing, record):
            # Record changed - close old, insert new
            update(existing['id'], {
                'valid_to': today(),
                'is_current': False
            })
            insert({
                **record,
                'valid_from': today(),
                'valid_to': None,
                'is_current': True
            })
When to use:
- Need historical dimension values
- Audit requirements
- Time-based analysis (“what was status on date X?”)
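As a concrete version of that last question, a point-in-time lookup against an SCD Type 2 table is just a range filter on the validity columns. A minimal pandas sketch, assuming the snapshot has been loaded into a DataFrame with the dbt column names shown above and the validity columns parsed as timestamps:
import pandas as pd

def status_as_of(history: pd.DataFrame, order_id: int, as_of: str):
    """Return the status that was current for order_id on the given date."""
    as_of = pd.Timestamp(as_of)
    rows = history[
        (history['order_id'] == order_id)
        & (history['dbt_valid_from'] <= as_of)
        # dbt_valid_to is NULL for the current row, so treat NULL as "still valid"
        & (history['dbt_valid_to'].isna() | (history['dbt_valid_to'] > as_of))
    ]
    return rows['status'].iloc[0] if not rows.empty else None

# e.g. status_as_of(snapshot_df, 123, '2024-01-03')  ->  'pending'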
Data Quality Patterns
8. Schema on Write vs Schema on Read
Schema on Write: Enforce schema when writing (traditional databases).
Schema on Read: Store raw, apply schema when reading (data lakes).
Implementation:
-- Schema on Write (strict)
CREATE TABLE orders (
    order_id BIGINT NOT NULL,
    amount DECIMAL(10,2) NOT NULL CHECK (amount >= 0),
    status VARCHAR(20) CHECK (status IN ('pending', 'completed')),
    PRIMARY KEY (order_id)
);

-- Bad data rejected on INSERT
INSERT INTO orders VALUES (123, -100, 'invalid'); -- ERROR

# Schema on Read (flexible)
# Write raw JSON to S3
s3.put_object(
    Bucket='data-lake',
    Key='orders/2024-01-01.json',
    Body=json.dumps(raw_data)  # Any structure accepted
)

# Apply schema on read
from pyspark.sql.functions import col

df = spark.read.json('s3://data-lake/orders/*.json')
df_cleaned = df \
    .filter(col('amount') >= 0) \
    .filter(col('status').isin(['pending', 'completed']))
When to use:
- Schema on Write: Transactional systems, strong guarantees
- Schema on Read: Exploratory analysis, schema evolution
9. Dead Letter Queue Pattern
Pattern: Route failed records to separate queue for manual inspection.
Implementation:
import json
import traceback
import uuid
from datetime import datetime

def process_with_dlq(record):
    """Process record with dead letter queue for failures"""
    try:
        # Validate
        if not is_valid(record):
            send_to_dlq(record, reason="validation_failed")
            return
        # Transform
        transformed = transform(record)
        # Load
        load_to_warehouse(transformed)
    except Exception as e:
        # Send to DLQ for manual inspection
        send_to_dlq(record, reason=str(e), stacktrace=traceback.format_exc())
        # Alert if DLQ growing
        if dlq_size() > THRESHOLD:
            alert_team("DLQ size exceeded threshold")

def send_to_dlq(record, reason, stacktrace=None):
    """Send failed record to dead letter queue"""
    dlq_record = {
        'original_record': record,
        'failure_reason': reason,
        'stacktrace': stacktrace,
        'timestamp': str(datetime.now()),
        'pipeline': 'etl_pipeline'
    }
    s3.put_object(
        Bucket='dlq-bucket',
        Key=f"dlq/{datetime.now().date()}/{uuid.uuid4()}.json",
        Body=json.dumps(dlq_record)
    )
When to use:
- Production pipelines that can’t fail on bad data
- Need to inspect failures manually
- Want to continue processing good records
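The other half of the pattern is draining the queue: after the underlying bug is fixed, dead-lettered records are replayed through the same processing path, and anything that fails again simply lands back in the DLQ. A minimal sketch using the same illustrative bucket layout and s3 client as send_to_dlq above:
import json

def reprocess_dlq(date_prefix):
    """Replay dead-lettered records for one day through the normal pipeline."""
    listing = s3.list_objects_v2(Bucket='dlq-bucket', Prefix=f'dlq/{date_prefix}/')
    for obj in listing.get('Contents', []):
        body = s3.get_object(Bucket='dlq-bucket', Key=obj['Key'])['Body'].read()
        dlq_record = json.loads(body)
        # Re-run the original record through the same entry point
        process_with_dlq(dlq_record['original_record'])

# e.g. reprocess_dlq('2024-01-15')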
Orchestration Patterns
10. Event-Driven Architecture
Pattern: Pipelines trigger based on events, not schedules.
Implementation:
# S3 event triggers Lambda
# AWS Lambda function triggered on S3 upload
import boto3

def lambda_handler(event, context):
    """Triggered when file lands in S3"""
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        # Trigger processing
        if key.startswith('raw/orders/'):
            trigger_order_processing(bucket, key)
        elif key.startswith('raw/users/'):
            trigger_user_processing(bucket, key)

# Airflow with sensor
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

s3_sensor = S3KeySensor(
    task_id='wait_for_file',
    bucket_name='data-bucket',
    bucket_key='raw/orders/{{ ds }}.csv',
    poke_interval=300  # Check every 5 minutes
)
process_task = PythonOperator(
    task_id='process',
    python_callable=process_orders
)
s3_sensor >> process_task
When to use:
- Data arrives irregularly
- Want to process immediately when data available
- Reduce unnecessary polling
11. Fan-Out/Fan-In Pattern
Pattern: Parallel processing then aggregation.
Implementation:
# Airflow dynamic task generation
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator

def fan_out_tasks():
    """Generate one task per partition"""
    partitions = get_partitions()  # e.g. ['2024-01-01', '2024-01-02', ...]
    tasks = []
    for partition in partitions:
        task = PythonOperator(
            task_id=f'process_{partition}',
            python_callable=process_partition,
            op_kwargs={'partition': partition}
        )
        tasks.append(task)
    return tasks

def fan_in_task(**context):
    """Combine results from all partitions"""
    results = []
    for partition in get_partitions():
        task_id = f'process_{partition}'
        result = context['ti'].xcom_pull(task_ids=task_id)
        results.append(result)
    # Combine all results
    final_result = combine(results)
    return final_result

with DAG('fan_out_fan_in', start_date=datetime(2024, 1, 1), catchup=False) as dag:
    start = DummyOperator(task_id='start')
    # Fan out
    process_tasks = fan_out_tasks()
    # Fan in
    combine_task = PythonOperator(
        task_id='combine',
        python_callable=fan_in_task
    )
    start >> process_tasks >> combine_task
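One limitation of the approach above is that the partition list must be known when the DAG file is parsed. Airflow 2.3 introduced dynamic task mapping, which expands at runtime instead; a minimal sketch (Airflow 2.4+ shown, reusing the same illustrative get_partitions / process_partition / combine helpers):
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def fan_out_fan_in_mapped():

    @task
    def list_partitions():
        return get_partitions()              # decided at runtime, not at parse time

    @task
    def process(partition):
        return process_partition(partition)  # one mapped task instance per partition

    @task
    def combine_all(results):
        return combine(list(results))        # fan in over all mapped results

    combine_all(process.expand(partition=list_partitions()))

fan_out_fan_in_mapped()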
Cost Optimization Patterns
12. Partitioning Strategy
Pattern: Organize data to minimize scanning.
Implementation:
# Write partitioned data
df.write \
    .partitionBy('year', 'month', 'day') \
    .parquet('s3://bucket/events/')

# File structure:
# events/year=2024/month=01/day=01/data.parquet
# events/year=2024/month=01/day=02/data.parquet

-- Query only relevant partitions
SELECT * FROM events
WHERE year = 2024 AND month = 1 AND day = 15

-- Scans only:  events/year=2024/month=01/day=15/
-- Instead of:  entire events/ directory
-- Cost: $0.05 vs $50 (1000x cheaper)
13. Data Tiering (Hot/Warm/Cold)
Pattern: Move old data to cheaper storage.
Implementation:
# S3 lifecycle policies
s3_lifecycle = {
    'Rules': [{
        'Id': 'tier-old-data',
        'Status': 'Enabled',
        'Filter': {'Prefix': ''},  # apply to all objects in the bucket
        'Transitions': [
            {
                'Days': 30,
                'StorageClass': 'STANDARD_IA'  # Infrequent Access
            },
            {
                'Days': 90,
                'StorageClass': 'GLACIER'      # Archive
            }
        ]
    }]
}
s3.put_bucket_lifecycle_configuration(
    Bucket='data-lake',
    LifecycleConfiguration=s3_lifecycle
)

-- Snowflake time travel + auto-clustering
ALTER TABLE events SET DATA_RETENTION_TIME_IN_DAYS = 7;
ALTER TABLE events CLUSTER BY (event_date);
Modern Patterns
14. Medallion Architecture (Bronze/Silver/Gold)
Pattern: Layer data by refinement level.
Architecture:
Bronze (Raw)       →  Silver (Cleaned)  →  Gold (Business)
Raw JSON/CSV          Validated            Aggregated
No transformation     Deduplicated         Joined
All data              Typed columns        Analytics-ready
Implementation:
from pyspark.sql.functions import col, count, sum, to_timestamp

# Bronze: Raw ingestion
bronze_df = spark.read.json('s3://landing/events/')
bronze_df.write.format('delta').save('s3://bronze/events/')

# Silver: Clean and validate
silver_df = (spark.read.format('delta').load('s3://bronze/events/')
    .dropDuplicates(['event_id'])
    .filter(col('event_date').isNotNull())
    .withColumn('event_timestamp', to_timestamp('event_time')))
silver_df.write.format('delta').save('s3://silver/events/')

# Gold: Business logic (users is a dimension DataFrame loaded elsewhere)
gold_df = (spark.read.format('delta').load('s3://silver/events/')
    .join(users, 'user_id')
    .groupBy('date', 'user_segment')
    .agg(
        count('event_id').alias('event_count'),
        sum('revenue').alias('total_revenue')
    ))
gold_df.write.format('delta').save('s3://gold/daily_metrics/')
When to use:
- Data lakehouse architecture
- Multiple consumers with different needs
- Want clear data quality zones
15. Reverse ETL
Pattern: Push warehouse data back to operational systems.
Implementation:
# Sync a warehouse segment to Salesforce with a reverse ETL tool
# (Census-style pseudocode; the real product is configured through its UI/API rather than this SDK)
census = Census(api_key='...')

# Define sync
sync = census.create_sync(
    source={
        'connection': 'snowflake',
        'query': '''
            SELECT
                email,
                total_lifetime_value,
                last_purchase_date,
                customer_segment
            FROM analytics.customer_360
            WHERE total_lifetime_value > 10000
        '''
    },
    destination={
        'connection': 'salesforce',
        'object': 'Contact',
        'mappings': {
            'email': 'Email',
            'total_lifetime_value': 'Lifetime_Value__c',
            'customer_segment': 'Segment__c'
        }
    },
    schedule='daily'
)
When to use:
- Activate warehouse insights in operational systems
- Personalization, marketing automation
- Close the loop from analytics to operations
Choosing the Right Pattern
Decision framework:
By latency requirement:
- < 1 second: True streaming (Flink, Kafka Streams)
- < 15 minutes: Micro-batch (Spark Streaming)
- < 1 hour: Frequent batch
- Daily+: Traditional batch (ETL/ELT)
By data volume:
- < 100 GB: Single-machine processing
- 100 GB — 10 TB: Distributed batch (Spark)
- > 10 TB: Partitioned lakehouse (Delta, Iceberg)
By team skills:
- SQL-focused: ELT pattern (dbt)
- Python-focused: ETL pattern (Airflow + pandas)
- Mixed: Modern data stack (Fivetran + dbt)
By cost sensitivity:
- Budget-constrained: ELT (warehouse compute is predictable)
- Scale-constrained: Data lake (cheaper storage)
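These axes can be captured as a first-pass checklist. A toy sketch that encodes only the latency and volume rules above; the function name and thresholds are illustrative, and real decisions also weigh team skills, cost sensitivity, and existing infrastructure:
def suggest_pattern(latency_seconds: float, volume_gb: float) -> str:
    """First-pass suggestion from the latency/volume rules above -- not a substitute for prototyping."""
    if latency_seconds < 1:
        return 'true streaming (Flink, Kafka Streams)'
    if latency_seconds < 15 * 60:
        return 'micro-batch (Spark Structured Streaming)'
    if volume_gb < 100:
        return 'single-machine batch (pandas + cron/Airflow)'
    if volume_gb < 10_000:
        return 'distributed batch ETL/ELT (Spark or warehouse + dbt)'
    return 'partitioned lakehouse (Delta, Iceberg) with incremental processing'

# e.g. suggest_pattern(latency_seconds=3600, volume_gb=500)
#      -> 'distributed batch ETL/ELT (Spark or warehouse + dbt)'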
The Bottom Line
Architecture patterns are templates, not rules.
Start simple:
- Begin with ELT if using cloud warehouse
- Add incremental processing when full refresh gets slow
- Implement data quality checks early
- Partition as data grows
- Add streaming only when batch latency is insufficient
Common mistakes:
- Over-engineering early (Lambda architecture for 100 GB data)
- Under-engineering late (daily batch when real-time needed)
- Ignoring cost (full table scans on petabyte data)
- No monitoring (can’t debug what you can’t see)
Best practices:
- Make pipelines idempotent (reruns produce same result)
- Checkpoint progress (can resume from failure)
- Monitor everything (data volume, latency, cost, quality)
- Version control (infrastructure as code)
- Test thoroughly (unit, integration, data quality)
The best architecture is the simplest one that meets requirements. The second-best is the one you can actually maintain.
Good architecture is invisible when it works and obvious when it doesn’t. Start simple, scale when needed.