Disclaimer: Architecture patterns, scalability characteristics, and design trade-offs mentioned in this article are accurate as of February 2025. Architectural decisions depend heavily on data volume, latency requirements, team size, and budget constraints. What works for one organization may not work for another. Always prototype architectures with representative workloads before committing to production.
The startup’s first data pipeline is a Python script running on someone’s laptop. It reads CSVs from email attachments, transforms them in memory, and writes results to a shared Excel file.
Six months later: 100x more data, pipelines running 24/7, multiple data sources, SLA requirements, audit logs, and a team of five engineers maintaining it all.
The laptop script doesn’t scale. Not because Python is slow or Excel is bad. Because the architecture — the fundamental design pattern — wasn’t built for production reality.
This is the architecture gap. Production data pipelines face challenges toy examples ignore: failures (networks, APIs, databases), scale (gigabytes become terabytes), compliance (audit logs, data lineage), cost (every query charges money), and operations (monitoring, alerting, debugging).
Good architecture patterns solve these systematically. They’re not code — they’re structural approaches to common problems. Learn the patterns, recognize when to apply them, and avoid reinventing solutions to solved problems.
What Makes Good Data Architecture
Before diving into patterns, here’s what separates production-ready architecture from prototypes:
Idempotent — Running the same pipeline twice produces the same result. No duplicate rows, no double-counting.
Recoverable — Failures don’t require manual intervention. Pipelines retry automatically and resume from checkpoints.
Observable — You know what’s running, what failed, and why. Logs, metrics, and lineage are built-in.
Scalable — Adding data doesn’t require rewriting code. Architecture handles 10x or 100x growth.
Cost-aware — Every query, every storage byte, every compute minute costs money. Good architecture optimizes for cost.
Testable — You can validate pipeline logic without running it against full production data. Unit tests, integration tests, data quality tests.
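To make the first two properties concrete, here is a minimal sketch of an idempotent, recoverable daily load: each run overwrites exactly one date partition (so reruns replace rather than duplicate), and a checkpoint marker lets a failed backfill resume where it stopped. The paths and helper functions are illustrative, not from any specific library.
import pandas as pd

def load_partition(run_date: str):
    """Rerunning for the same run_date overwrites that partition -- no duplicates."""
    df = extract_for_date(run_date)    # assumed extraction helper
    df = transform(df)                 # assumed transform helper
    # Idempotent: each run_date maps to exactly one output path, overwritten on rerun
    df.to_parquet(f"s3://bucket/processed/date={run_date}/data.parquet")
    # Recoverable: record a checkpoint so an orchestrator can skip completed dates
    mark_complete(run_date)            # e.g. write a _SUCCESS marker

def backfill(dates):
    for d in dates:
        if not is_complete(d):         # resume from the last checkpoint
            load_partition(d)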
The Patterns (15 Architectural Approaches)
Batch Processing Patterns
1. ETL (Extract-Transform-Load)
The classic pattern: Extract from sources, transform, load to warehouse.
Architecture:
Source Systems → Extract  → Transform → Load  → Data Warehouse
     (APIs)      (Python)    (Spark)    (COPY)    (Snowflake)
Implementation:
# Airflow DAG - ETL Pattern
from datetime import datetime

import pandas as pd
import requests

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    """Extract from source APIs"""
    data = []
    for api_url in SOURCE_APIS:  # SOURCE_APIS: list of endpoints defined in config
        response = requests.get(api_url)
        data.extend(response.json())
    # Store raw data
    pd.DataFrame(data).to_parquet('s3://bucket/raw/data.parquet')

def transform():
    """Transform raw data"""
    df = pd.read_parquet('s3://bucket/raw/data.parquet')
    # Clean, validate, enrich
    df['created_date'] = pd.to_datetime(df['timestamp']).dt.date
    df = df[df['status'].isin(['active', 'pending'])]
    # Store transformed
    df.to_parquet('s3://bucket/processed/data.parquet')

def load():
    """Load to warehouse"""
    # snowflake_conn: connection created elsewhere (e.g. snowflake.connector.connect)
    snowflake_conn.cursor().execute("""
        COPY INTO analytics.fact_orders
        FROM 's3://bucket/processed/data.parquet'
        FILE_FORMAT = (TYPE = PARQUET)
    """)

with DAG(
    'etl_pipeline',
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False
) as dag:
    extract_task = PythonOperator(task_id='extract', python_callable=extract)
    transform_task = PythonOperator(task_id='transform', python_callable=transform)
    load_task = PythonOperator(task_id='load', python_callable=load)

    extract_task >> transform_task >> load_task
When to use:
- Batch processing (hourly, daily)
- Full control over transformations
- Transform logic is complex
Trade-offs:
- ✓ Simple, well-understood pattern
- ✓ Full transformation control
- ✗ Not real-time
- ✗ Resource-intensive (loads all data into memory)
2. ELT (Extract-Load-Transform)
Modern pattern: Load raw data first, transform in warehouse.
Architecture:
Source Systems → Extract  →   Load    → Transform →  Analytics
     (SaaS)     (Fivetran)  (Snowflake)   (dbt)     (Dashboards)
Implementation:
# Fivetran: Extract + Load (automated; illustrative connector config)
connector:
  name: salesforce
  destination: snowflake
  sync_frequency: hourly

-- dbt: Transform (in-warehouse)
-- models/staging/stg_salesforce_accounts.sql
SELECT
    Id AS account_id,
    Name AS account_name,
    Industry AS industry,
    AnnualRevenue AS annual_revenue
FROM {{ source('salesforce', 'Account') }}

-- models/marts/fct_revenue.sql
SELECT
    account_id,
    DATE_TRUNC('month', opportunity_close_date) AS month,
    SUM(amount) AS monthly_revenue
FROM {{ ref('stg_salesforce_opportunities') }}
GROUP BY 1, 2
When to use:
- Cloud data warehouses (Snowflake, BigQuery)
- SaaS data sources
- SQL-comfortable team
Trade-offs:
- ✓ Faster time to raw data
- ✓ Warehouse handles scale
- ✓ Version-controlled transformations (dbt)
- ✗ Warehouse compute costs
- ✗ Limited to SQL transformations
3. Micro-Batch Processing
Hybrid pattern: Process small batches frequently instead of one large batch.
Architecture:
Source  →  Queue  →  Micro-Batch Processor           →  Storage
(Kafka)    (Topic)   (Spark Structured Streaming)       (Delta Lake)
Implementation:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro-batch").getOrCreate()

# Read from Kafka in micro-batches
df = (spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load())

# Process each micro-batch
query = (df
         .selectExpr("CAST(value AS STRING)")
         .writeStream
         .format("delta")
         .option("checkpointLocation", "/checkpoints/events")
         .trigger(processingTime='5 minutes')  # Micro-batch every 5 min
         .start("/data/events"))
When to use:
- Near real-time requirements (5–15 min latency acceptable)
- Existing batch code that needs to run more frequently
- Balance between batch and streaming
Trade-offs:
- ✓ Simpler than true streaming
- ✓ Lower latency than daily batches
- ✗ Not truly real-time
- ✗ More complex than simple batch
Streaming Patterns
4. Lambda Architecture
Dual path: Batch layer for accuracy, speed layer for real-time.
Architecture:
                 ┌─ Batch Layer (Spark) ──→ Batch Views ────────┐
Data Sources ────┤                                              ↓
                 └─ Speed Layer (Flink) ──→ Real-time Views ──→ Serving Layer (combines both)
Implementation:
# Batch Layer: Accurate, complete, slow
def batch_layer():
    """Process all historical data daily"""
    df = spark.read.parquet('s3://data-lake/events/')
    daily_stats = df.groupBy('user_id', 'date').agg({
        'event_count': 'sum',
        'revenue': 'sum'
    })
    daily_stats.write.mode('overwrite').parquet('s3://views/batch/')

# Speed Layer: Approximate, recent, fast
def speed_layer():
    """Process recent data in real-time (Flink sketch; source, sink, and aggregator are configured elsewhere)"""
    stream = flink_env.add_source(kafka_source)
    stream \
        .key_by(lambda x: x['user_id']) \
        .window(TumblingEventTimeWindows.of(Time.minutes(5))) \
        .aggregate(EventAggregator()) \
        .add_sink(redis_sink)

# Serving Layer: Merge batch + speed
def get_user_stats(user_id):
    # Get batch view (complete, older data)
    batch = read_from_parquet(f's3://views/batch/user={user_id}')
    # Get speed view (recent data)
    speed = redis.get(f'user:{user_id}:recent')
    # Merge the two views
    return {
        'total_events': batch['event_count'] + speed['event_count'],
        'total_revenue': batch['revenue'] + speed['revenue']
    }
When to use:
- Need both real-time and accurate batch processing
- Can’t achieve accuracy in real-time
- Have resources for two parallel systems
Trade-offs:
- ✓ Best of both worlds (real-time + accuracy)
- ✗ Complex (maintain two systems)
- ✗ Expensive (duplicate infrastructure)
- ✗ Eventual consistency issues
5. Kappa Architecture
Single path: Everything is streaming. Batch is just replaying the stream.
Architecture:
Data Sources → Kafka → Stream Processing → Storage
                 ↓
        (Replay for batch)
Implementation:
# Single stream processor handles both real-time and batch
# (sketch: the deserialization schema, window aggregator, and sink are illustrative placeholders)
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer

env = StreamExecutionEnvironment.get_execution_environment()

# Kafka source (can replay for batch)
kafka_source = FlinkKafkaConsumer(
    topics=['events'],
    deserialization_schema=JsonSchema(),   # placeholder for a JSON deserialization schema
    properties={'bootstrap.servers': 'localhost:9092'}
)

# Stream processing (works for real-time and batch replay)
events = env.add_source(kafka_source)
stats = events \
    .key_by(lambda x: x['user_id']) \
    .window(TumblingEventTimeWindows.of(Time.hours(1))) \
    .aggregate(StatsAggregator())
stats.add_sink(output_sink)
# For batch: seek to beginning, process all, stop
# For real-time: process continuously
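The replay in that last comment comes down to where the consumer starts reading. A minimal sketch using standard Kafka consumer properties (Flink's Kafka sources also expose explicit start-position settings); the topic and group ids are illustrative:
# Same processing job, two start positions
replay_properties = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'events-replay',       # fresh group id, so no committed offsets exist
    'auto.offset.reset': 'earliest'    # read the topic from the beginning = "batch"
}
realtime_properties = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'events-live',
    'auto.offset.reset': 'latest'      # only new records = real-time
}
# e.g. FlinkKafkaConsumer(topics=['events'], deserialization_schema=..., properties=replay_properties)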
When to use:
- Everything can be modeled as stream
- Want to avoid duplicate code
- Kafka is central to architecture
Trade-offs:
- ✓ Simpler than Lambda (one codebase)
- ✓ Kafka replay enables “batch”
- ✗ Everything must be streamable
- ✗ Kafka storage costs for replay
Incremental Processing Patterns
6. Delta Processing (Process Only Changes)
Pattern: Track what changed, process only deltas.
Implementation:
-- dbt incremental model
{{ config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge'
) }}
SELECT
order_id,
user_id,
order_date,
amount,
status,
updated_at
FROM {{ source('raw', 'orders') }}
{% if is_incremental() %}
-- Only process changed/new records
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
With change data capture (CDC):
# Debezium CDC captures database changes
{
    "before": {"order_id": 123, "status": "pending", "amount": 100},
    "after": {"order_id": 123, "status": "completed", "amount": 100},
    "op": "u",  # update
    "ts_ms": 1707580800000
}

# Process only changes
def process_cdc_event(event):
    if event['op'] == 'c':    # create
        insert_to_warehouse(event['after'])
    elif event['op'] == 'u':  # update
        update_warehouse(event['after'])
    elif event['op'] == 'd':  # delete
        delete_from_warehouse(event['before'])
When to use:
- Large datasets where full refresh is expensive
- Source system provides change tracking
- Append-only or update patterns
Trade-offs:
- ✓ Much faster than full refresh
- ✓ Lower cost (less data processed)
- ✗ Requires change tracking
- ✗ More complex than full refresh
7. Slowly Changing Dimensions (SCD Type 2)
Pattern: Track historical changes to dimension data.
Implementation:
-- dbt snapshot (SCD Type 2)
{% snapshot orders_snapshot %}
{{
    config(
      target_schema='snapshots',
      unique_key='order_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}
SELECT * FROM {{ source('raw', 'orders') }}
{% endsnapshot %}

-- Creates table with history tracking:
-- order_id | status  | amount | dbt_valid_from | dbt_valid_to
-- 123      | pending | 100    | 2024-01-01     | 2024-01-05
-- 123      | paid    | 100    | 2024-01-05     | NULL (current)
Manual implementation:
def scd_type2_update(new_data, dimension_table):
    """Update dimension with history tracking (insert/update/today helpers are pseudocode)"""
    for record in new_data:
        existing = dimension_table.get(record['id'])
        if not existing:
            # New record
            insert({
                **record,
                'valid_from': today(),
                'valid_to': None,
                'is_current': True
            })
        elif has_changes(existing, record):
            # Record changed - close old, insert new
            update(existing['id'], {
                'valid_to': today(),
                'is_current': False
            })
            insert({
                **record,
                'valid_from': today(),
                'valid_to': None,
                'is_current': True
            })
When to use:
- Need historical dimension values
- Audit requirements
- Time-based analysis (“what was status on date X?”)
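As a concrete version of that last question, a point-in-time lookup against an SCD Type 2 table is just a range filter on the validity columns. A minimal pandas sketch, assuming the snapshot has been loaded into a DataFrame with the dbt column names shown above and the validity columns parsed as timestamps:
import pandas as pd

def status_as_of(history: pd.DataFrame, order_id: int, as_of: str):
    """Return the status that was current for order_id on the given date."""
    as_of = pd.Timestamp(as_of)
    rows = history[
        (history['order_id'] == order_id)
        & (history['dbt_valid_from'] <= as_of)
        # dbt_valid_to is NULL for the current row, so treat NULL as "still valid"
        & (history['dbt_valid_to'].isna() | (history['dbt_valid_to'] > as_of))
    ]
    return rows['status'].iloc[0] if not rows.empty else None

# e.g. status_as_of(snapshot_df, 123, '2024-01-03')  ->  'pending'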
Data Quality Patterns
8. Schema on Write vs Schema on Read
Schema on Write: Enforce schema when writing (traditional databases).
Schema on Read: Store raw, apply schema when reading (data lakes).
Implementation:
-- Schema on Write (strict)
CREATE TABLE orders (
    order_id BIGINT NOT NULL,
    amount DECIMAL(10,2) NOT NULL CHECK (amount >= 0),
    status VARCHAR(20) CHECK (status IN ('pending', 'completed')),
    PRIMARY KEY (order_id)
);

-- Bad data rejected on INSERT
INSERT INTO orders VALUES (123, -100, 'invalid'); -- ERROR

# Schema on Read (flexible)
# Write raw JSON to S3
s3.put_object(
    Bucket='data-lake',
    Key='orders/2024-01-01.json',
    Body=json.dumps(raw_data)  # Any structure accepted
)

# Apply schema on read
from pyspark.sql.functions import col

df = spark.read.json('s3://data-lake/orders/*.json')
df_cleaned = df \
    .filter(col('amount') >= 0) \
    .filter(col('status').isin(['pending', 'completed']))
When to use:
- Schema on Write: Transactional systems, strong guarantees
- Schema on Read: Exploratory analysis, schema evolution
9. Dead Letter Queue Pattern
Pattern: Route failed records to separate queue for manual inspection.
Implementation:
import json
import traceback
import uuid
from datetime import datetime

def process_with_dlq(record):
    """Process record with dead letter queue for failures"""
    try:
        # Validate
        if not is_valid(record):
            send_to_dlq(record, reason="validation_failed")
            return
        # Transform
        transformed = transform(record)
        # Load
        load_to_warehouse(transformed)
    except Exception as e:
        # Send to DLQ for manual inspection
        send_to_dlq(record, reason=str(e), stacktrace=traceback.format_exc())
        # Alert if DLQ growing
        if dlq_size() > THRESHOLD:
            alert_team("DLQ size exceeded threshold")

def send_to_dlq(record, reason, stacktrace=None):
    """Send failed record to dead letter queue"""
    dlq_record = {
        'original_record': record,
        'failure_reason': reason,
        'stacktrace': stacktrace,
        'timestamp': str(datetime.now()),
        'pipeline': 'etl_pipeline'
    }
    s3.put_object(
        Bucket='dlq-bucket',
        Key=f"dlq/{datetime.now().date()}/{uuid.uuid4()}.json",
        Body=json.dumps(dlq_record)
    )
When to use:
- Production pipelines that can’t fail on bad data
- Need to inspect failures manually
- Want to continue processing good records
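The other half of the pattern is draining the queue: after the underlying bug is fixed, dead-lettered records are replayed through the same processing path, and anything that fails again simply lands back in the DLQ. A minimal sketch using the same illustrative bucket layout and s3 client as send_to_dlq above:
import json

def reprocess_dlq(date_prefix):
    """Replay dead-lettered records for one day through the normal pipeline."""
    listing = s3.list_objects_v2(Bucket='dlq-bucket', Prefix=f'dlq/{date_prefix}/')
    for obj in listing.get('Contents', []):
        body = s3.get_object(Bucket='dlq-bucket', Key=obj['Key'])['Body'].read()
        dlq_record = json.loads(body)
        # Re-run the original record through the same entry point
        process_with_dlq(dlq_record['original_record'])

# e.g. reprocess_dlq('2024-01-15')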
Orchestration Patterns
10. Event-Driven Architecture
Pattern: Pipelines trigger based on events, not schedules.
Implementation:
# S3 event triggers Lambda
# AWS Lambda function triggered on S3 upload
import boto3

def lambda_handler(event, context):
    """Triggered when file lands in S3"""
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        # Trigger processing
        if key.startswith('raw/orders/'):
            trigger_order_processing(bucket, key)
        elif key.startswith('raw/users/'):
            trigger_user_processing(bucket, key)

# Airflow with sensor
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

s3_sensor = S3KeySensor(
    task_id='wait_for_file',
    bucket_name='data-bucket',
    bucket_key='raw/orders/{{ ds }}.csv',
    poke_interval=300  # Check every 5 minutes
)
process_task = PythonOperator(
    task_id='process',
    python_callable=process_orders
)
s3_sensor >> process_task
When to use:
- Data arrives irregularly
- Want to process immediately when data available
- Reduce unnecessary polling
11. Fan-Out/Fan-In Pattern
Pattern: Parallel processing then aggregation.
Implementation:
# Airflow dynamic task generation
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator

def fan_out_tasks():
    """Generate one task per partition"""
    partitions = get_partitions()  # e.g. ['2024-01-01', '2024-01-02', ...]
    tasks = []
    for partition in partitions:
        task = PythonOperator(
            task_id=f'process_{partition}',
            python_callable=process_partition,
            op_kwargs={'partition': partition}
        )
        tasks.append(task)
    return tasks

def fan_in_task(**context):
    """Combine results from all partitions"""
    results = []
    for partition in get_partitions():
        task_id = f'process_{partition}'
        result = context['ti'].xcom_pull(task_ids=task_id)
        results.append(result)
    # Combine all results
    final_result = combine(results)
    return final_result

with DAG('fan_out_fan_in', start_date=datetime(2024, 1, 1), catchup=False) as dag:
    start = DummyOperator(task_id='start')
    # Fan out
    process_tasks = fan_out_tasks()
    # Fan in
    combine_task = PythonOperator(
        task_id='combine',
        python_callable=fan_in_task
    )
    start >> process_tasks >> combine_task
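One limitation of the approach above is that the partition list must be known when the DAG file is parsed. Airflow 2.3 introduced dynamic task mapping, which expands at runtime instead; a minimal sketch (Airflow 2.4+ shown, reusing the same illustrative get_partitions / process_partition / combine helpers):
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def fan_out_fan_in_mapped():

    @task
    def list_partitions():
        return get_partitions()              # decided at runtime, not at parse time

    @task
    def process(partition):
        return process_partition(partition)  # one mapped task instance per partition

    @task
    def combine_all(results):
        return combine(list(results))        # fan in over all mapped results

    combine_all(process.expand(partition=list_partitions()))

fan_out_fan_in_mapped()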
Cost Optimization Patterns
12. Partitioning Strategy
Pattern: Organize data to minimize scanning.
Implementation:
# Write partitioned data
df.write \
    .partitionBy('year', 'month', 'day') \
    .parquet('s3://bucket/events/')

# File structure:
# events/year=2024/month=01/day=01/data.parquet
# events/year=2024/month=01/day=02/data.parquet

-- Query only relevant partitions
SELECT * FROM events
WHERE year = 2024 AND month = 1 AND day = 15

-- Scans only:  events/year=2024/month=01/day=15/
-- Instead of:  entire events/ directory
-- Cost: $0.05 vs $50 (1000x cheaper)
13. Data Tiering (Hot/Warm/Cold)
Pattern: Move old data to cheaper storage.
Implementation:
# S3 lifecycle policies
s3_lifecycle = {
    'Rules': [{
        'Id': 'tier-old-data',
        'Status': 'Enabled',
        'Filter': {'Prefix': ''},  # apply to all objects in the bucket
        'Transitions': [
            {
                'Days': 30,
                'StorageClass': 'STANDARD_IA'  # Infrequent Access
            },
            {
                'Days': 90,
                'StorageClass': 'GLACIER'      # Archive
            }
        ]
    }]
}
s3.put_bucket_lifecycle_configuration(
    Bucket='data-lake',
    LifecycleConfiguration=s3_lifecycle
)

-- Snowflake time travel + auto-clustering
ALTER TABLE events SET DATA_RETENTION_TIME_IN_DAYS = 7;
ALTER TABLE events CLUSTER BY (event_date);
Modern Patterns
14. Medallion Architecture (Bronze/Silver/Gold)
Pattern: Layer data by refinement level.
Architecture:
Bronze (Raw)       →  Silver (Cleaned)  →  Gold (Business)
Raw JSON/CSV          Validated            Aggregated
No transformation     Deduplicated         Joined
All data              Typed columns        Analytics-ready
Implementation:
from pyspark.sql.functions import col, count, sum, to_timestamp

# Bronze: Raw ingestion
bronze_df = spark.read.json('s3://landing/events/')
bronze_df.write.format('delta').save('s3://bronze/events/')

# Silver: Clean and validate
silver_df = (spark.read.format('delta').load('s3://bronze/events/')
    .dropDuplicates(['event_id'])
    .filter(col('event_date').isNotNull())
    .withColumn('event_timestamp', to_timestamp('event_time')))
silver_df.write.format('delta').save('s3://silver/events/')

# Gold: Business logic (users is a dimension DataFrame loaded elsewhere)
gold_df = (spark.read.format('delta').load('s3://silver/events/')
    .join(users, 'user_id')
    .groupBy('date', 'user_segment')
    .agg(
        count('event_id').alias('event_count'),
        sum('revenue').alias('total_revenue')
    ))
gold_df.write.format('delta').save('s3://gold/daily_metrics/')
When to use:
- Data lakehouse architecture
- Multiple consumers with different needs
- Want clear data quality zones
15. Reverse ETL
Pattern: Push warehouse data back to operational systems.
Implementation:
# Sync a warehouse segment to Salesforce with a reverse ETL tool
# (Census-style pseudocode; the real product is configured through its UI/API rather than this SDK)
census = Census(api_key='...')

# Define sync
sync = census.create_sync(
    source={
        'connection': 'snowflake',
        'query': '''
            SELECT
                email,
                total_lifetime_value,
                last_purchase_date,
                customer_segment
            FROM analytics.customer_360
            WHERE total_lifetime_value > 10000
        '''
    },
    destination={
        'connection': 'salesforce',
        'object': 'Contact',
        'mappings': {
            'email': 'Email',
            'total_lifetime_value': 'Lifetime_Value__c',
            'customer_segment': 'Segment__c'
        }
    },
    schedule='daily'
)
When to use:
- Activate warehouse insights in operational systems
- Personalization, marketing automation
- Close the loop from analytics to operations
Choosing the Right Pattern
Decision framework:
By latency requirement:
- < 1 second: True streaming (Flink, Kafka Streams)
- < 15 minutes: Micro-batch (Spark Streaming)
- < 1 hour: Frequent batch
- Daily+: Traditional batch (ETL/ELT)
By data volume:
- < 100 GB: Single-machine processing
- 100 GB — 10 TB: Distributed batch (Spark)
- > 10 TB: Partitioned lakehouse (Delta, Iceberg)
By team skills:
- SQL-focused: ELT pattern (dbt)
- Python-focused: ETL pattern (Airflow + pandas)
- Mixed: Modern data stack (Fivetran + dbt)
By cost sensitivity:
- Budget-constrained: ELT (warehouse compute is predictable)
- Scale-constrained: Data lake (cheaper storage)
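These axes can be captured as a first-pass checklist. A toy sketch that encodes only the latency and volume rules above; the function name and thresholds are illustrative, and real decisions also weigh team skills, cost sensitivity, and existing infrastructure:
def suggest_pattern(latency_seconds: float, volume_gb: float) -> str:
    """First-pass suggestion from the latency/volume rules above -- not a substitute for prototyping."""
    if latency_seconds < 1:
        return 'true streaming (Flink, Kafka Streams)'
    if latency_seconds < 15 * 60:
        return 'micro-batch (Spark Structured Streaming)'
    if volume_gb < 100:
        return 'single-machine batch (pandas + cron/Airflow)'
    if volume_gb < 10_000:
        return 'distributed batch ETL/ELT (Spark or warehouse + dbt)'
    return 'partitioned lakehouse (Delta, Iceberg) with incremental processing'

# e.g. suggest_pattern(latency_seconds=3600, volume_gb=500)
#      -> 'distributed batch ETL/ELT (Spark or warehouse + dbt)'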
The Bottom Line
Architecture patterns are templates, not rules.
Start simple:
- Begin with ELT if using cloud warehouse
- Add incremental processing when full refresh gets slow
- Implement data quality checks early
- Partition as data grows
- Add streaming only when batch latency is insufficient
Common mistakes:
- Over-engineering early (Lambda architecture for 100 GB data)
- Under-engineering late (daily batch when real-time needed)
- Ignoring cost (full table scans on petabyte data)
- No monitoring (can’t debug what you can’t see)
Best practices:
- Make pipelines idempotent (reruns produce same result)
- Checkpoint progress (can resume from failure)
- Monitor everything (data volume, latency, cost, quality)
- Version control (infrastructure as code)
- Test thoroughly (unit, integration, data quality)
The best architecture is the simplest one that meets requirements. The second-best is the one you can actually maintain.
Good architecture is invisible when it works and obvious when it doesn’t. Start simple, scale when needed.