Data Pipeline Architecture: Building Scalable ETL Systems
Designing and implementing a robust data pipeline using Apache Airflow — handling millions of records daily with error handling, retries, and monitoring.
In this project I designed a scalable ETL pipeline to process data from multiple sources, transform it, and load it into a data warehouse. The system handles over 5 million records daily.
Data Sources
Pipeline Architecture (Apache Airflow DAG)
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
def extract_data():
for source in ['postgres', 'api', 's3', 'kafka']:
extract_from_source(source)
def transform_data():
df = load_raw_data()
df = clean_data(df)
df = validate_schema(df)
return df
def load_data():
df = transform_data()
load_to_warehouse(df)
default_args = {
'owner': 'data-team',
'retries': 3,
'retry_delay': timedelta(minutes=5),
}
dag = DAG(
'etl_pipeline',
default_args=default_args,
schedule_interval='@daily',
start_date=datetime(2024, 1, 1),
)
extract_task = PythonOperator(task_id='extract', python_callable=extract_data, dag=dag)
transform_task = PythonOperator(task_id='transform', python_callable=transform_data, dag=dag)
load_task = PythonOperator(task_id='load', python_callable=load_data, dag=dag)
extract_task >> transform_task >> load_task
Performance Metrics
| Stage | Avg Duration | Records Processed | Success Rate |
|---|---|---|---|
| Extract | 45 min | 5.2M | 99.8% |
| Transform | 28 min | 5.2M | 99.5% |
| Load | 15 min | 5.1M | 99.9% |
Key Learnings
- Idempotent operations prevent duplicate data when tasks are retried
- Proper error handling with exponential back-off is crucial for upstream API failures
- Monitoring and alerting on SLA miss catches issues before they cascade
- Partitioning the warehouse by
load_datekeeps query costs linear as data grows