Data Pipeline Architecture: Building Scalable ETL Systems

In this project I designed a scalable ETL pipeline to process data from multiple sources, transform it, and load it into a data warehouse. The system handles over 5 million records daily.

Data Sources

Data Source Distribution (%)

Structured DB

APIs

Files

Streaming

External

42 % of data comes from structured databases; 28 % from partner APIs.

Pipeline Architecture (Apache Airflow DAG)

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

def extract_data():
    for source in ['postgres', 'api', 's3', 'kafka']:
        extract_from_source(source)

def transform_data():
    df = load_raw_data()
    df = clean_data(df)
    df = validate_schema(df)
    return df

def load_data():
    df = transform_data()
    load_to_warehouse(df)

default_args = {
    'owner':       'data-team',
    'retries':     3,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'etl_pipeline',
    default_args=default_args,
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
)

extract_task   = PythonOperator(task_id='extract',   python_callable=extract_data,   dag=dag)
transform_task = PythonOperator(task_id='transform', python_callable=transform_data, dag=dag)
load_task      = PythonOperator(task_id='load',      python_callable=load_data,      dag=dag)

extract_task >> transform_task >> load_task

Performance Metrics

Stage	Avg Duration	Records Processed	Success Rate
Extract	45 min	5.2M	99.8%
Transform	28 min	5.2M	99.5%
Load	15 min	5.1M	99.9%

Key Learnings

Idempotent operations prevent duplicate data when tasks are retried
Proper error handling with exponential back-off is crucial for upstream API failures
Monitoring and alerting on SLA miss catches issues before they cascade
Partitioning the warehouse by load_date keeps query costs linear as data grows