← Journal

Data Pipeline Architecture: Building Scalable ETL Systems

Designing and implementing a robust data pipeline using Apache Airflow — handling millions of records daily with error handling, retries, and monitoring.

In this project I designed a scalable ETL pipeline to process data from multiple sources, transform it, and load it into a data warehouse. The system handles over 5 million records daily.

Data Sources

Data Source Distribution (%)
42
Structured DB
28
APIs
15
Files
10
Streaming
5
External
42 % of data comes from structured databases; 28 % from partner APIs.

Pipeline Architecture (Apache Airflow DAG)

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

def extract_data():
    for source in ['postgres', 'api', 's3', 'kafka']:
        extract_from_source(source)

def transform_data():
    df = load_raw_data()
    df = clean_data(df)
    df = validate_schema(df)
    return df

def load_data():
    df = transform_data()
    load_to_warehouse(df)

default_args = {
    'owner':       'data-team',
    'retries':     3,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'etl_pipeline',
    default_args=default_args,
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
)

extract_task   = PythonOperator(task_id='extract',   python_callable=extract_data,   dag=dag)
transform_task = PythonOperator(task_id='transform', python_callable=transform_data, dag=dag)
load_task      = PythonOperator(task_id='load',      python_callable=load_data,      dag=dag)

extract_task >> transform_task >> load_task

Performance Metrics

StageAvg DurationRecords ProcessedSuccess Rate
Extract45 min5.2M99.8%
Transform28 min5.2M99.5%
Load15 min5.1M99.9%

Key Learnings

  • Idempotent operations prevent duplicate data when tasks are retried
  • Proper error handling with exponential back-off is crucial for upstream API failures
  • Monitoring and alerting on SLA miss catches issues before they cascade
  • Partitioning the warehouse by load_date keeps query costs linear as data grows