May 10, 2025

Workflow Orchestration at Scale with Airflow

How we tamed 300+ data workflows with Apache Airflow — covering DAG design patterns, failure handling, and operational practices that actually scale.

Data Pipelines · Airflow · Automation · DevOps

Starting Point

When I joined the project, workflow orchestration was a mix of cron jobs, bash scripts, and manual runbooks. It technically worked, but:

  • Nobody had a clear view of what was running and when
  • Failures required manual investigation and restart
  • Dependencies between jobs were managed by "just schedule this one an hour after that one"
  • Adding new workflows meant copying and modifying existing scripts

We needed a proper orchestration system. After evaluating options, we chose Apache Airflow for its flexibility, Python-native DAG definitions, and strong community support.

DAG Design Principles

We established a set of principles early to keep the Airflow deployment manageable as it grew:

1. One DAG Per Logical Workflow

Each DAG represents a single, coherent data workflow. We avoided the temptation to build "mega-DAGs" that orchestrate everything. This kept DAGs understandable and independently deployable.

2. Idempotent Tasks

Every task in every DAG can be safely re-run for any execution date. This means tasks check for existing output, use merge/upsert patterns instead of blind inserts, and never depend on external mutable state.
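The upsert pattern is the core of this. As a minimal sketch (the `daily_metrics` table and its columns are illustrative, not our real schema), a load keyed on the execution date can be re-run any number of times without duplicating rows:

```python
import sqlite3

def load_partition(conn: sqlite3.Connection, execution_date: str, rows: list) -> None:
    """Idempotent load for one execution date: the upsert is keyed on
    (execution_date, metric), so a re-run overwrites the same partition
    instead of appending duplicates."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS daily_metrics (
               execution_date TEXT NOT NULL,
               metric TEXT NOT NULL,
               value REAL,
               PRIMARY KEY (execution_date, metric)
           )"""
    )
    conn.executemany(
        """INSERT INTO daily_metrics (execution_date, metric, value)
           VALUES (?, ?, ?)
           ON CONFLICT (execution_date, metric) DO UPDATE SET value = excluded.value""",
        [(execution_date, m, v) for m, v in rows],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load_partition(conn, "2025-05-01", [("clicks", 120.0), ("signups", 7.0)])
# Safe re-run of the same execution date, e.g. after a retry:
load_partition(conn, "2025-05-01", [("clicks", 125.0), ("signups", 7.0)])
count = conn.execute("SELECT COUNT(*) FROM daily_metrics").fetchone()[0]
clicks = conn.execute(
    "SELECT value FROM daily_metrics WHERE metric = 'clicks'"
).fetchone()[0]
```

The same principle applies regardless of the target store: keying writes on the execution date is what makes "re-run for any date" safe.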

3. Explicit Dependencies

Cross-DAG dependencies use Airflow's ExternalTaskSensor or a custom event-based trigger mechanism. No more "schedule B one hour after A and hope it's done."
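Airflow's ExternalTaskSensor handles the polling for you; to show the semantics without the Airflow machinery, here is a simplified stand-in (the dict-based state store and all names are illustrative): block until the upstream task for the *same* execution date has succeeded, or time out.

```python
import time

def wait_for_upstream(state: dict, dag_id: str, task_id: str, execution_date: str,
                      poke_interval: float = 0.01, timeout: float = 1.0) -> bool:
    """Simplified ExternalTaskSensor-style wait: poll until the upstream
    task for the same execution date reports success, or give up after
    `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if state.get((dag_id, task_id, execution_date)) == "success":
            return True
        time.sleep(poke_interval)
    return False

# Upstream DAG "dag_a" has finished its "publish" task for 2025-05-01:
completed = {("dag_a", "publish", "2025-05-01"): "success"}
ok = wait_for_upstream(completed, "dag_a", "publish", "2025-05-01")
# A different execution date has no matching upstream run, so this times out:
missing = wait_for_upstream(completed, "dag_a", "publish", "2025-05-02", timeout=0.05)
```

The key detail is that the dependency is on a *specific execution date*, not on wall-clock time, which is exactly what the "schedule B an hour after A" approach got wrong.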

4. Configuration Over Code

DAG parameters (source connections, table names, schedule intervals) live in external configuration. This lets us promote DAGs through environments (dev → staging → prod) without code changes.
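In practice this means the DAG file resolves its parameters from a per-environment config at parse time. A minimal sketch (connection IDs, table names, and schedules below are invented for illustration):

```python
# One config entry per environment; the DAG code itself never changes
# between dev, staging, and prod -- only which entry is loaded.
CONFIGS = {
    "dev":     {"source_conn": "postgres_dev",     "target_table": "sales_raw", "schedule": "@daily"},
    "staging": {"source_conn": "postgres_staging", "target_table": "sales_raw", "schedule": "@daily"},
    "prod":    {"source_conn": "postgres_prod",    "target_table": "sales_raw", "schedule": "@hourly"},
}

def dag_params(env: str) -> dict:
    """Resolve the parameters a DAG is built from, so promoting a DAG
    through environments is a config change, not a code change."""
    return CONFIGS[env]

dev_params = dag_params("dev")
prod_params = dag_params("prod")
```

In a real deployment the mapping would live in a config file or Airflow Variables rather than an in-module dict, but the shape is the same: the DAG is a template, the environment supplies the values.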

Failure Handling

Pipeline failures are normal. What matters is how quickly and cleanly you recover.

Retry Configuration: Each task has sensible retry settings based on its failure profile. API calls retry with exponential backoff. Database loads retry with shorter delays. Long-running Spark jobs get fewer retries with longer intervals.
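The delay schedules can be sketched as a function of the failure profile (the specific numbers below are illustrative defaults, not our production values):

```python
def retry_delay(profile: str, attempt: int) -> float:
    """Seconds to wait before retry `attempt` (1-based), by failure profile:
    API calls back off exponentially, database loads retry quickly,
    long-running Spark jobs wait a long fixed interval (and are capped
    at fewer retries elsewhere in the task config)."""
    if profile == "api":
        return 30.0 * (2 ** (attempt - 1))  # 30s, 60s, 120s, ...
    if profile == "db_load":
        return 10.0
    if profile == "spark":
        return 600.0
    raise ValueError(f"unknown failure profile: {profile}")

first_api = retry_delay("api", 1)
third_api = retry_delay("api", 3)
db = retry_delay("db_load", 2)
```

In Airflow itself this maps onto each task's `retries`, `retry_delay`, and `retry_exponential_backoff` settings; the point is that those values are chosen per failure profile, not copy-pasted globally.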

Alerting: Failures trigger alerts to the responsible team's channel with context: which DAG, which task, which execution date, and a link to the logs. No alert fatigue from non-actionable notifications.
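A failure callback only needs to assemble that context into a payload. A hedged sketch (the base URL and log-link path are placeholders, not Airflow's actual URL scheme):

```python
def build_alert(dag_id: str, task_id: str, execution_date: str,
                base_url: str = "https://airflow.example.com") -> dict:
    """Build an alert payload carrying everything an on-call engineer
    needs to act: which DAG, which task, which execution date, and a
    link to the logs."""
    log_url = f"{base_url}/dags/{dag_id}/runs/{execution_date}/tasks/{task_id}/log"
    return {
        "title": f"[{dag_id}] task '{task_id}' failed for {execution_date}",
        "dag_id": dag_id,
        "task_id": task_id,
        "execution_date": execution_date,
        "log_url": log_url,
    }

alert = build_alert("sales_daily", "load_to_warehouse", "2025-05-01")
```

Wired into a task's `on_failure_callback`, a function like this posts to the owning team's channel; the filtering for actionability happens before the alert is sent, not after.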

Dead-Letter Patterns: For streaming ingestion tasks, records that fail processing go to a dead-letter store for investigation rather than blocking the entire pipeline.
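The routing logic itself is simple; the discipline is in applying it everywhere. A minimal sketch (the in-memory list stands in for whatever dead-letter store you use):

```python
def process_batch(records: list, transform, dead_letter: list) -> list:
    """Apply `transform` to each record; a record that fails is appended
    to the dead-letter store with its error, instead of aborting the
    whole batch."""
    out = []
    for rec in records:
        try:
            out.append(transform(rec))
        except Exception as exc:
            dead_letter.append({"record": rec, "error": str(exc)})
    return out

dlq = []
good = process_batch(["1", "2", "x", "4"], int, dlq)  # "x" is unparseable
```

One bad record lands in the dead-letter store for later investigation while the other three flow through, which is exactly the trade the pattern makes: availability of the pipeline over completeness of any single batch.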

Operational Practices

DAG Versioning

DAGs are version-controlled alongside the code they orchestrate. A PR that changes a data transformation also updates the corresponding DAG if the orchestration logic needs to change.

Environment Parity

Development, staging, and production Airflow instances are configured identically (except for scale and data connections). This catches orchestration issues before they reach production.

Monitoring Dashboard

A custom monitoring dashboard (separate from Airflow's UI) provides a high-level view: pipeline freshness, success rates, average runtimes, and SLA adherence. This is what stakeholders and on-call engineers look at — not raw Airflow logs.

Scaling Lessons

  • Use the KubernetesExecutor (or CeleryExecutor) early. The LocalExecutor is fine for development but doesn't scale. We switched at ~100 DAGs and wished we'd done it sooner.
  • Keep Airflow metadata database healthy. Regular cleanup of old task instances and DAG runs prevents the metadata DB from becoming a bottleneck.
  • Separate infrastructure from DAG concerns. The platform team manages Airflow infrastructure; data teams own their DAGs.

Outcome

We went from ~50 fragile cron-based jobs to 300+ managed workflows with:

  • 99.2% pipeline success rate (up from ~85%)
  • Mean time to recovery under 15 minutes for most failures
  • Full dependency visibility and lineage
  • Self-service DAG creation for data engineering teams

The key insight: orchestration is infrastructure, not a project. It needs dedicated ownership, operational practices, and continuous investment.