
March 8, 2025

Building AI-Powered Data Quality Checks

How we used machine learning to detect data quality issues before they reached production dashboards — architecture, model choices, and operational results.

Tags: AI, Data Quality, Machine Learning, Data Pipelines

The Data Quality Problem

Data quality issues are insidious. A source system changes a date format. A vendor starts sending null values where there used to be defaults. A numeric field starts including outliers that skew aggregations. These problems rarely announce themselves — they surface as "the numbers look wrong" in a stakeholder meeting.

Our existing data quality checks were rule-based: null checks, range validations, referential integrity constraints. They caught the obvious problems but missed:

  • Gradual distribution shifts
  • Subtle schema drift (new enum values, changed precision)
  • Correlation breakdowns between related fields
  • Seasonal pattern violations

We built an ML-powered quality system to catch what rules couldn't.

Architecture

The system operates as a sidecar to our data pipelines, not inline. This was a deliberate choice — we didn't want quality checks to become a bottleneck or a single point of failure for data delivery.

Data Profiling Service

  • Runs after each pipeline load completes
  • Computes statistical profiles: distributions, null rates, cardinality, correlation matrices
  • Stores profiles in a time-series format for trend analysis
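A minimal sketch of what such a profiling pass might look like with pandas (the function name and exact metric set are illustrative, not our production code):

```python
import pandas as pd

def profile_dataset(df: pd.DataFrame) -> dict:
    """Compute a statistical profile for one pipeline load (sketch)."""
    profile = {"row_count": len(df)}
    for col in df.columns:
        s = df[col]
        stats = {
            "null_rate": float(s.isna().mean()),
            "distinct_count": int(s.nunique()),
        }
        # Numeric columns additionally get distribution statistics
        if pd.api.types.is_numeric_dtype(s):
            stats.update({
                "mean": float(s.mean()),
                "std": float(s.std()),
                "p50": float(s.quantile(0.5)),
                "p99": float(s.quantile(0.99)),
            })
        profile[col] = stats
    return profile
```

Each load appends one such profile to a time-series store keyed by dataset, which is what makes the trend analysis in the next layer possible.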

Anomaly Detection Models

  • Per-dataset models trained on historical profiles
  • Isolation Forest for multivariate outlier detection across profile metrics
  • Prophet-based models for time-series metrics with strong seasonal patterns
  • Ensemble scoring combines multiple signals into a single confidence score
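The Isolation Forest piece can be sketched with scikit-learn. Here the training history is synthetic (90 days of three profile metrics: row count, null rate, column mean); the metric choices and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical history: one row per day, one column per profile metric
# (row count, null rate, column mean), drawn around stable baselines.
rng = np.random.default_rng(42)
history = rng.normal(loc=[1e6, 0.001, 50.0],
                     scale=[1e4, 0.0002, 1.0],
                     size=(90, 3))

model = IsolationForest(n_estimators=200, contamination="auto",
                        random_state=0)
model.fit(history)

# Today's profile: the null rate jumped from ~0.1% to 15%.
today = np.array([[1.0e6, 0.15, 50.0]])
score = model.decision_function(today)[0]    # lower = more anomalous
is_anomaly = model.predict(today)[0] == -1   # -1 marks an outlier
```

In the real system this per-model score is one input to the ensemble; the Prophet residuals contribute the seasonal signal before the combined confidence score is computed.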

Alert and Triage Layer

  • Anomalies above confidence threshold trigger alerts
  • Each alert includes: affected dataset, anomalous metrics, historical context, and severity estimate
  • Integrates with the team's incident management workflow
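The alert payload described above might be modeled as a small dataclass (names and severity levels here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class QualityAlert:
    """Alert payload carrying the fields listed above (sketch)."""
    dataset: str
    anomalous_metrics: dict   # metric name -> (current value, baseline)
    historical_context: str   # e.g. "last 7 runs stable at 0.1%"
    severity: str             # "low" | "medium" | "high"
    confidence: float         # ensemble score that crossed the threshold

    def summary(self) -> str:
        """One-line rendering for the incident management tool."""
        parts = [f"{m}: {cur} (baseline {base})"
                 for m, (cur, base) in self.anomalous_metrics.items()]
        return f"[{self.severity}] {self.dataset}: " + "; ".join(parts)
```

Keeping the payload structured, rather than free text, lets the triage layer route and deduplicate alerts before they hit a human.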

Model Choices

Why Isolation Forest?

Isolation Forest works well for our use case because:

  • It handles high-dimensional data (many profile metrics per dataset)
  • It's unsupervised — no need to label historical quality issues
  • It's fast to train and score, suitable for running after every pipeline load
  • It naturally handles the "everything is normal most of the time" data distribution

Why Not Deep Learning?

We considered autoencoders but decided against them. The additional complexity wasn't justified for our data volume, and Isolation Forest was easier to explain to stakeholders. When an alert fires, engineers need to understand why — "these three metrics are statistically unusual compared to the last 90 days" is more actionable than "the reconstruction error exceeded a threshold."

Implementation Details

Training: Models retrain weekly on a rolling 90-day window. We experimented with longer windows but found that data patterns evolve enough that older history introduced noise.
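The rolling-window selection is simple enough to sketch directly (the profile-store shape here is a hypothetical list of date/vector pairs):

```python
from datetime import date, timedelta

def training_window(today: date, days: int = 90) -> tuple[date, date]:
    """Rolling window [today - days, today) used for the weekly retrain."""
    return today - timedelta(days=days), today

def select_profiles(profiles, today: date, days: int = 90):
    """Keep only profiles inside the rolling window.

    `profiles` is a hypothetical list of (load_date, metric_vector) pairs.
    """
    start, end = training_window(today, days)
    return [vec for d, vec in profiles if start <= d < end]
```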

Feature Engineering: The key profiles we track per dataset:

  • Row counts and growth rates
  • Null rates per column
  • Distinct value counts and cardinality ratios
  • Numeric distribution statistics (mean, std, percentiles)
  • String length distributions
  • Freshness (time since last update)
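Flattening these per-dataset profiles into a model input might look like this (key names are illustrative; `prev_profile` supplies the previous load for the growth-rate feature):

```python
import numpy as np

def feature_vector(profile: dict, prev_profile: dict) -> np.ndarray:
    """Flatten one dataset profile into the metrics listed above (sketch)."""
    growth = ((profile["row_count"] - prev_profile["row_count"])
              / max(prev_profile["row_count"], 1))
    return np.array([
        profile["row_count"],
        growth,                            # row-count growth rate
        profile["null_rate"],
        profile["distinct_ratio"],         # distinct values / rows
        profile["mean"],                   # numeric distribution stats
        profile["std"],
        profile["p99"],
        profile["avg_str_len"],            # string length distribution
        profile["minutes_since_update"],   # freshness
    ], dtype=float)
```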

Threshold Tuning: Rather than a single global threshold, each dataset has calibrated thresholds based on its historical volatility. A dataset that naturally fluctuates needs a higher anomaly threshold than one that's rock-stable.
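One simple way to derive such a per-dataset threshold is from the volatility of its own historical anomaly scores (a sketch, assuming lower scores mean more anomalous, as with Isolation Forest's `decision_function`):

```python
import numpy as np

def calibrate_threshold(historical_scores: np.ndarray,
                        n_sigma: float = 3.0) -> float:
    """Flag scores more than n_sigma below the historical mean.

    A naturally volatile dataset has a larger score std, so its
    threshold sits lower and it alerts less eagerly (sketch).
    """
    return float(historical_scores.mean()
                 - n_sigma * historical_scores.std())
```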

Operational Results

After six months in production:

  • Detection rate: Caught 92% of data quality issues before they reached dashboards (validated against manually reported incidents)
  • False positive rate: ~8% of alerts were non-issues — acceptable for our team's volume
  • Mean detection time: Issues detected within 20 minutes of pipeline completion, vs. hours or days for manual discovery
  • Investigation time saved: Each caught issue saved an estimated 2-4 hours of manual investigation

Lessons Learned

  1. Start with profiling, not ML. Statistical profiles alone catch a surprising number of issues. The ML layer adds value for subtle and multivariate anomalies, but profiling is the foundation.
  2. Alert fatigue is the real risk. Tune aggressively for low false positives. One missed real issue is better than training engineers to ignore alerts.
  3. Context in alerts is everything. An alert that says "anomaly detected in dataset X" is useless. An alert that says "null rate in column Y jumped from 0.1% to 15% — last 7 runs were stable at 0.1%" is actionable.
  4. Keep the system decoupled from pipelines. Running quality checks as a sidecar means pipeline failures don't cascade to quality monitoring and vice versa.
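Lesson 3 is easy to act on mechanically. A formatter along these lines (hypothetical names) turns the raw metrics into the kind of message that example describes:

```python
def null_rate_alert(column: str, current: float,
                    recent: list[float]) -> str:
    """Render a context-rich null-rate alert message (sketch)."""
    baseline = sum(recent) / len(recent)
    return (f"null rate in column {column} jumped from "
            f"{baseline:.1%} to {current:.1%} - "
            f"last {len(recent)} runs were stable at {baseline:.1%}")
```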