
March 8, 2025

Building AI-Powered Data Quality Checks

How we used machine learning to detect data quality issues before they reached production dashboards — architecture, model choices, and operational results.

Tags: AI, Data Quality, Machine Learning, Data Pipelines

The Data Quality Problem

Data quality issues are insidious. A source system changes a date format. A vendor starts sending null values where there used to be defaults. A numeric field starts including outliers that skew aggregations. These problems rarely announce themselves — they surface as "the numbers look wrong" in a stakeholder meeting.

Our existing data quality checks were rule-based: null checks, range validations, referential integrity constraints. They caught the obvious problems but missed:

  • Gradual distribution shifts
  • Subtle schema drift (new enum values, changed precision)
  • Correlation breakdowns between related fields
  • Seasonal pattern violations

We built an ML-powered quality system to catch what rules couldn't.

Architecture

The system operates as a sidecar to our data pipelines, not inline. This was a deliberate choice — we didn't want quality checks to become a bottleneck or a single point of failure for data delivery.

Data Profiling Service

  • Runs after each pipeline load completes
  • Computes statistical profiles: distributions, null rates, cardinality, correlation matrices
  • Stores profiles in a time-series format for trend analysis
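A minimal sketch of what such a profiling pass might look like with pandas (the function name and exact metric set are illustrative, not our production code):

```python
import pandas as pd

def profile_dataset(df: pd.DataFrame) -> dict:
    """Compute a statistical profile for one pipeline load (sketch)."""
    profile = {"row_count": len(df)}
    for col in df.columns:
        s = df[col]
        stats = {
            "null_rate": float(s.isna().mean()),
            "distinct_count": int(s.nunique()),
        }
        # Numeric columns additionally get distribution statistics
        if pd.api.types.is_numeric_dtype(s):
            stats.update({
                "mean": float(s.mean()),
                "std": float(s.std()),
                "p50": float(s.quantile(0.5)),
                "p99": float(s.quantile(0.99)),
            })
        profile[col] = stats
    return profile
```

Each load appends one such profile to a time-series store keyed by dataset, which is what makes the trend analysis in the next layer possible.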

Anomaly Detection Models

  • Per-dataset models trained on historical profiles
  • Isolation Forest for multivariate outlier detection across profile metrics
  • Prophet-based models for time-series metrics with strong seasonal patterns
  • Ensemble scoring combines multiple signals into a single confidence score
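The Isolation Forest piece can be sketched with scikit-learn. Here the training history is synthetic (90 days of three profile metrics: row count, null rate, column mean); the metric choices and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical history: one row per day, one column per profile metric
# (row count, null rate, column mean), drawn around stable baselines.
rng = np.random.default_rng(42)
history = rng.normal(loc=[1e6, 0.001, 50.0],
                     scale=[1e4, 0.0002, 1.0],
                     size=(90, 3))

model = IsolationForest(n_estimators=200, contamination="auto",
                        random_state=0)
model.fit(history)

# Today's profile: the null rate jumped from ~0.1% to 15%.
today = np.array([[1.0e6, 0.15, 50.0]])
score = model.decision_function(today)[0]    # lower = more anomalous
is_anomaly = model.predict(today)[0] == -1   # -1 marks an outlier
```

In the real system this per-model score is one input to the ensemble; the Prophet residuals contribute the seasonal signal before the combined confidence score is computed.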

Alert and Triage Layer

  • Anomalies above confidence threshold trigger alerts
  • Each alert includes: affected dataset, anomalous metrics, historical context, and severity estimate
  • Integrates with the team's incident management workflow
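The alert payload described above might be modeled as a small dataclass (names and severity levels here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class QualityAlert:
    """Alert payload carrying the fields listed above (sketch)."""
    dataset: str
    anomalous_metrics: dict   # metric name -> (current value, baseline)
    historical_context: str   # e.g. "last 7 runs stable at 0.1%"
    severity: str             # "low" | "medium" | "high"
    confidence: float         # ensemble score that crossed the threshold

    def summary(self) -> str:
        """One-line rendering for the incident management tool."""
        parts = [f"{m}: {cur} (baseline {base})"
                 for m, (cur, base) in self.anomalous_metrics.items()]
        return f"[{self.severity}] {self.dataset}: " + "; ".join(parts)
```

Keeping the payload structured, rather than free text, lets the triage layer route and deduplicate alerts before they hit a human.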

Model Choices

Why Isolation Forest?

Isolation Forest works well for our use case because:

  • It handles high-dimensional data (many profile metrics per dataset)
  • It's unsupervised — no need to label historical quality issues
  • It's fast to train and score, suitable for running after every pipeline load
  • It naturally handles the "everything is normal most of the time" data distribution

Why Not Deep Learning?

We considered autoencoders but decided against them. The additional complexity wasn't justified for our data volume, and Isolation Forest was easier to explain to stakeholders. When an alert fires, engineers need to understand why — "these three metrics are statistically unusual compared to the last 90 days" is more actionable than "the reconstruction error exceeded a threshold."

Implementation Details

Training: Models retrain weekly on a rolling 90-day window. We experimented with longer windows but found that data patterns evolve enough that older history introduced noise.
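The rolling-window selection is simple enough to sketch directly (the profile-store shape here is a hypothetical list of date/vector pairs):

```python
from datetime import date, timedelta

def training_window(today: date, days: int = 90) -> tuple[date, date]:
    """Rolling window [today - days, today) used for the weekly retrain."""
    return today - timedelta(days=days), today

def select_profiles(profiles, today: date, days: int = 90):
    """Keep only profiles inside the rolling window.

    `profiles` is a hypothetical list of (load_date, metric_vector) pairs.
    """
    start, end = training_window(today, days)
    return [vec for d, vec in profiles if start <= d < end]
```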

Feature Engineering: The key profiles we track per dataset:

  • Row counts and growth rates
  • Null rates per column
  • Distinct value counts and cardinality ratios
  • Numeric distribution statistics (mean, std, percentiles)
  • String length distributions
  • Freshness (time since last update)
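Flattening these per-dataset profiles into a model input might look like this (key names are illustrative; `prev_profile` supplies the previous load for the growth-rate feature):

```python
import numpy as np

def feature_vector(profile: dict, prev_profile: dict) -> np.ndarray:
    """Flatten one dataset profile into the metrics listed above (sketch)."""
    growth = ((profile["row_count"] - prev_profile["row_count"])
              / max(prev_profile["row_count"], 1))
    return np.array([
        profile["row_count"],
        growth,                            # row-count growth rate
        profile["null_rate"],
        profile["distinct_ratio"],         # distinct values / rows
        profile["mean"],                   # numeric distribution stats
        profile["std"],
        profile["p99"],
        profile["avg_str_len"],            # string length distribution
        profile["minutes_since_update"],   # freshness
    ], dtype=float)
```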

Threshold Tuning: Rather than a single global threshold, each dataset has calibrated thresholds based on its historical volatility. A dataset that naturally fluctuates needs a higher anomaly threshold than one that's rock-stable.
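One simple way to derive such a per-dataset threshold is from the volatility of its own historical anomaly scores (a sketch, assuming lower scores mean more anomalous, as with Isolation Forest's `decision_function`):

```python
import numpy as np

def calibrate_threshold(historical_scores: np.ndarray,
                        n_sigma: float = 3.0) -> float:
    """Flag scores more than n_sigma below the historical mean.

    A naturally volatile dataset has a larger score std, so its
    threshold sits lower and it alerts less eagerly (sketch).
    """
    return float(historical_scores.mean()
                 - n_sigma * historical_scores.std())
```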

Operational Results

After six months in production:

  • Detection rate: Caught 92% of data quality issues before they reached dashboards (validated against manually reported incidents)
  • False positive rate: ~8% of alerts were non-issues — acceptable for our team's volume
  • Mean detection time: Issues detected within 20 minutes of pipeline completion, vs. hours or days for manual discovery
  • Investigation time saved: Each caught issue saved an estimated 2-4 hours of manual investigation

Lessons Learned

  1. Start with profiling, not ML. Statistical profiles alone catch a surprising number of issues. The ML layer adds value for subtle and multivariate anomalies, but profiling is the foundation.
  2. Alert fatigue is the real risk. Tune aggressively for low false positives. One missed real issue is better than training engineers to ignore alerts.
  3. Context in alerts is everything. An alert that says "anomaly detected in dataset X" is useless. An alert that says "null rate in column Y jumped from 0.1% to 15% — last 7 runs were stable at 0.1%" is actionable.
  4. Keep the system decoupled from pipelines. Running quality checks as a sidecar means pipeline failures don't cascade to quality monitoring and vice versa.
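Lesson 3 is easy to act on mechanically. A formatter along these lines (hypothetical names) turns the raw metrics into the kind of message that example describes:

```python
def null_rate_alert(column: str, current: float,
                    recent: list[float]) -> str:
    """Render a context-rich null-rate alert message (sketch)."""
    baseline = sum(recent) / len(recent)
    return (f"null rate in column {column} jumped from "
            f"{baseline:.1%} to {current:.1%} - "
            f"last {len(recent)} runs were stable at {baseline:.1%}")
```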