Data Drift Explained
Introduction
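Data drift occurs when the distribution of production inputs diverges from the distribution of the data used to train the model. The model still runs and returns predictions, but the patterns it learned may no longer hold, so quality can degrade without any error being raised.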
What Can Go Wrong in Production
Feature distributions can shift because user behavior changes, upstream systems change formats, a new region is added, or a pipeline starts sending defaults instead of real values.
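One of these failure modes, a pipeline sending defaults instead of real values, often shows up as a single value suddenly dominating a feature. The following is a minimal sketch of that idea, not a complete detector. The helper name `top_value_share`, the sample data, and the use of 0 as the default are all hypothetical.

```python
import pandas as pd


def top_value_share(values: pd.Series) -> tuple:
    """Return the most frequent value and the fraction of rows holding it."""
    # value_counts sorts by frequency, most common first; NaN is counted too
    shares = values.value_counts(normalize=True, dropna=False)
    return shares.index[0], float(shares.iloc[0])


# Hypothetical samples of an "age" feature
train_age = pd.Series([23, 35, 41, 29, 52, 35, 47, 31])
prod_age = pd.Series([0, 0, 38, 0, 0, 27, 0, 0])  # assumption: 0 is a pipeline default

train_value, train_share = top_value_share(train_age)
prod_value, prod_share = top_value_share(prod_age)

print("training:", train_value, train_share)
print("production:", prod_value, prod_share)
```

A jump in the top-value share (here from 0.25 to 0.75) is a hint to check the upstream pipeline before concluding that user behavior changed.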
Key Metrics to Monitor
Track missing-value rates, numeric percentiles, categorical frequencies, out-of-range values, schema changes, and the distribution of predictions.
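Several of these metrics can be computed per feature with a few lines of pandas. The sketch below is illustrative only: the function name `numeric_profile`, the valid range of 0 to 120, and the sample values are assumptions, and schema checks are not covered.

```python
import pandas as pd


def numeric_profile(values: pd.Series, low: float, high: float) -> dict:
    """Summarize one numeric feature: missing rate, percentiles, range violations."""
    present = values.dropna()
    return {
        "missing_rate": float(values.isna().mean()),
        "p05": float(present.quantile(0.05)),
        "p50": float(present.quantile(0.50)),
        "p95": float(present.quantile(0.95)),
        # Share of non-missing values outside the assumed valid range
        "out_of_range_rate": float(((present < low) | (present > high)).mean()),
    }


# Hypothetical production sample with two missing values and one impossible age
prod_age = pd.Series([25, 31, None, 44, 150, 38, None, 29, 52, 47])
profile = numeric_profile(prod_age, low=0, high=120)
print(profile)

# Categorical frequencies: the share of rows per category
prod_country = pd.Series(["US", "US", "DE", "FR", "US"])
country_freq = prod_country.value_counts(normalize=True)
print(country_freq.to_dict())
```

Computing the same profile on the training data and on each production window gives comparable numbers for every metric in the list.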
Practical Example
A simple numeric drift check compares training and production means:
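import pandas as pd

# Toy samples of one numeric feature: training data and recent production data
train = pd.Series([50, 55, 60, 65, 70])
prod = pd.Series([80, 85, 90, 95, 100])

# Convert to plain floats so the printed values are not wrapped in NumPy types
train_mean = float(train.mean())
prod_mean = float(prod.mean())
# Relative change of the production mean against the training mean (assumes train_mean is non-zero)
change = abs(prod_mean - train_mean) / train_mean
print({"train_mean": train_mean, "prod_mean": prod_mean, "relative_change": round(change, 2)})

Output:
{'train_mean': 60.0, 'prod_mean': 90.0, 'relative_change': 0.5}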
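A mean comparison can miss changes in spread or shape, because two distributions can share a mean and still look very different. One common complement is the Population Stability Index (PSI), which compares binned frequencies. PSI is not part of the example above; the sketch below is one possible NumPy implementation, and the function name, the bin count, the `eps` floor, and the synthetic data are assumptions.

```python
import numpy as np


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a baseline sample (expected) and a new sample (actual)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Bin edges come from quantiles of the baseline sample
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, bins + 1)))
    # Use interior edges only, so values outside the baseline range
    # land in the first or last bin instead of being dropped
    interior = edges[1:-1]
    n_bins = len(interior) + 1

    exp_counts = np.bincount(np.searchsorted(interior, expected, side="right"), minlength=n_bins)
    act_counts = np.bincount(np.searchsorted(interior, actual, side="right"), minlength=n_bins)

    # Floor the fractions at eps to avoid log(0) for empty bins
    exp_frac = np.clip(exp_counts / exp_counts.sum(), eps, None)
    act_frac = np.clip(act_counts / act_counts.sum(), eps, None)

    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))


rng = np.random.default_rng(0)
train_sample = rng.normal(60, 8, 1000)    # synthetic training feature
stable_sample = rng.normal(60, 8, 1000)   # production sample, same distribution
shifted_sample = rng.normal(90, 8, 1000)  # production sample, shifted mean

psi_stable = population_stability_index(train_sample, stable_sample)
psi_shifted = population_stability_index(train_sample, shifted_sample)
print("stable:", psi_stable)
print("shifted:", psi_shifted)
```

The shifted sample should score far higher than the stable one. What counts as "too high" is best calibrated against historical production windows rather than taken from a fixed rule.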
Detection Strategy
Baseline production features against the training dataset and recent production windows. Alert on large changes, then inspect whether the change is expected or a pipeline bug.
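The two baselines described above can be checked together. This is a rough sketch that uses the mean as the only statistic; the function name `check_window`, the 20% threshold, and the sample windows are hypothetical.

```python
import pandas as pd


def check_window(current, training, recent_windows, max_rel_change=0.2):
    """Compare the current window mean against training and recent production."""

    def rel_change(value, baseline):
        # Assumes the baseline mean is non-zero
        return abs(value - baseline) / abs(baseline)

    current_mean = float(current.mean())
    train_mean = float(training.mean())
    recent_mean = float(pd.concat(recent_windows).mean())

    vs_training = rel_change(current_mean, train_mean)
    vs_recent = rel_change(current_mean, recent_mean)
    return {
        "vs_training": round(vs_training, 2),
        "vs_recent": round(vs_recent, 2),
        # Alert if either baseline shows a large change
        "alert": vs_training > max_rel_change or vs_recent > max_rel_change,
    }


training = pd.Series([50, 55, 60, 65, 70])
recent_windows = [pd.Series([58, 62, 61]), pd.Series([60, 63, 59])]

drifted = check_window(pd.Series([80, 85, 90, 95, 100]), training, recent_windows)
stable = check_window(pd.Series([59, 61, 62]), training, recent_windows)
print(drifted)
print(stable)
```

Reporting both comparisons helps with triage: a sudden jump against recent windows suggests something changed abruptly, which makes a pipeline bug worth ruling out first, while a gap against training alone may reflect a gradual shift that is already known.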
Common Mistakes
- Treating any drift alert as automatic model failure.
- Not separating real business change from broken input pipelines.
- Monitoring only predictions and not input features.
- Using thresholds without historical baselines.
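To avoid the last mistake, alert bounds can be derived from the feature's own history instead of a guessed constant. The sketch below assumes a series of past daily means and uses a band of three standard deviations; the helper name `historical_bounds`, the multiplier, and the data are all illustrative assumptions.

```python
import pandas as pd


def historical_bounds(history: pd.Series, k: float = 3.0) -> tuple:
    """Return (lower, upper) bounds as mean +/- k standard deviations of history."""
    center = float(history.mean())
    spread = float(history.std())
    return center - k * spread, center + k * spread


# Hypothetical daily means of one feature over the past week
daily_means = pd.Series([60.2, 59.1, 61.0, 60.5, 58.9, 60.8, 59.7])
lower, upper = historical_bounds(daily_means)


def is_unusual(value: float) -> bool:
    """True when a new daily mean falls outside the historical band."""
    return not (lower <= value <= upper)


print(is_unusual(60.4))  # a value close to the historical daily means
print(is_unusual(90.0))  # a value far above the historical daily means
```

A band like this adapts to how noisy each feature normally is, so a stable feature gets tight bounds and a volatile one gets wide bounds.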
Quick Checklist
- Are training feature distributions stored?
- Are production features logged safely?
- Are missing and out-of-range values tracked?
- Are drift alerts reviewed with data owners?
- Is retraining tied to evidence?
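For the first checklist item, one lightweight option is to save summary statistics of each training feature at training time. The sketch below serializes them to JSON; the summary fields, the helper name `baseline_summary`, and the sample columns are assumptions, and where the JSON is stored is left open.

```python
import json

import pandas as pd


def baseline_summary(values: pd.Series) -> dict:
    """Summary statistics of one training feature, using JSON-friendly types."""
    return {
        "count": int(values.count()),
        "missing_rate": float(values.isna().mean()),
        "mean": float(values.mean()),
        "std": float(values.std()),
        "quantiles": {str(q): float(values.quantile(q)) for q in (0.05, 0.5, 0.95)},
    }


# Hypothetical training data with two numeric features
train_df = pd.DataFrame({
    "score": [50, 55, 60, 65, 70],
    "visits": [1, 3, 2, 5, 4],
})

baseline = {column: baseline_summary(train_df[column]) for column in train_df.columns}
payload = json.dumps(baseline, indent=2)  # write this string next to the model artifact

# Later, a monitoring job can load the stored baseline and compare against it
restored = json.loads(payload)
print(restored["score"]["mean"])
```

Storing the baseline with the model version keeps drift checks tied to the data that model was actually trained on.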
Summary
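Data drift is a change in production inputs relative to the training data. Detect it by storing training baselines, monitoring input features as well as predictions, and reviewing alerts to separate real business change from pipeline bugs before deciding to retrain.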