Model Monitoring in Production
Introduction
Production model monitoring combines normal service monitoring with ML-specific signals. You need to know whether the API is healthy and whether predictions still make sense.
What Can Go Wrong in Production
An inference service can return HTTP 200 while silently producing poor predictions. Inputs may shift, upstream systems may change units, labels may arrive late, or traffic may move to a population the model never saw.
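One inexpensive guard against the unit-change failure described above is a range check on incoming features. The following is a minimal sketch, not a full validation layer: the function name and the bounds in `EXPECTED_RANGES` are hypothetical and would normally be derived from training data.

```python
# Hypothetical bounds; in practice derive them from the training set.
EXPECTED_RANGES = {"tenure": (0, 120), "monthly_charges": (0.0, 500.0)}

def out_of_range_features(features, expected=EXPECTED_RANGES):
    """Return the names of features that are missing or outside their expected range."""
    problems = []
    for name, (low, high) in expected.items():
        value = features.get(name)
        if value is None or not (low <= value <= high):
            problems.append(name)
    return problems

# A normal request passes the check.
print(out_of_range_features({"tenure": 12, "monthly_charges": 89.9}))
# The same charge sent in cents instead of dollars is flagged.
print(out_of_range_features({"tenure": 12, "monthly_charges": 8990}))
```

The service can still return HTTP 200 in the second case, so the flagged feature names should be counted and alerted on rather than only logged.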
Key Metrics to Monitor
Monitor service signals (availability, latency, error rate, request volume), model signals (model version, input feature distribution, prediction distribution), business metrics, and resource usage (CPU, memory, GPU).
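Tracking an input feature distribution usually means comparing live values against a reference sample. One common choice, used here as an illustration rather than a recommendation from this guide, is the Population Stability Index (PSI). The sketch below uses only the standard library; the `psi` helper, the bin count, and the 0.25 threshold (a widely quoted rule of thumb) are assumptions.

```python
import math

def psi(baseline, live, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all baseline values are equal

    def fractions(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the baseline range into the first or last bin.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Replace empty bins with a small epsilon so the logarithm is defined.
        return [max(c / len(values), eps) for c in counts]

    expected, actual = fractions(baseline), fractions(live)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

baseline = [float(i % 50) for i in range(1000)]         # stand-in for training-time values
shifted = [float(i % 50) + 20.0 for i in range(1000)]   # stand-in for live values after a shift

print(psi(baseline, baseline))        # identical samples score 0.0
print(psi(baseline, shifted) > 0.25)  # the shifted sample exceeds the assumed threshold
```

The same function can be applied to prediction scores to track the prediction distribution.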
Practical Example
Structured inference logs make monitoring possible:
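import json
import time

# One structured record per inference request.
log = {
    "timestamp": time.time(),      # Unix epoch seconds
    "model_version": "churn:17",   # identifies which model produced the prediction
    "latency_ms": 18.4,
    "features": {"tenure": 12, "monthly_charges": 89.9},
    "prediction": 0.73,
}
print(json.dumps(log))  # emit the record as a single JSON line

Abridged output (the timestamp and features fields are omitted here):
{"model_version": "churn:17", "latency_ms": 18.4, "prediction": 0.73}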
Detection Strategy
Create separate alerts for service health and model behavior. A 5xx spike is an API incident; a prediction distribution shift may be a data or business change.
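The split between the two alert types can be sketched as a function that evaluates one time window and tags each alert with a category, so that each category can be routed to a different owner. The `WindowStats` type, the category names, and both thresholds are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class WindowStats:
    requests: int
    server_errors: int       # count of 5xx responses in the window
    mean_prediction: float   # mean model score in the window

def evaluate_alerts(stats, baseline_mean_prediction,
                    max_error_rate=0.01, max_prediction_shift=0.10):
    """Return a list of (category, message) alerts for one time window."""
    alerts = []
    error_rate = stats.server_errors / stats.requests if stats.requests else 0.0
    if error_rate > max_error_rate:
        alerts.append(("service_health", f"5xx rate {error_rate:.1%} is above the limit"))
    shift = abs(stats.mean_prediction - baseline_mean_prediction)
    if shift > max_prediction_shift:
        alerts.append(("model_behavior", f"mean prediction moved by {shift:.2f}"))
    return alerts

# The API is healthy, but the scores have drifted away from the baseline.
window = WindowStats(requests=2000, server_errors=2, mean_prediction=0.58)
for category, message in evaluate_alerts(window, baseline_mean_prediction=0.31):
    print(category, "->", message)
```

Keeping the categories separate means a prediction shift does not page the on-call engineer for the API, and a 5xx spike does not get filed as a data question.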
Common Mistakes
- Monitoring only CPU and memory.
- Not logging model version.
- Collecting feature values without considering privacy.
- Alerting on noisy metrics without baselines.
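To illustrate the last point, an alert can compare the current value with the spread observed in recent history instead of a fixed number. This is a minimal z-score sketch under simple assumptions (roughly stable history, no seasonality); the function name, `min_points`, and `z_threshold` are hypothetical.

```python
import statistics

def is_anomalous(history, current, min_points=7, z_threshold=3.0):
    """Flag `current` only if it falls far outside the spread seen in `history`."""
    if len(history) < min_points:
        return False  # not enough data to define a baseline yet
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_threshold

# Daily mean prediction over the past week (illustrative values).
daily_mean_prediction = [0.30, 0.32, 0.29, 0.31, 0.33, 0.30, 0.31]

print(is_anomalous(daily_mean_prediction, 0.32))  # within normal day-to-day variation
print(is_anomalous(daily_mean_prediction, 0.45))  # well outside the baseline
```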
Quick Checklist
- Is model version logged?
- Are latency and error-rate alerts active?
- Are input and prediction distributions tracked?
- Are labels joined later for quality checks?
- Are dashboards reviewed after each release?
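The label-join item can be sketched as follows. It assumes each logged prediction carries a `request_id` and that labels arrive later with the same key; both field names, the `churned` label, and the 0.5 decision threshold are hypothetical and are not part of the log record shown earlier.

```python
def join_labels(predictions, labels):
    """Match logged predictions with labels that arrived later, keyed by request ID."""
    label_by_id = {row["request_id"]: row["churned"] for row in labels}
    return [
        {**p, "churned": label_by_id[p["request_id"]]}
        for p in predictions
        if p["request_id"] in label_by_id  # skip predictions that have no label yet
    ]

def accuracy(joined, threshold=0.5):
    """Fraction of labeled predictions whose thresholded score matches the label."""
    if not joined:
        return None
    correct = sum((row["prediction"] >= threshold) == row["churned"] for row in joined)
    return correct / len(joined)

predictions = [
    {"request_id": "a1", "model_version": "churn:17", "prediction": 0.73},
    {"request_id": "a2", "model_version": "churn:17", "prediction": 0.12},
    {"request_id": "a3", "model_version": "churn:17", "prediction": 0.64},
    {"request_id": "a4", "model_version": "churn:17", "prediction": 0.40},  # label still pending
]
labels = [
    {"request_id": "a1", "churned": True},
    {"request_id": "a2", "churned": False},
    {"request_id": "a3", "churned": False},
]

joined = join_labels(predictions, labels)
print(len(joined), "of", len(predictions), "predictions have labels")
print("accuracy on labeled rows:", accuracy(joined))
```

Reporting the labeled fraction alongside the quality metric helps show when a number is based on too few labels to trust.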
Summary
Monitoring a production ML model means tracking service health, latency, and errors alongside inputs, predictions, and business outcomes.