CloudsArk

Model Evaluation in MLOps

Learn how model evaluation works in MLOps and why metrics must be automated before deployment.

Introduction

Model evaluation is the quality gate between training and deployment. It should be automated, repeatable, and tied to the real cost of wrong predictions.

Why This Matters

Accuracy alone can be misleading, particularly when classes are imbalanced. Many systems depend more on precision, recall, F1, the decision threshold, per-segment performance, or the business cost of each type of error.
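To make the point concrete, here is a minimal sketch with made-up data: 95 negatives, 5 positives, and a degenerate "model" that always predicts the majority class. The data and the helper names accuracy and recall are illustrative assumptions, written without dependencies for clarity; sklearn.metrics offers equivalent functions.

```python
# Hypothetical imbalanced labels: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 100


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


print(f"accuracy = {accuracy(y_true, y_pred):.2f}")  # accuracy = 0.95
print(f"recall   = {recall(y_true, y_pred):.2f}")    # recall   = 0.00
```

The model scores 95% accuracy while missing every positive case, which is why recall and precision need to be reported next to accuracy.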

Core Concepts

Core evaluation concepts include validation and test sets, the confusion matrix, precision, recall, F1, threshold selection, and comparison against the current production model.
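Threshold selection can be illustrated with a small sweep. The sketch below assumes a binary classifier that outputs a score per example and picks the candidate threshold with the highest F1. The labels, scores, candidate thresholds, and helper names are all hypothetical, and F1 is only one possible selection criterion. In practice the sweep would typically run on a validation set so that the test set stays untouched for the final check.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (tp, fp, fn, tn) when score >= threshold is predicted positive."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Hypothetical validation labels and model scores.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.35, 0.40, 0.55, 0.60, 0.80, 0.20, 0.90]

best = None  # (threshold, f1) of the best candidate seen so far
for threshold in (0.3, 0.5, 0.7):
    tp, fp, fn, tn = confusion_counts(y_true, scores, threshold)
    precision, recall, f1 = precision_recall_f1(tp, fp, fn)
    print(f"threshold={threshold} precision={precision:.2f} "
          f"recall={recall:.2f} f1={f1:.2f}")
    if best is None or f1 > best[1]:
        best = (threshold, f1)

print(f"selected threshold: {best[0]}")
```

On this toy data, lowering the threshold raises recall at the expense of precision, so the choice depends on which error is more expensive for the system.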

Practical Example

A script should emit machine-readable metrics:

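import json

from sklearn.metrics import classification_report, confusion_matrix

# Toy labels; a real run would use held-out test labels and model predictions.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# output_dict=True returns nested dicts keyed by the class label as a string.
report = classification_report(y_true, y_pred, output_dict=True)
# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred).tolist()

# Round so the emitted value is stable across library versions.
metrics = {"f1": round(report["1"]["f1-score"], 4), "confusion_matrix": matrix}
print(json.dumps(metrics))

Output:

{"f1": 0.8, "confusion_matrix": [[2, 0], [1, 2]]}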

How This Fits in a Production Workflow

A CI/CD pipeline can block deployment when a metric falls below its approved minimum or regresses against the current production model. The model registry should store the evaluation report alongside the model artifact.
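A deployment gate of this kind might look like the following sketch. It assumes the evaluation step emits metrics as JSON for both the candidate and the current production model. The metric names, the minimum values, the allowed regression, and the function name gate_failures are all hypothetical placeholders for whatever a team has approved.

```python
import json

# Hypothetical approved minimums and allowed drop versus production.
MIN_METRICS = {"f1": 0.75, "recall": 0.60}
MAX_REGRESSION = 0.01


def gate_failures(candidate, baseline,
                  min_metrics=MIN_METRICS, max_regression=MAX_REGRESSION):
    """Return human-readable reasons to block deployment (empty if none)."""
    failures = []
    for name, floor in min_metrics.items():
        value = candidate[name]
        if value < floor:
            failures.append(
                f"{name}={value:.3f} is below the approved minimum {floor:.3f}")
        drop = baseline[name] - value
        if drop > max_regression:
            failures.append(
                f"{name} regressed by {drop:.3f} against the production model")
    return failures


# In a pipeline these would be read from the stored evaluation reports.
candidate = json.loads('{"f1": 0.80, "recall": 0.67}')
baseline = json.loads('{"f1": 0.78, "recall": 0.70}')

failures = gate_failures(candidate, baseline)
for reason in failures:
    print("BLOCKED:", reason)
# A CI job would then exit non-zero when failures is non-empty,
# for example with sys.exit(1), so that the pipeline stops.
```

Here the candidate clears both minimums but loses recall relative to production, so the gate reports one failure and the pipeline would stop.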

Common Mistakes

  • Evaluating on training data.
  • Optimizing a metric that does not match operational cost.
  • Changing thresholds manually after deployment without tracking.
  • Ignoring latency and resource cost during evaluation.
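The second mistake can be made visible by scoring models with an explicit cost per error type. The sketch below assumes a false negative costs twenty times as much as a false positive; those costs, the confusion counts, and the function name expected_cost are invented for illustration only.

```python
# Hypothetical business costs per error (arbitrary units).
COST_FALSE_POSITIVE = 1.0
COST_FALSE_NEGATIVE = 20.0


def expected_cost(fp, fn, n):
    """Average cost per prediction given error counts over n examples."""
    return (fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE) / n


# Hypothetical confusion counts for two models on the same 1000-example test set.
models = {
    "model_a": {"tp": 40, "fp": 10, "fn": 10, "tn": 940},
    "model_b": {"tp": 47, "fp": 40, "fn": 3, "tn": 910},
}

results = {}
for name, c in models.items():
    n = sum(c.values())
    accuracy = (c["tp"] + c["tn"]) / n
    cost = expected_cost(c["fp"], c["fn"], n)
    results[name] = (accuracy, cost)
    print(f"{name}: accuracy={accuracy:.3f} cost_per_prediction={cost:.3f}")
```

With these numbers model_a has the higher accuracy (0.980 versus 0.957) but model_b has roughly half the cost per prediction (0.100 versus 0.210), so the metric that gets optimized changes which model is chosen.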

Quick Checklist

  • Is the test set separate from training?
  • Are precision, recall, and F1 recorded?
  • Is the production threshold documented?
  • Are metrics compared to the current model?
  • Does CI fail when metrics regress?
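The first checklist item can be partly automated. The sketch below fingerprints each row and reports test rows that also appear in the training data. The row format and the helper names row_fingerprint and overlapping_rows are assumptions, and the check only catches exact duplicates, not near-duplicates or leakage through shared entities or time.

```python
import hashlib
import json


def row_fingerprint(row):
    """Hash a row after serializing it with sorted keys, so key order is ignored."""
    canonical = json.dumps(row, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def overlapping_rows(train_rows, test_rows):
    """Return the test rows whose fingerprint also occurs in the training rows."""
    train_hashes = {row_fingerprint(row) for row in train_rows}
    return [row for row in test_rows if row_fingerprint(row) in train_hashes]


# Hypothetical rows; real data would be loaded from the dataset splits.
train_rows = [{"id": 1, "amount": 12.5}, {"id": 2, "amount": 99.0}]
test_rows = [{"amount": 99.0, "id": 2}, {"id": 3, "amount": 5.0}]

leaked = overlapping_rows(train_rows, test_rows)
print(f"{len(leaked)} test row(s) also appear in the training data")
```

A pipeline could treat a non-empty result as a failed check before any metric is computed.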

Summary

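Model evaluation is the automated quality gate between training and deployment. Evaluate on a held-out test set, record precision, recall, and F1 rather than accuracy alone, document the production threshold, compare against the current production model, and have CI block deployment when metrics regress.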