Accuracy collapses the four confusion-matrix counts (TP, FP, FN, TN) into one number and lies whenever classes are imbalanced: an inspector who passes every dish in a 99%-fresh kitchen scores 99% accuracy while catching no spoilage at all.
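To see the lie in numbers, here is a minimal sketch of the do-nothing inspector. The kitchen size (1000 dishes, 10 spoiled) is an assumption chosen for illustration, not a figure from the text.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical kitchen: 1000 dishes, 10 spoiled (1 = spoiled), so 99% are fresh.
y_true = [1] * 10 + [0] * 990

# A "lazy inspector" that passes every dish without looking.
y_pred = [0] * len(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)  # share of spoiled dishes actually caught

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")  # accuracy=0.99 recall=0.00
```

Accuracy rewards the inspector for the 990 easy cases; recall shows that none of the 10 dangerous ones were caught.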
Precision asks "of the dishes I condemned, how many deserved it?"; recall asks "of the truly spoiled, how many did I catch?" They pull the threshold in opposite directions.
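Written as formulas, the two questions look like this, assuming "spoiled" is the positive class (so TP = spoiled and condemned, FP = fresh but condemned, FN = spoiled but served):

```latex
\[
\text{precision} = \frac{TP}{TP + FP},
\qquad
\text{recall} = \frac{TP}{TP + FN}
\]
```

Raising the threshold condemns fewer dishes, so FP can only shrink while FN can only grow. Recall therefore never rises, and precision usually (though not always) does.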
F1 is their harmonic truce (the harmonic mean of precision and recall, dragged down by whichever is lower); AUC-ROC scores the model across every threshold at once, separating model quality from threshold policy.
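A small sketch of that difference, using the same toy labels and scores as the scikit-learn snippet at the end of this section: F1 depends on which threshold the inspector picks, while AUC-ROC is computed from the raw scores and never sees a threshold.

```python
from sklearn.metrics import f1_score, roc_auc_score

# Toy data (illustrative): 1 = spoiled, scores are the model's suspicion.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [.9, .8, .6, .4, .7, .5, .3, .2, .2, .1]

# AUC-ROC uses the raw scores, so it needs no threshold.
auc = roc_auc_score(y_true, scores)

# F1 needs hard verdicts, so it changes with the threshold.
f1_by_threshold = {}
for t in (0.35, 0.65):
    y_pred = [int(s >= t) for s in scores]
    f1_by_threshold[t] = f1_score(y_true, y_pred)

print(f"AUC-ROC = {auc:.3f}")                # 0.875
for t, f1 in f1_by_threshold.items():
    print(f"threshold={t}: F1 = {f1:.3f}")   # 0.800, then 0.571
```

One model, one AUC, two quite different F1 scores: the first number describes the model, the second describes the model plus a policy.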
Pick the metric from the cost of each mistake: missed spoilage (FN) poisons customers, false alarms (FP) waste food. The matrix is a menu of consequences.
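One way to make "the cost of each mistake" concrete is to price the two errors and pick the cheapest threshold. The sketch below assumes a missed spoiled dish costs ten times as much as a wasted fresh one; `COST_FN`, `COST_FP`, and the threshold grid are hypothetical values for illustration, and the data is the same toy set used elsewhere in this section.

```python
from sklearn.metrics import confusion_matrix

# Toy data (illustrative): 1 = spoiled.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [.9, .8, .6, .4, .7, .5, .3, .2, .2, .1]

# Hypothetical costs, not from the text: a missed spoiled dish (FN) is
# assumed to be 10x as expensive as a wasted fresh one (FP).
COST_FN = 10.0
COST_FP = 1.0

def total_cost(threshold):
    """Total cost of the verdicts produced by one threshold."""
    y_pred = [int(s >= threshold) for s in scores]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return COST_FN * fn + COST_FP * fp

thresholds = [0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85]
costs = {t: total_cost(t) for t in thresholds}
best = min(costs, key=costs.get)

for t, c in costs.items():
    print(f"threshold={t}: cost={c}")
print("cheapest threshold:", best)  # 0.35 under these assumed costs
```

Swap the two costs and the cheapest threshold moves to the strict end of the grid: the model has not changed, only the menu of consequences.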
In the Test Kitchen: shrink the model's separation (the gap between the scores it gives fresh and spoiled dishes) and watch every threshold become a bad compromise. Metrics cannot rescue a weak model.
Don't just read the recipe — taste it. Drag, click and break things below.
Teal dishes are fresh, red ones are spoiled; the model gives each a suspicion score. The inspector's threshold turns scores into verdicts. Slide it left and you catch every spoiled dish but condemn good food (recall ↑, precision ↓); slide it right and the reverse. Shrink the separation to feel why a weak model makes the trade-off brutal.
FIG L.6: CONFUSION MATRIX — ONE THRESHOLD, FOUR FATES. AMBER = FALSE ALARMS, DEEP RED = MISSED SPOILAGE
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]            # 1 = spoiled, 0 = fresh
scores = [.9, .8, .6, .4, .7, .5, .3, .2, .2, .1]  # model's suspicion per dish

for t in (0.35, 0.65):                             # two inspectors, two thresholds
    y_pred = [int(s >= t) for s in scores]         # condemn when score >= threshold
    print(t, confusion_matrix(y_true, y_pred).ravel(),  # order: tn fp fn tp
          precision_score(y_true, y_pred),
          recall_score(y_true, y_pred))
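The Test Kitchen's separation slider can be approximated offline. This is a rough sketch, not the widget's actual model: it assumes suspicion scores are Gaussian with unit variance, that the two class means differ by `separation`, and that 10% of dishes are spoiled. The helper names `simulate` and `best_f1` are made up for this example.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)  # fixed seed so runs are repeatable

def simulate(separation, n_fresh=900, n_spoiled=100):
    """Draw Gaussian suspicion scores whose class means differ by `separation`."""
    fresh = rng.normal(0.0, 1.0, n_fresh)
    spoiled = rng.normal(separation, 1.0, n_spoiled)
    y_true = np.concatenate([np.zeros(n_fresh, dtype=int),
                             np.ones(n_spoiled, dtype=int)])
    scores = np.concatenate([fresh, spoiled])
    return y_true, scores

def best_f1(y_true, scores):
    """Best F1 reachable by any threshold on a coarse 101-point grid."""
    grid = np.linspace(scores.min(), scores.max(), 101)
    return max(f1_score(y_true, (scores >= t).astype(int), zero_division=0)
               for t in grid)

results = {}
for separation in (3.0, 1.0, 0.3):  # strong, middling, weak model
    y_true, scores = simulate(separation)
    results[separation] = (roc_auc_score(y_true, scores),
                           best_f1(y_true, scores))

for separation, (auc, f1) in results.items():
    print(f"separation={separation}: AUC={auc:.3f}, best F1={f1:.3f}")
```

As the separation shrinks, AUC drifts toward 0.5 and even the best available threshold yields a poor F1, which is the same brutality the slider makes you feel.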