P(flag|spoiled) is the test's quality; P(spoiled|flag) is what you actually want. Bayes' theorem is the bridge, and the prior is the toll.
When the prior is tiny, false positives from the huge population of fine jars swamp true positives from the few spoiled ones. Ignoring this is the base rate fallacy.
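A second independent test on the flagged jars takes the first test's posterior as its new prior (24% in the worked example, which assumes a 5% prior, 90% sensitivity and 85% specificity): evidence compounds by updating, not by replacing.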
Doctors, spam filters and fraud models all live or die by this arithmetic; calibration is just Bayes done honestly.
In the Test Kitchen: drag the prior slider down to 1% and watch a 90% test produce mostly amber false alarms.
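The idea that a second test starts from the first test's posterior can be sketched in a few lines. This is a minimal illustration, not the page's own code: the helper name `bayes_update` is hypothetical, and the second test is assumed to have the same 90% sensitivity and 85% specificity as the first and to be conditionally independent of it.

```python
def bayes_update(prior: float, sens: float, spec: float) -> float:
    """Return P(spoiled | flagged) after one positive result (hypothetical helper)."""
    true_flags = prior * sens                # spoiled jars that get flagged
    false_flags = (1 - prior) * (1 - spec)   # fine jars that get flagged anyway
    return true_flags / (true_flags + false_flags)

# First test: start from the 5% base rate.
first = bayes_update(prior=0.05, sens=0.90, spec=0.85)

# Second test (assumed independent, same quality): yesterday's posterior is today's prior.
second = bayes_update(prior=first, sens=0.90, spec=0.85)

print(f"after one flag:  {first:.1%}")   # 24.0%
print(f"after two flags: {second:.1%}")  # 65.5%
```

Under these assumptions, two flags in a row lift the probability from 24% to roughly 65%. If the two tests tend to fail on the same jars, the independence assumption breaks and the second update would overstate the evidence.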
Don't just read the recipe — taste it. Drag, click and break things below.
Set the true chance a customer loves the dish, then serve plates. Each tasting is random — but watch the running average: noisy over the first few plates, glued to the dashed line after a few hundred. That is the law of large numbers, and it is why more data beats louder opinions.
FIG L.1: BERNOULLI TRIALS — THE RUNNING AVERAGE CONVERGES TO THE TRUE PROBABILITY
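For readers who want to reproduce the plate-serving experiment offline, here is a rough simulation sketch. The function name `running_average`, the 70% true chance and the fixed seed are illustrative assumptions, not values taken from the widget.

```python
import random

def running_average(p_true: float, n_plates: int, seed: int = 0) -> list[float]:
    """Serve n_plates Bernoulli trials and return the running share of 'loved it'."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    loved = 0
    averages = []
    for plate in range(1, n_plates + 1):
        loved += rng.random() < p_true   # True counts as 1, False as 0
        averages.append(loved / plate)
    return averages

averages = running_average(p_true=0.7, n_plates=1000)
for n in (5, 50, 1000):
    print(f"after {n:>4} plates: {averages[n - 1]:.3f}")
```

The early values can land far from 0.7, while the value after 1000 plates should sit close to it. Changing `seed` changes the early wobble but not where the average settles.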
Your freshness test is decent, yet when spoiled jars are rare, most jars it flags are actually fine. Slide the prior down and watch the amber false alarms swamp the red true catches. Bayes' theorem just counts: of all flagged jars, what fraction is genuinely spoiled?
FIG L.2: BAYES' THEOREM — POSTERIOR = TRUE FLAGS ÷ ALL FLAGS. RARE EVENTS MAKE GOOD TESTS LOOK BAD
# P(spoiled | flagged) via Bayes' theorem
prior = 0.05 # 5% of jars are spoiled
sens = 0.90 # test catches 90% of spoiled jars
spec = 0.85 # test clears 85% of fine jars
p_flag = prior * sens + (1 - prior) * (1 - spec)
posterior = prior * sens / p_flag
print(f"{posterior:.1%}") # 24.0% — most flags are false alarms!
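The same calculation can be read as plain counting, which makes the effect of the prior easier to see. The sketch below assumes a hypothetical batch of 10,000 jars and a helper named `flagged_counts`; the sensitivity and specificity are the 90% and 85% used above, and the list of priors is chosen only for illustration.

```python
def flagged_counts(prior: float, sens: float, spec: float, n_jars: int = 10_000):
    """Return expected (true flags, false flags) in a batch of n_jars (hypothetical helper)."""
    spoiled = n_jars * prior
    fine = n_jars - spoiled
    true_flags = spoiled * sens        # spoiled jars the test catches
    false_flags = fine * (1 - spec)    # fine jars the test wrongly flags
    return true_flags, false_flags

posteriors = {}
for prior in (0.01, 0.05, 0.20, 0.50):
    true_flags, false_flags = flagged_counts(prior, sens=0.90, spec=0.85)
    posteriors[prior] = true_flags / (true_flags + false_flags)
    print(f"prior {prior:.0%}: {true_flags:.0f} true flags, "
          f"{false_flags:.0f} false flags, posterior {posteriors[prior]:.1%}")
```

With a 1% prior, about 90 true flags sit among roughly 1,485 false ones, so the posterior is only about 5.7%. At a 50% prior the same test gives about 85.7%.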