Distributions: Every Dataset Has a Shape

⏱TIME: 14 min

🍽️YIELD: 1 mental library of shapes (normal, uniform, exponential)

📓CHAPTER: S2E5

The Idea

CONCEPT

Three pots labelled NORMAL (bell), UNIFORM (flat slab), EXPONENTIAL (ski slope). A ladle scoops samples into a histogram that slowly traces each pot's curve. Margin: 'data = scoops from a pot you can't see. Modelling = guessing the pot.'

A distribution is the recipe; your dataset is a finite number of scoops. Small samples lie about the pot — the histogram only converges with volume.
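To put a number on "small samples lie", here is a minimal sketch that measures how far a sample histogram sits from the true curve as the scoop count grows. It assumes the pot is the normal with mean 5.0 and spread 1.2 used elsewhere on this page; the helper name `histogram_error` and the choice of 20 bins are illustrative, not part of the lesson.

```python
import numpy as np

def histogram_error(n_scoops, rng, mu=5.0, sigma=1.2, bins=20):
    """Mean absolute gap between a sample histogram and the true normal density."""
    # Fixed bin edges covering mu +/- 4 sigma, so every sample size is compared on the same grid
    edges = np.linspace(mu - 4 * sigma, mu + 4 * sigma, bins + 1)
    sample = rng.normal(mu, sigma, n_scoops)
    # density=True rescales bar heights so the histogram integrates to 1
    heights, _ = np.histogram(sample, bins=edges, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    true_density = np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.abs(heights - true_density).mean()

rng = np.random.default_rng(0)
errors = {n: histogram_error(n, rng) for n in (10, 500, 50_000)}
for n, err in errors.items():
    print(f"{n:>6} scoops: mean gap to true curve = {err:.4f}")
```

The gap should shrink as the scoop count rises, though it never reaches exactly zero because the bins themselves are a coarse approximation of a smooth curve.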

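Normal shows up wherever many small independent effects add up (measurement noise); exponential where you wait for rare events that arrive at a steady average rate; uniform where every value in a range is equally likely, which is also the usual stand-in when all you know is the bounds.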
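Two of these origin stories can be simulated directly. The sketch below is one possible illustration, with made-up numbers: 50 small nudges per measurement and a 1% chance per tick that the rare event fires. The `skewness` helper is hand-rolled here to keep the example NumPy-only.

```python
import numpy as np

rng = np.random.default_rng(1)

def skewness(x):
    """Sample skewness: about 0 for a symmetric shape, positive for a long right tail."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

# Normal: each measurement is a true value plus 50 small independent nudges
nudges = rng.uniform(-0.1, 0.1, size=(20_000, 50))
measurements = 5.0 + nudges.sum(axis=1)

# Exponential: check once per tick whether a rare event fired (p = 0.01)
# and record how many ticks passed before the first hit
ticks_waited = rng.geometric(p=0.01, size=20_000)

print(f"sum of nudges, skew: {skewness(measurements):.2f}")      # near 0: bell-like
print(f"waiting time, skew:  {skewness(ticks_waited):.2f}")      # strongly right-skewed
print(f"waiting time, std / mean: {ticks_waited.std() / ticks_waited.mean():.2f}")
```

A std-to-mean ratio close to 1 is a quick fingerprint of the exponential shape; the tick-by-tick version is strictly a geometric distribution, which approaches the exponential as the per-tick probability gets small.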

The Central Limit Theorem says that averages of many independent draws from almost any pot (anything with finite variance) come out approximately normal, which is how the normal earns its name even though most raw data is not normal.

Skewed pot? Apply a log or Box-Cox transform before feeding linear models, which assume roughly symmetric noise. Both transforms need strictly positive values.
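A small sketch of the log-transform case, assuming a hypothetical strictly positive feature called `order_values` drawn from a lognormal pot (chosen because its log is exactly normal, so the effect is easy to see). Real data will rarely straighten out this cleanly.

```python
import numpy as np

rng = np.random.default_rng(7)

def skewness(x):
    """Sample skewness: about 0 for a symmetric shape, positive for a long right tail."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

# Hypothetical right-skewed feature: strictly positive with a long right tail
order_values = rng.lognormal(mean=3.0, sigma=0.8, size=20_000)

# The log pulls in the long right tail; it is only defined for values > 0
logged = np.log(order_values)

print(f"skew before log: {skewness(order_values):.2f}")
print(f"skew after log:  {skewness(logged):.2f}")
```

Box-Cox generalises this idea by choosing the power of the transform from the data; SciPy's `scipy.stats.boxcox` is one implementation, left out here to keep the example NumPy-only.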

In the Test Kitchen: 10 scoops look like noise, 500 scoops become the purple curve. That gap is why sample size matters.

⚗️ The Test Kitchen

INTERACTIVE LAB

Don't just read the recipe — taste it. Drag, click and break things below.

EXP 01

Ladle Luck

"every dataset is scoops from some pot"

The pot has a true recipe — the purple curve. Each ladle scoop is one random draw from it. A handful of scoops looks like noise; a few hundred and the histogram becomes the curve. Models assume a shape for this curve, so try all three shapes before trusting one.
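One simple way to "try all three shapes" in code is to fit each by maximum likelihood and compare how probable the scoops are under each fit. This is a sketch of that idea rather than the method the lab itself uses; `log_likelihoods` is a hypothetical helper, and the raw comparison ignores that the shapes have different numbers of parameters.

```python
import numpy as np

def log_likelihoods(x):
    """Total log-likelihood of x under each shape, using maximum-likelihood parameters."""
    n = len(x)
    # Normal: fitted mean and standard deviation
    mu, sigma = x.mean(), x.std()
    normal = -0.5 * n * np.log(2 * np.pi * sigma ** 2) - ((x - mu) ** 2).sum() / (2 * sigma ** 2)
    # Uniform: fitted range is [min, max], so every point has density 1 / (max - min)
    uniform = -n * np.log(x.max() - x.min())
    # Exponential: only defined for non-negative data; fitted scale is the sample mean
    if x.min() >= 0:
        exponential = -n * np.log(x.mean()) - x.sum() / x.mean()
    else:
        exponential = -np.inf
    return {"normal": normal, "uniform": uniform, "exponential": exponential}

rng = np.random.default_rng(3)
pots = {
    "normal": rng.normal(5, 1.2, 500),
    "uniform": rng.uniform(2, 8, 500),
    "exponential": rng.exponential(2.0, 500),
}

best_fit = {}
for name, scoops in pots.items():
    scores = log_likelihoods(scoops)
    best_fit[name] = max(scores, key=scores.get)  # shape with the highest log-likelihood
    print(f"true pot: {name:<12} best-fitting shape: {best_fit[name]}")
```

With a few hundred scoops the right shape should win comfortably; with only a handful, the ranking can flip, which is the same lesson the histogram teaches.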


MEAN μ = 5.0 · SPREAD σ = 1.2

FIG L.4: SAMPLING — THE EMPIRICAL HISTOGRAM CONVERGES TO THE TRUE DENSITY (PURPLE)

The Recipe

CODE

REQUIRED SPICES: normal, uniform, exponential, CLT, sampling

Sampling the three classic pots

import numpy as np

rng = np.random.default_rng(42)
heights = rng.normal(5, 1.2, 10_000)      # bell: errors, heights
waits   = rng.exponential(2.0, 10_000)    # ski slope: time between orders
picks   = rng.uniform(2, 8, 10_000)       # flat: anything equally likely

# CLT party trick: means of ugly samples look normal anyway
means = rng.exponential(2.0, (10_000, 30)).mean(axis=1)
print(means.mean(), means.std())          # ≈ 2.0, ≈ 2/sqrt(30)
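To check that the means really are closer to a bell than the raw waits, one option is to compare skewness before and after averaging. This sketch reuses the same exponential pot with scale 2.0 and groups of 30; the `skewness` helper is an illustrative addition, and theory predicts the skew of an average of n draws shrinks by a factor of sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(42)

def skewness(x):
    """Sample skewness: about 0 for a symmetric shape, positive for a long right tail."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

raw = rng.exponential(2.0, 10_000)                        # ugly, lopsided samples
means = rng.exponential(2.0, (10_000, 30)).mean(axis=1)   # 10,000 averages of 30 draws each

print(f"skew of raw waits:      {skewness(raw):.2f}")     # theory: 2 for an exponential
print(f"skew of means of 30:    {skewness(means):.2f}")   # theory: 2 / sqrt(30)
print(f"std of means vs theory: {means.std():.3f} vs {2.0 / np.sqrt(30):.3f}")
```

The means are still slightly right-skewed at n = 30; the CLT is a statement about the limit, so larger groups get closer to symmetric.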

