Gradient descent is greedy and local: it never sees the map, only the slope where it stands. That is both its power (cheap) and its trap (local minima).
The learning rate is the whole game: too low and the descent crawls and stalls in dips; too high and it overshoots and diverges. Learning-rate schedulers and adaptive optimizers such as Adam exist to tune the flame mid-cook.
Stochastic gradient descent (SGD) trades exact gradients for noisy, cheap ones; the noise itself can help it hop out of sharp local minima.
Every neural network you will ever train is this chickpea, in a million dimensions.
In the Test Kitchen: try all three flame settings and find the start position where LOW gets trapped but MEDIUM escapes.
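To make the SGD takeaway above concrete, here is a minimal sketch of "exact versus noisy, cheap" gradients. Everything in it is an assumption made for illustration: the five `targets`, the batch size of 2, the learning rate and the helper names `full_gradient` and `minibatch_gradient` are all hypothetical. It only shows the noise; on this single-valley loss there is no local minimum to hop out of.

```python
import random

# Hypothetical toy data: the loss is the average of 0.5*(x - t)**2 over
# these targets, so the exact minimum sits at their mean, 1.8.
targets = [1.0, 1.4, 1.8, 2.2, 2.6]

def full_gradient(x):
    """Exact gradient: averages the slope over every data point."""
    return sum(x - t for t in targets) / len(targets)

def minibatch_gradient(x, rng, batch_size=2):
    """Noisy gradient: averages the slope over a small random sample."""
    batch = rng.sample(targets, batch_size)
    return sum(x - t for t in batch) / batch_size

rng = random.Random(0)   # fixed seed so runs are repeatable
lr = 0.05                # assumed flame setting
x_full = x_sgd = -3.0    # same start for both cooks

for step in range(500):
    x_full -= lr * full_gradient(x_full)          # exact but touches all data
    x_sgd -= lr * minibatch_gradient(x_sgd, rng)  # cheaper, but jittery

print(round(x_full, 3))  # settles on the mean of the targets
print(round(x_sgd, 3))   # hovers near the mean, jittered by batch noise
```

The full-gradient run homes in on the mean, while the minibatch run keeps wobbling around it because each step sees only two of the five points. That wobble is the noise the takeaway refers to.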
Don't just read the recipe — taste it. Drag, click and break things below.
The pot wants the bottom of the valley (minimum loss). Each step follows the slope downhill: x ← x − α·f′(x). The flame is α, the learning rate — SIMMER takes steps so small it gets stuck in the pothole, MEDIUM powers through to the true bottom, and FULL BLAST overshoots harder every step until the pot flies off. Click anywhere on the curve to drop the pot there.
FIG V.3: GRADIENT DESCENT — RED ARROW POINTS DOWNHILL (MINUS THE GRADIENT)
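The pothole behaviour described above can be reproduced in a few lines. This is a sketch under stated assumptions, not the curve used in the figure: the `loss` function (a wide bowl with a narrow dip dug near x = 3), the start position and both flame settings are made up for illustration.

```python
import math

def loss(x):
    # Assumed landscape: wide bowl with its bottom at x = 0,
    # plus a narrow pothole dug near x = 3.
    return 0.1 * x**2 - 0.5 * math.exp(-(x - 3.0)**2 / 0.1)

def slope(x):
    # Derivative of loss(x), worked out by hand.
    return 0.2 * x + 10.0 * (x - 3.0) * math.exp(-(x - 3.0)**2 / 0.1)

def descend(x, lr, steps=2000):
    # The update rule from the text: x <- x - lr * f'(x)
    for _ in range(steps):
        x -= lr * slope(x)
    return x

start = 4.0                          # drop the pot to the right of the pothole
x_low = descend(start, lr=0.01)      # tiny steps: rolls into the pothole and stays
x_medium = descend(start, lr=1.0)    # bigger steps: jumps the pothole, reaches the bowl

print(round(x_low, 2), round(loss(x_low), 3))
print(round(x_medium, 2), round(loss(x_medium), 3))
```

With the small step the pot creeps downhill, falls into the dip near x = 3 and never has enough stride to climb out. With the larger step it clears the dip in a single move and then slides to the bottom of the bowl at x = 0, where the loss is lower.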
def f(x):  return 0.15*(x-1.8)**2 + 0.3   # the hill (loss)
def df(x): return 0.30*(x-1.8)            # its slope (gradient)
x, lr = -3.0, 0.5                         # start + flame setting
for step in range(40):
    x -= lr * df(x)                       # walk against the slope
print(round(x, 3), round(f(x), 4))        # → 1.793 0.3, closing in on the minimum at x = 1.8
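The same quadratic makes it easy to see what the flame setting does. Each step multiplies the distance to 1.8 by (1 − 0.30·lr), so the walk converges only while that factor stays between −1 and 1, which means lr below roughly 6.67. The sketch below reuses the slope from the snippet above; the three learning rates and the `run` helper are assumptions chosen only to make the three regimes visible.

```python
def df(x):
    return 0.30 * (x - 1.8)   # slope of the quadratic loss from the snippet above

def run(lr, x=-3.0, steps=40):
    # Plain gradient descent: step against the slope, `steps` times.
    for _ in range(steps):
        x -= lr * df(x)
    return x

# Hypothetical flame settings, one per regime.
for label, lr in [("low", 0.05), ("medium", 0.5), ("too high", 7.0)]:
    print(label, lr, round(run(lr), 3))
```

After 40 steps the low setting is still well short of 1.8, the medium one is within a hundredth of it, and the too-high one has swung hundreds of units away, overshooting further on every step.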