neural / landscape
Loss Landscape Explorer
Click anywhere on the loss surface to place an optimizer. Watch gradient descent navigate valleys, dodge saddle points, and (maybe) find the minimum. Compare SGD, Momentum, and Adam. Adjust the learning rate and see what breaks.
Core Concepts
Loss Surface
A function that maps model parameters to error. Training means finding the lowest point. The topology — valleys, saddle points, plateaus — determines how hard a problem is to optimize.
Gradient Descent
Compute the derivative of loss with respect to parameters. Move downhill. Repeat. The gradient points toward steepest ascent; negate it and you descend. Simple idea, surprisingly effective.
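The loop above is short enough to write out. A minimal sketch on a toy bowl-shaped surface f(x, y) = x² + y² (the function, starting point, and learning rate are illustrative choices, not from the demo):

```python
def grad(x, y):
    # Analytic gradient of f(x, y) = x^2 + y^2: points toward steepest ascent.
    return 2 * x, 2 * y

x, y = 3.0, -2.0   # starting parameters
lr = 0.1           # learning rate

for _ in range(100):
    gx, gy = grad(x, y)
    x -= lr * gx   # negate the gradient to move downhill
    y -= lr * gy

print(x, y)  # approaches the minimum at (0, 0)
```

Each step shrinks the distance to the minimum by a constant factor (1 - 2·lr) here, which is why the iterates converge geometrically.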
Learning Rate
How far you step each iteration. Too high: overshoot minima, oscillate, diverge. Too low: crawl through plateaus, get stuck in local minima. The most important hyperparameter in deep learning.
Local vs Global Minima
Loss surfaces have many valleys. Gradient descent finds A minimum — not necessarily THE minimum. Modern insight: for large networks, most local minima are nearly as good as the global one. The real danger is saddle points.
Optimizers
SGD
mechanism
Vanilla descent. Step = -lr * gradient. No memory, no adaptation. Each step only knows the current slope.
behavior
Struggles with narrow valleys (oscillates across, crawls along). Gets stuck at saddle points. But with the right learning rate, it often generalizes better than fancier methods.
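The SGD update in code, as a sketch (function and variable names are illustrative, not a real library API):

```python
def sgd_step(params, grads, lr=0.01):
    # No memory, no adaptation: each parameter just moves
    # opposite its current gradient, scaled by the learning rate.
    return [p - lr * g for p, g in zip(params, grads)]

params = [1.0, -2.0]
params = sgd_step(params, grads=[0.5, -1.0])
# Each parameter moved a distance lr * |gradient| downhill.
```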
Momentum
mechanism
Accumulates a velocity vector: v = beta * v_prev + lr * gradient. Step follows velocity, not raw gradient. Like a ball rolling downhill with inertia.
behavior
Accelerates through consistent gradients. Dampens oscillations in narrow valleys. Rolls past small local minima. Beta = 0.9 is the standard default and rarely needs tuning.
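A sketch of the velocity form above (v = beta * v_prev + lr * gradient; names are illustrative). With a constant gradient, the velocity grows toward lr / (1 - beta), which is the "acceleration through consistent gradients":

```python
def momentum_step(params, grads, velocity, lr=0.01, beta=0.9):
    # Accumulate velocity, then step along it rather than the raw gradient.
    velocity = [beta * v + lr * g for v, g in zip(velocity, grads)]
    params = [p - v for p, v in zip(params, velocity)]
    return params, velocity

params, velocity = [1.0], [0.0]
for _ in range(3):
    # Repeated identical gradients: velocity compounds each step.
    params, velocity = momentum_step(params, [1.0], velocity)
# velocity[0] grows: 0.01, then 0.019, then 0.0271
```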
Adam
mechanism
Tracks both the first moment (an exponential moving average of gradients) and the second moment (an average of squared gradients). Adapts the learning rate per parameter. Bias-corrected for early steps.
behavior
The default optimizer for most deep learning. Fast convergence, works well out of the box. Each parameter gets its own effective learning rate based on gradient history.
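A single-parameter sketch of the update, using the standard defaults (beta1 = 0.9, beta2 = 0.999, eps = 1e-8); names are illustrative. Note that with a constant gradient the bias-corrected moments cancel and each step is almost exactly lr, regardless of the gradient's scale:

```python
import math

def adam_step(p, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g        # first moment: moving average of gradients
    v = b2 * v + (1 - b2) * g * g    # second moment: moving average of squared gradients
    m_hat = m / (1 - b1 ** t)        # bias correction for early steps (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    p, m, v = adam_step(p, 1.0, m, v, t)
print(p)  # each of the three steps is ~lr, so p is ~0.997
```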
Try These
Rastrigin + SGD at lr=0.01
Watch SGD get trapped in a local minimum. The surface has dozens of nearly identical valleys.
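The surface in question can be written down directly. This is the standard 2-D Rastrigin function from the optimization literature (not taken from the demo's source); the cosine ripple is what creates the grid of near-identical valleys:

```python
import math

def rastrigin(x, y, A=10.0):
    # A smooth bowl plus a cosine ripple: one global minimum at (0, 0),
    # surrounded by a lattice of nearly identical local minima.
    return 2 * A + (x * x - A * math.cos(2 * math.pi * x)) \
                 + (y * y - A * math.cos(2 * math.pi * y))

print(rastrigin(0.0, 0.0))  # 0.0, the global minimum
print(rastrigin(1.0, 1.0))  # ~2.0: a nearby local basin, almost as low
```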
Rosenbrock + Adam at lr=0.005
Adam navigates the narrow curved valley efficiently. SGD oscillates. The banana shape is a classic optimization benchmark.
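The banana shape is the 2-D Rosenbrock function, here with the usual constants a = 1, b = 100 (standard in the literature, not confirmed from the demo's source):

```python
def rosenbrock(x, y, a=1.0, b=100.0):
    # A long, narrow, curved parabolic valley; the minimum sits at (a, a^2).
    return (a - x) ** 2 + b * (y - x * x) ** 2

print(rosenbrock(1.0, 1.0))  # 0.0 at the minimum
print(rosenbrock(0.0, 0.0))  # 1.0: already inside the valley, but far from the minimum
```

The large b makes the valley walls steep while the valley floor stays nearly flat, which is exactly the geometry that makes SGD oscillate.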
Himmelblau from different starts
Four identical minima. Where you start determines which one you reach. This is why initialization matters.
Any surface + lr=0.5
Crank the learning rate and watch the optimizer diverge. Loss goes up, not down. This is what gradient explosion looks like.
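Divergence is easy to reproduce in one dimension. On f(x) = x² the update is x ← x - lr · 2x = (1 - 2·lr) · x, so any lr above 1.0 makes |x| grow every step; the exact threshold depends on the surface's curvature, which is why steeper surfaces in the demo blow up at lower learning rates:

```python
def run(lr, steps=10):
    # Gradient descent on f(x) = x^2, whose gradient is 2x.
    x = 1.0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

print(run(0.5))  # converges: (1 - 2*0.5) = 0, lands on the minimum in one step
print(run(1.5))  # diverges: each step multiplies x by -2, so |x| reaches 1024.0
```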
// neural log
The loss landscape is the hidden terrain every neural network navigates. You never see it during training — just a loss curve going down (hopefully). But the shape of this terrain determines everything: whether the model converges, how fast it learns, whether it generalizes. The playground showed you forward passes. This shows you what happens when the network learns from its mistakes.
— neural