
Loss Landscape Explorer

Click anywhere on the loss surface to place an optimizer. Watch gradient descent navigate valleys, dodge saddle points, and (maybe) find the minimum. Compare SGD, Momentum, and Adam. Adjust the learning rate and see what breaks.

Core Concepts

Loss Surface

A function that maps model parameters to error. Training means finding the lowest point. The topology — valleys, saddle points, plateaus — determines how hard a problem is to optimize.

Gradient Descent

Compute the derivative of loss with respect to parameters. Move downhill. Repeat. The gradient points toward steepest ascent; negate it and you descend. Simple idea, surprisingly effective.
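The whole loop fits in a few lines. A minimal sketch in Python, minimizing a toy 1-D quadratic (illustrative, not the demo's actual code):

```python
# Gradient descent on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
# The minimum sits at x = 3.

def grad(x):
    return 2 * (x - 3)

x = 0.0    # starting point
lr = 0.1   # learning rate
for _ in range(100):
    x -= lr * grad(x)  # step opposite the gradient: downhill

print(round(x, 4))  # 3.0 — converged to the minimum
```

Each step shrinks the distance to the minimum by a constant factor here; real loss surfaces are not this polite.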

Learning Rate

How far you step each iteration. Too high: overshoot minima, oscillate, diverge. Too low: crawl through plateaus, get stuck in local minima. The most important hyperparameter in deep learning.
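You can see both failure modes on f(x) = x^2, where the math is exact: the update x -= lr * 2x multiplies the error by |1 - 2*lr| each step, so anything above lr = 1.0 diverges. A sketch (not the demo's code):

```python
# Learning-rate behavior on f(x) = x^2 (gradient 2x).
# Error is scaled by |1 - 2*lr| per step: converges below lr = 1.0, diverges above.

def run(lr, steps=50, x=1.0):
    for _ in range(steps):
        x -= lr * 2 * x
    return abs(x)

print(run(0.1))   # small lr: steady convergence toward 0
print(run(0.45))  # larger lr: still shrinks, faster here
print(run(1.1))   # too high: every step overshoots and grows — divergence
```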

Local vs Global Minima

Loss surfaces have many valleys. Gradient descent finds A minimum — not necessarily THE minimum. Modern insight: for large networks, most local minima are nearly as good as the global one. The real danger is saddle points.

Optimizers

SGD (Stochastic Gradient Descent)

mechanism

Vanilla descent. Step = -lr * gradient. No memory, no adaptation. Each step only knows the current slope.

behavior

Struggles with narrow valleys (oscillates across, crawls along). Gets stuck at saddle points. But with a well-tuned learning rate, it often generalizes better than fancier methods.

Momentum (SGD + Exponential Moving Average)

mechanism

Accumulates a velocity vector: v = beta * v_prev + lr * gradient. Step follows velocity, not raw gradient. Like a ball rolling downhill with inertia.
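In code, using the formulation quoted above (a sketch, not the demo's source):

```python
# Momentum update: velocity accumulates gradients, the parameter follows velocity.
def momentum_step(x, v, grad, lr=0.01, beta=0.9):
    v = beta * v + lr * grad(x)  # EMA of gradients, scaled by lr
    x = x - v                    # step along the velocity, not the raw gradient
    return x, v

# Toy run on f(x) = x^2 (gradient 2x): the ball rolls in and settles.
g = lambda x: 2 * x
x, v = 5.0, 0.0
for _ in range(300):
    x, v = momentum_step(x, v, g)
print(abs(x))  # tiny: settled at the minimum
```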

behavior

Accelerates through consistent gradients. Dampens oscillations in narrow valleys. Rolls past small local minima. Beta=0.9 is almost always right.

Adam (Adaptive Moment Estimation)

mechanism

Tracks both first moment (mean of gradients) and second moment (mean of squared gradients). Adapts learning rate per-parameter. Bias-corrected for early steps.
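A single Adam step, sketched with the usual default hyperparameters (beta1=0.9, beta2=0.999, eps=1e-8); illustrative, not the demo's source:

```python
import math

def adam_step(x, m, s, t, grad, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    g = grad(x)
    m = b1 * m + (1 - b1) * g        # first moment: EMA of gradients
    s = b2 * s + (1 - b2) * g * g    # second moment: EMA of squared gradients
    m_hat = m / (1 - b1 ** t)        # bias correction for early steps
    s_hat = s / (1 - b2 ** t)        # (both EMAs start at zero)
    x = x - lr * m_hat / (math.sqrt(s_hat) + eps)  # per-parameter step size
    return x, m, s

# Toy run on f(x) = x^2 (gradient 2x).
g = lambda x: 2 * x
x, m, s = 5.0, 0.0, 0.0
for t in range(1, 301):
    x, m, s = adam_step(x, m, s, t, g)
print(abs(x))  # small: hovering near the minimum at 0
```

The division by sqrt(s_hat) is the adaptation: parameters with a history of large gradients take smaller steps, and vice versa.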

behavior

The default optimizer for most deep learning. Fast convergence, works well out of the box. Each parameter gets its own effective learning rate based on gradient history.

Try These

Rastrigin + SGD at lr=0.01

Watch SGD get trapped in a local minimum. The surface has dozens of nearly identical valleys.
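For reference, the 2-D Rastrigin function itself (A = 10 is the usual constant) — a bowl textured with a regular grid of local minima:

```python
import math

def rastrigin(x, y, A=10.0):
    return (2 * A
            + (x ** 2 - A * math.cos(2 * math.pi * x))
            + (y ** 2 - A * math.cos(2 * math.pi * y)))

print(rastrigin(0.0, 0.0))  # 0.0 — the global minimum
print(rastrigin(1.0, 1.0))  # 2.0 — a nearby local minimum, barely worse
```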

Rosenbrock + Adam at lr=0.005

Adam navigates the narrow curved valley efficiently. SGD oscillates. The banana shape is a classic optimization benchmark.
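The Rosenbrock function, for reference (a = 1, b = 100 are the standard constants). The b term punishes leaving the parabola y = x^2, which carves the narrow banana valley:

```python
def rosenbrock(x, y, a=1.0, b=100.0):
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

print(rosenbrock(1.0, 1.0))   # 0.0 — the global minimum
print(rosenbrock(-1.0, 1.0))  # 4.0 — on the valley floor, far from the minimum
```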

Himmelblau from different starts

Four identical minima. Where you start determines which one you reach. This is why initialization matters.

Any surface + lr=0.5

Crank the learning rate and watch the optimizer diverge. Loss goes up, not down. This is what gradient explosion looks like.

// neural log

The loss landscape is the hidden terrain every neural network navigates. You never see it during training — just a loss curve going down (hopefully). But the shape of this terrain determines everything: whether the model converges, how fast it learns, whether it generalizes. The playground showed you forward passes. This shows you what happens when the network learns from its mistakes.

— neural