Chapter 6 · Part II — Linear models

Gradient descent

Finding the bottom of a hill, blindfolded. The algorithm that quietly powers nearly everything else in this book.

Picture being dropped onto a hilly landscape, blindfolded. Your job: find the lowest point. You can't see anything. But you can feel the slope of the ground beneath your feet. So you take a small step in the steepest downhill direction, feel again, take another step. After many steps, you arrive somewhere low.

That algorithm — follow the slope down, repeatedly — is gradient descent. And almost every machine learning model you'll meet for the rest of this book is trained using some variant of it.

1. The optimisation problem

Most of supervised machine learning has the same shape underneath. There is a model with parameters θ\theta. There is a loss function L(θ)L(\theta) that scores how badly the model fits the training data at the current θ\theta — high loss is bad, low loss is good. The job is to find the θ\theta that minimises LL:

θ=argminθL(θ)\theta^* = \arg\min_\theta L(\theta)

That's it. ML is mostly optimisation in a costume.

The previous chapter introduced linear regression, and showed that for that specific loss you can solve the optimisation in closed form — write down the normal equations, solve a linear system, done. But linear regression is the exception. For most models — logistic regression with non-trivial regularisation, neural networks, almost anything interesting — there is no closed-form solution. The loss is a function that has no nice analytical minimum. We need an algorithm that searches for the minimum iteratively.

Gradient descent is the search algorithm of choice. It is not the only option (Newton's method, BFGS, evolutionary search, and others all exist), but it has a property the others lack: it scales. The cost per iteration stays manageable even when θ\theta has billions of components, which is why it powers everything from linear models to LLMs.

2. Following the gradient

The strategy is direct. Start somewhere in parameter space. Look at the slope of the loss at that point. Take a step in the downhill direction. Repeat.

choose initial θ
repeat:
    g ← gradient of L at θ
    θ ← θ - α · g
until converged

The parameter α>0\alpha > 0 is the learning rate — how big a step we take in the downhill direction. We will spend much of this chapter understanding α\alpha.

In one dimension this is easy to visualise. The "gradient" of a 1D function is just its derivative — the slope at a point. Positive slope means the function is going up to the right; negative slope means it is going down. To go down, we step in the direction opposite to the slope.

x
step 0 / 37
x = 2.4000
f(x) = 5.7600
f′(x) = 4.8000
Figure 6.1 — Gradient descent on a 1D function. The dot is the current x; the line through it is the tangent (its slope is the gradient); the horizontal arrow shows the next step, which is −α times the slope.

A few things to notice as you play with it:

The last point is one of the central tensions in gradient-based methods: you only see the part of the surface you happen to be on. Gradient descent is a local algorithm. It knows nothing about the global shape of the loss.

3. The gradient: what it actually is

In one dimension we called it "the slope." In higher dimensions the right word is gradient, and it is a vector.

Suppose L(θ)L(\theta) is a function of a parameter vector θ=(θ1,θ2,,θp)\theta = (\theta_1, \theta_2, \dots, \theta_p). The gradient is the vector of partial derivatives:

L(θ)=[Lθ1,  Lθ2,  ,  Lθp]\nabla L(\theta) = \left[ \frac{\partial L}{\partial \theta_1},\; \frac{\partial L}{\partial \theta_2},\; \dots,\; \frac{\partial L}{\partial \theta_p} \right]

Each component tells you how LL changes when you nudge just that one parameter, holding the others fixed.

Here is the key fact, and it is worth pausing on: the gradient points in the direction of steepest ascent. Among all directions you could walk from the current θ\theta, the gradient direction is the one that increases LL the fastest. So to decrease LL the fastest, walk in the opposite direction: L-\nabla L.

The update rule for gradient descent, in full vector form:

θθαL(θ)\theta \leftarrow \theta - \alpha \nabla L(\theta)

A worked example. Take L(θ)=θ12+θ22L(\theta) = \theta_1^2 + \theta_2^2 — a paraboloid bowl with its minimum at the origin. The partial derivatives:

Lθ1=2θ1,Lθ2=2θ2\frac{\partial L}{\partial \theta_1} = 2\theta_1, \quad \frac{\partial L}{\partial \theta_2} = 2\theta_2

So L(θ)=(2θ1,2θ2)\nabla L(\theta) = (2\theta_1, 2\theta_2). At the point θ=(1,3)\theta = (1, 3) the gradient is (2,6)(2, 6) — it points up and to the right, away from the minimum at the origin. To descend we step in the direction (2,6)(-2, -6), scaled by the learning rate.

Below is the same idea in 2D, animated. Click anywhere on the surface to choose a starting position. Watch the trajectory descend.

Click anywhere on the surface to choose a starting position.
step 0 / 80
θ = (3.500, 1.100)
loss = 7.9025
‖∇L‖ = 8.9723
Figure 6.2 — Gradient descent on a 2D loss surface. Contours are level sets of the loss; the ring marks a minimum. Each dot along the path is one step. Click anywhere to relocate the start.

A few things to play with:

4. Choosing a learning rate

The learning rate α\alpha is the single most consequential hyperparameter in gradient-based optimisation. Set it well and the algorithm converges quickly to a good minimum. Set it badly and it either crawls along forever or explodes to infinity.

The danger of too small:

The danger of too large:

You can see all of this in one widget. Same problem, same starting position, four different learning rates running in parallel:

lr = 0.009 (×0.3)lr = 0.030 (×1)lr = 0.075 (×2.5)lr = 0.150 (×5)
step 0 / 80
Figure 6.3 — Four learning rates on the same problem, same starting position. The smallest crawls; the middle ones converge cleanly; the largest either oscillates or diverges entirely.

The smallest learning rate barely moves. The largest overshoots and (for high enough values) wanders or diverges. There is usually a Goldilocks band of rates that converge reasonably — and within that band, larger rates converge faster, up to the point where overshooting starts to dominate.

In practice, three strategies are common:

  1. Grid search. Try a small list of learning rates — typically {104,103,102,101}\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\} — and pick the best one on a held-out validation set.
  2. Adaptive methods. Use an optimiser like Adam (§7) that adjusts the effective rate per-parameter automatically. This makes the choice of α\alpha much less sensitive.
  3. Schedules. Use a learning-rate schedule that starts large and decreases over training. Common choices: step decay (drop by a factor every kk epochs), cosine annealing, or a brief warmup followed by decay. Standard in deep learning.

5. Batch, stochastic, and mini-batch

So far we have assumed the gradient is computed using the entire dataset. For an ML loss of the form

L(θ)=1ni=1n(xi,yi;θ)L(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(x_i, y_i; \theta)

the full-batch gradient is

L(θ)=1ni=1n(xi,yi;θ).\nabla L(\theta) = \frac{1}{n} \sum_{i=1}^n \nabla \ell(x_i, y_i; \theta).

When nn is small, this is fine. When nn is in the millions? Each gradient evaluation requires touching every training example. A single optimisation step takes seconds; running for thousands of steps takes hours or days.

The fix is stochastic gradient descent (SGD). Instead of computing the gradient over all nn examples, compute it over just one, chosen at random each step:

g=(xi,yi;θ)for a single random i.g = \nabla \ell(x_i, y_i; \theta) \quad \text{for a single random } i.

This single-example gradient is a noisy estimate of the full-batch gradient. It points roughly in the right direction but with a lot of variance from one step to the next. The trajectory is therefore noisy too — it zig-zags toward the minimum rather than gliding smoothly.

But it is cheap. Each step costs the same as evaluating the loss on one example. We can take many noisy steps in the time one full-batch step would take, and on most problems the cumulative progress is faster.

The compromise — and what virtually everyone actually uses — is mini-batch SGD: compute the gradient over a small random subset of the data, typically 32, 64, or 256 examples. Less noisy than pure SGD, vastly cheaper than full-batch, and convenient to vectorise on a GPU.

There is a subtle bonus: the noise in SGD can actually help. Noise lets the algorithm jiggle out of bad local minima and saddle points that vanilla gradient descent would get stuck on. The noise is a feature, not just a side effect.

6. Momentum

There is a particular failure mode of vanilla gradient descent worth seeing in detail.

Imagine a loss surface that is a long, narrow valley — much steeper across the short direction than along the long one (the valley preset in the widget above is exactly this). Vanilla gradient descent oscillates back and forth across the narrow direction at each step, making only slow progress along the long direction. The trajectory looks like a zigzag. You can see this clearly: switch to valley and crank the learning rate toward the upper limit.

The fix is momentum. Instead of stepping in the gradient direction, step in the direction of a running average of recent gradients. The update becomes:

vβv+gv \leftarrow \beta v + g θθαv\theta \leftarrow \theta - \alpha v

where β[0,1)\beta \in [0, 1) is the momentum coefficient. A typical value is β=0.9\beta = 0.9, meaning each step's velocity is 90% inherited from the previous step's velocity plus 100% of the current gradient.

The intuition is in the name. Imagine a ball rolling down the loss surface. Without momentum, the ball's direction is reset at each step to whatever the local gradient says — including the irrelevant oscillating component across the valley. With momentum, the ball builds up speed in directions that are consistent over time. In the valley case the consistent direction is along the valley, not across it: the oscillations cancel out across consecutive steps, while the steady forward component compounds.

Momentum dramatically accelerates convergence on ill-conditioned problems. It has been a default ingredient in deep-learning optimisers for over a decade. The version above is "heavy-ball" momentum; a slightly more sophisticated variant called Nesterov momentum looks one step ahead before computing the gradient and often performs marginally better.

7. Adaptive learning rates

A separate idea: instead of one global learning rate α\alpha, give every parameter its own effective rate, adjusted automatically based on the history of its gradients.

The motivation is that in deep models, different parameters have wildly different gradient magnitudes. Some parameters see huge gradients throughout training; others see tiny ones. A learning rate that is right for the big-gradient parameters explodes the small-gradient ones, and vice versa. A single global α\alpha has to be a poor compromise.

The three families you will meet in code:

AdaGrad keeps a running sum of squared gradients per parameter and scales each parameter's update by the inverse square root of that sum. Parameters that have seen large gradients accumulate large sums and therefore get smaller updates. Elegant in theory; in practice the running sum grows monotonically and eventually the effective rate decays to zero, so AdaGrad eventually freezes.

RMSprop fixes AdaGrad's monotonic-decay problem by replacing the running sum with an exponential moving average of squared gradients. The effective rate adapts to recent history rather than all history, so the optimiser keeps moving indefinitely.

Adam combines RMSprop with momentum, plus a small bias-correction term. It is the default modern choice — robust, requires little tuning, and usually works out of the box. The vast majority of deep-learning papers in the last five years use Adam or one of its close relatives.

We will not implement these in detail here; they are a standard part of every deep-learning library. The point is that when you see torch.optim.Adam in code, you now know what is inside it: per-parameter adaptive rates plus momentum.

8. Where gradient descent struggles

Gradient descent is not magic. A short tour of the common failure modes.

Saddle points. In high dimensions, most points where the gradient is zero are not minima — they are saddles, where the surface curves up in some directions and down in others. The gradient at a saddle is exactly zero, so vanilla gradient descent slows to a crawl, but the loss is not actually minimised. Momentum helps escape saddles by carrying velocity through the flat region; stochastic noise helps too. Try the saddle preset above, starting from near the yy-axis, to see the slowdown.

Plateaus. Regions of the loss surface where the gradient is consistently small but non-zero. Gradient descent makes very slow progress through them. Adam-style adaptive methods handle plateaus better than vanilla GD because they boost the effective rate when gradients are persistently small.

Local minima. For non-convex losses, gradient descent converges to whatever local minimum is closest to where it started. Try the Himmelblau preset above with various starting positions to see this in action. The good news for deep learning: in very high dimensions, the loss surface has so many local minima that most of them have similar values, and the catastrophically bad ones are exponentially rare. Local-minima panic was a 20th-century concern.

Sharp vs flat minima. When gradient descent lands in a narrow, deep crevice in the loss surface, the resulting model often generalises poorly compared to one that landed in a wide, shallow basin at similar loss. Some optimisers (Sharpness-Aware Minimisation, SAM) explicitly seek flat minima. This is an active research area.

9. Complexity

Per-step cost. One gradient evaluation. For an ML loss with nn examples and pp parameters, full-batch gradient descent costs O(np)O(np) per step — every example touches every parameter. Mini-batch SGD with batch size BB costs O(Bp)O(Bp) per step, which is much less when BnB \ll n.

Steps to converge. This depends on:

For well-conditioned convex problems, gradient descent converges in roughly O(log(1/ε))O(\log(1/\varepsilon)) steps to error ε\varepsilon. For ill-conditioned ones it can be much slower — the number of steps scales with the condition number of the problem, which can be enormous. Momentum reduces this to scaling with the square root of the condition number, which is the main reason it matters.

10. Implementing it yourself

Vanilla gradient descent on the valley loss we visualised in §3, in NumPy. Try changing the learning rate or starting point and rerun.

Python · runs in browser
import numpy as np

def loss(theta):
  x, y = theta
  return 0.25 * x**2 + 4 * y**2

def grad(theta):
  x, y = theta
  return np.array([0.5 * x, 8 * y])

# Initialise
theta = np.array([3.5, 1.1])
lr = 0.05
history = [theta.copy()]

# Iterate
for step in range(60):
  g = grad(theta)
  theta = theta - lr * g
  history.append(theta.copy())

  if np.linalg.norm(g) < 1e-4:
      print(f"Converged at step {step + 1}")
      break

print(f"Final theta = {theta}")
print(f"Final loss  = {loss(theta):.6f}")
print(f"Steps taken = {len(history) - 1}")

The pattern is the same every time: evaluate the gradient, update the parameters by αL-\alpha \nabla L, check for convergence, repeat. The sophisticated optimisers in the previous sections all share this skeleton — they differ only in how they compute the step.

11. Problems

Problem 1 — Why does gradient descent slow down near the minimum?

A conceptual question. No code. Think about what the gradient is in terms of the geometry of the loss surface, and what happens to it as you approach a minimum.

Show solution

Because the gradient is the slope. Near a minimum the surface is nearly flat, so the slope is small. The update step is α\alpha times the gradient, so a small gradient produces a small step. This is a feature, not a bug: the algorithm automatically takes large steps when it is far from the goal and small steps when it is close, all without anyone having to tell it explicitly.

A subtle consequence: gradient descent will never reach the minimum in finite time on a smooth convex loss. It approaches asymptotically. In practice we stop when the gradient norm falls below some tolerance ε\varepsilon, which gets us close enough.

Problem 2 — Implement vanilla gradient descent (1D)

Fill in the loop for f(x)=(x3)2+5f(x) = (x - 3)^2 + 5. The derivative is f(x)=2(x3)f'(x) = 2(x - 3), and the minimum is at x=3x = 3 with f(3)=5f(3) = 5. Start at x=0x = 0, use α=0.1\alpha = 0.1, run 30 steps, and print every 5 steps.

Python · runs in browser
# Gradient descent on f(x) = (x - 3)^2 + 5

def f(x):
  return (x - 3)**2 + 5

def df(x):
  return 2 * (x - 3)

x = 0.0
lr = 0.1

for step in range(30):
  # 1. Compute the gradient
  # 2. Update x
  # 3. Print step, x, f(x) every 5 steps
  pass
Show solution
Python · runs in browser
def f(x):
  return (x - 3)**2 + 5

def df(x):
  return 2 * (x - 3)

x = 0.0
lr = 0.1

for step in range(30):
  g = df(x)
  x = x - lr * g
  if step % 5 == 0:
      print(f"step {step:2d}: x = {x:.4f}, f(x) = {f(x):.4f}")

print(f"\nFinal x = {x:.6f}")
print(f"True min at x = 3.0")

The algorithm converges geometrically — each step closes the gap to the minimum by a constant factor of (12α)(1 - 2\alpha).

Problem 3 — Generalise to 2D with NumPy vectors

Implement gradient descent on f(x,y)=x2+4y2f(x, y) = x^2 + 4y^2. Start at θ=(4,1)\theta = (4, 1), use α=0.1\alpha = 0.1, run 40 steps, print every 5 steps.

Python · runs in browser
import numpy as np

def f(theta):
  x, y = theta
  return x**2 + 4 * y**2

def grad(theta):
  x, y = theta
  return np.array([2 * x, 8 * y])

theta = np.array([4.0, 1.0])
lr = 0.1

# Your code: 40 steps of GD. Print step, theta, loss every 5 steps.
Show solution
Python · runs in browser
import numpy as np

def f(theta):
  x, y = theta
  return x**2 + 4 * y**2

def grad(theta):
  x, y = theta
  return np.array([2 * x, 8 * y])

theta = np.array([4.0, 1.0])
lr = 0.1

for step in range(40):
  g = grad(theta)
  theta = theta - lr * g
  if step % 5 == 0:
      print(f"step {step:2d}: theta = {theta}, loss = {f(theta):.4f}")

print(f"\nFinal theta = {theta}")

Notice the trajectory in yy converges almost immediately (gradient magnitude in yy is 8 times that in xx), while xx takes much longer. This is the ill-conditioning we discussed in §6 — and it is exactly what momentum is designed to fix.

Problem 4 — Add momentum

Modify the optimiser from Problem 3 to use heavy-ball momentum with β=0.9\beta = 0.9. The update is:

vβv+g,θθαvv \leftarrow \beta v + g, \qquad \theta \leftarrow \theta - \alpha v

You may need to lower the learning rate compared to Problem 3 — momentum amplifies effective step size.

Python · runs in browser
import numpy as np

def f(theta):
  x, y = theta
  return x**2 + 4 * y**2

def grad(theta):
  x, y = theta
  return np.array([2 * x, 8 * y])

theta = np.array([4.0, 1.0])
lr = 0.05
beta = 0.9
v = np.zeros(2)

# Your code: 40 steps with momentum. Print loss every 5 steps.
Show solution
Python · runs in browser
import numpy as np

def f(theta):
  x, y = theta
  return x**2 + 4 * y**2

def grad(theta):
  x, y = theta
  return np.array([2 * x, 8 * y])

theta = np.array([4.0, 1.0])
lr = 0.05
beta = 0.9
v = np.zeros(2)

for step in range(40):
  g = grad(theta)
  v = beta * v + g
  theta = theta - lr * v
  if step % 5 == 0:
      print(f"step {step:2d}: loss = {f(theta):.6f}")

print(f"\nFinal loss = {f(theta):.6f}")

Compare the loss trajectory to Problem 3. Momentum reaches a similar loss in many fewer steps — at the cost of a small overshoot near the start, as the accumulated velocity carries past the minimum before decaying.

Problem 5 — Putting it together: linear regression from scratch

Use gradient descent to fit linear regression with no closed form — just GD on the MSE loss. The loss is L(β)=1ni(xiβyi)2L(\beta) = \frac{1}{n}\sum_i (x_i^\top \beta - y_i)^2, with gradient L(β)=2nX(Xβy)\nabla L(\beta) = \frac{2}{n} X^\top (X\beta - y). Generate synthetic data with a known relationship, then recover the coefficients.

Python · runs in browser
import numpy as np

# Generate synthetic data
np.random.seed(42)
n = 200
X = np.random.randn(n, 3)
true_beta = np.array([1.5, -2.0, 0.7])
y = X @ true_beta + 0.1 * np.random.randn(n)

# Fit with gradient descent
beta = np.zeros(3)
lr = 0.05

# Your code:
#  - 200 GD steps on MSE loss
#  - Print loss every 20 steps
#  - At the end, print the recovered beta alongside true_beta
Show solution
Python · runs in browser
import numpy as np

np.random.seed(42)
n = 200
X = np.random.randn(n, 3)
true_beta = np.array([1.5, -2.0, 0.7])
y = X @ true_beta + 0.1 * np.random.randn(n)

def loss(beta):
  residual = X @ beta - y
  return np.mean(residual ** 2)

def grad(beta):
  residual = X @ beta - y
  return (2 / n) * X.T @ residual

beta = np.zeros(3)
lr = 0.05

for step in range(200):
  g = grad(beta)
  beta = beta - lr * g
  if step % 20 == 0:
      print(f"step {step:3d}: loss = {loss(beta):.6f}")

print(f"\nTrue beta:      {true_beta}")
print(f"Recovered beta: {beta}")
print(f"Final loss:     {loss(beta):.6f}")

The recovered β\beta should be very close to the true values. The 0.1-magnitude noise in yy sets a floor on how close we can get — beyond a certain point, "better fit" means fitting the noise, not the signal. You have just trained your first ML model from scratch, with no library help beyond NumPy. Every later chapter is a variant of this same shape: define a loss, compute its gradient, run gradient descent.


Next: Chapter 7 — Logistic regression. The first time we use gradient descent to train a model that actually classifies things.