Chapter 11 · Part III — Evaluating models

Cross-validation and tuning

How to measure a model on data it has never seen, and choose its settings without once peeking at the answer.

A model graded on the very data it learned from is a student who saw the exam in advance. Of course it scores well — it has had every chance to memorise the answers. Chapter 9's degree-fifteen polynomial passed through all twenty-five of its training points and predicted nonsense in the gaps between them; its training error was nearly zero and its real error enormous. Training performance is not the performance you will get on data you have not seen, and confusing the two is the most common way to fool yourself in machine learning.

This chapter is about not fooling yourself. We hold data back, measure on it honestly, and find that a single held-out slice is a noisy and wasteful way to do so. The fix — cross-validation — uses every data point as both training and test, never at the same time, and it is the engine behind almost all honest model comparison and hyperparameter tuning. It is also, at last, how you would actually choose Chapter 9's λ.

By the end you will be able to estimate how well a model will really do, and choose its settings, without ever once peeking at the data you have set aside to judge it.

1. The optimism of training error

Every loss we have minimised — squared error, cross-entropy — was computed on the training set. That is necessary: the training set is the only data the optimiser is allowed to see. But it makes the training error a flattering witness. The model was tuned, parameter by parameter, to make exactly that number small. Reporting it as a measure of quality is like grading an exam the student wrote with the answer key open.

The gap has a direction. Training error is optimistically biased — it is, on average, lower than the error the model will make on fresh data, because the model has fit not just the signal in the training set but some of its noise as well. A flexible enough model can drive training error to zero by memorising every point, noise and all, while learning nothing that transfers. Chapter 9 called this overfitting and tamed it with a penalty. Here we confront the prior question: how would you even detect it? You cannot see overfitting in the training error — that just keeps falling. You need a second number, measured somewhere the model never got to cheat.

2. The train/test split

The simplest honest measurement is a train/test split. Before fitting anything, randomly partition the data into two parts — say 70% for training and 30% held back. Fit the model on the training part only. Then evaluate it, exactly once, on the held-back test set. Because the model never saw the test data during fitting, its error there is an unbiased estimate of how it will perform on genuinely new data.

The protocol has one inviolable rule, and the rest of the chapter is essentially consequences of taking it seriously: the model never learns from the test data. Not its parameters, not its hyperparameters, not the choice of which features to use, not the decision to standardise — if any decision is influenced by the test set, the test set is no longer a fair estimate of generalisation. It has become, quietly, part of the training data.

A held-out estimate answers the question training error cannot: as a model grows more flexible, does it keep getting genuinely better, or has it started memorising? The test error will tell you, because memorised noise does not transfer. Watch it happen.

3. The optimism gap, made visible

The widget below fits polynomials of increasing degree to one train/test split of a noisy dataset, and plots both errors as the degree climbs.

Polynomial degree = 7

train MSE = 0.0217

test MSE = 0.0330

sweet spot = degree 7

Figure 11.1 — One train/test split of a noisy dataset. As polynomial degree grows, training error (grey, dashed) falls without limit, but the held-out test error (teal) bottoms out and turns back up. The gap between them is overfitting, measured. The inset shows the fit itself at the chosen degree; drag past the sweet spot and watch it start chasing the noise.

The two curves tell completely different stories. Training error (grey, dashed) falls monotonically — more flexibility always fits the training points better, all the way down toward zero. Test error (teal) does something else entirely: it falls while added flexibility is capturing real structure, bottoms out at a sweet-spot degree, then turns back up as the extra degrees start fitting noise that does not exist in the held-out data. The widening gap between the two curves is overfitting, quantified.

Drag the degree slider and watch the inset. At low degree the fit is too stiff to follow the curve — underfitting, both errors high. Around the sweet spot the fit tracks the true function and the two errors are close. Push past it and the curve starts wriggling through individual training points, the training error inching lower while the test error climbs — the model is now memorising. The lesson is blunt: the model with the lowest training error is almost never the one you want. You want the one with the lowest held-out error, and the only way to know which that is, is to have held data out.

4. One split is not enough

A single train/test split is honest, but it has two flaws.

The first is variance. The estimate depends on exactly which points happened to land in the test set. Get an easy random test set and the model looks great; get a hard one and the same model looks poor. With a small dataset the wobble between splits can be larger than the difference between two models you are trying to compare — at which point the split is measuring luck, not quality.

The second is waste. The test set — often a quarter or a third of your data — never contributes to fitting the model. When data is scarce, and it usually is, holding a third of it permanently in reserve is a real cost: the model you finally ship was trained on less data than you had.

You can see the tension. To reduce variance you want a big test set; to avoid waste you want a big training set; and the same points cannot do both jobs at once — in a single split. The trick of cross-validation is to let every point do both jobs, just never at the same time.

5. k-fold cross-validation

k-fold cross-validation resolves the tension with a rotation. Partition the data into k equal folds. Train on k−1 of them and test on the one left out. Then do it again with a different fold left out, and again, until each of the k folds has served as the test set exactly once. You end up with k held-out scores; average them.

Folds k = 5

fold 1 / 5: accuracy 0.958

CV estimate = 0.925 ± 0.061

single-fold range = 0.833 – 1.000

Figure 11.2 — k-fold cross-validation. The data is partitioned into k folds; each fold is held out as the test set once while the other k−1 train the model, giving k accuracy estimates. Step through the folds to see each one held out in turn. Any single fold lands anywhere in the range below; the average across all k is the stable estimate. Raise k and watch the spread of individual folds widen even as the mean barely moves.

Every point is now tested exactly once and used for training in k−1 of the k fits. Nothing is wasted, and the average of k held-out scores is a far more stable estimate than any single split — averaging cancels much of the which-points-landed-where luck. Step through the folds in the widget: each fold takes its turn as the held-out test band, and its score is one of the dots below. Notice how those dots scatter — any single fold, which is exactly one train/test split, could have landed anywhere in that range — while their average sits stable near the centre. Raise k and the individual folds scatter more (each test set is now tinier and noisier), yet the mean barely moves. That stability is the whole point.

The standard choice is k = 5 or k = 10: enough folds that each training set is nearly the full dataset, few enough that you only pay for five or ten fits. The extreme k = n — every fold a single point — is called leave-one-out cross-validation; it wrings out maximal training data per fit but costs n fits and gives a higher-variance average, so it is reserved for small datasets where every point counts.

6. Choosing hyperparameters with cross-validation

Cross-validation's real job is not grading a finished model — it is choosing between models, or between settings of one model. Chapter 9 left a question open: how do you pick λ, the regularisation strength? Training error is useless for this, because λ = 0 always minimises training error — no penalty, maximum freedom to fit. The honest answer is to cross-validate each candidate λ and pick the one with the best average held-out score.

Folds k = 5Show training error

best λ = 6.1e-4

CV error at best = 0.0509

train error at λ→0 = 0.0258

Figure 11.3 — Choosing the ridge penalty λ for a degree-12 polynomial by k-fold cross-validation — the same overfitting-prone fit as Figure 11.1, now tuned by λ instead of degree. Training error (grey, dashed) sinks toward zero as λ → 0, the polynomial wriggling through the noise. The cross-validated error (teal, with per-fold error bars) traces a clean U: it is worst at λ → 0 (overfitting) and at large λ (over-smoothing), and lowest at a non-trivial λ in between. The training error would drive λ to zero; cross-validation finds the bottom of the U.

The widget returns to Figure 11.1's overfitting-prone fit — a degree-12 polynomial — but instead of tuning its degree it now tunes its ridge penalty λ, exactly the knob Chapter 9 left for us to set. Training error (grey, dashed) does what training error always does: it sinks toward zero as λ → 0, the unregularised polynomial free to wriggle through every point. The cross-validated error (teal, with one error bar per fold showing the spread) traces a clean U. It is worst at λ → 0, where the fit overfits the noise, and worst again at large λ, where so much penalty is applied that the polynomial can no longer follow the real curve; in between it bottoms out at a non-trivial λ. The training error would send λ straight to zero and call it optimal; cross-validation finds the bottom of the U — the λ that genuinely generalises — without ever touching a test set.

This is the standard recipe, called grid search: enumerate candidate hyperparameter values, cross-validate each, keep the best. When there are several hyperparameters at once the grid becomes a multi-dimensional lattice, and random search — sampling combinations at random rather than exhaustively — often finds a good corner of it with far fewer fits.

7. The leakage trap: never tune on the test set

There is a subtle, fatal way to misuse everything above, and almost everyone does it at least once. Suppose you split off a test set, then try a hundred values of λ, evaluate each on the test set, and report the best. The number you report is a lie — and not a small one.

The moment you chose λ by the test score, you tuned to the test set. Across a hundred random tries, one will look good on that particular test set by chance alone, and that is exactly the one you selected. The test score of your "best" model is now optimistically biased, just like training error was, and for the same reason: it influenced a decision. You have a third dataset's worth of overfitting, hidden inside the hyperparameter search.

The fix is a third split. Carve the data into training, validation, and test. Fit parameters on training, choose hyperparameters by the validation score, and touch the test set exactly once, at the very end, to report the final number — after every decision is locked. The cleaner version nests cross-validation: an inner CV loop chooses the hyperparameters, an outer CV loop estimates performance, and the two never contaminate each other. This is nested cross-validation.

The discipline reduces to one sentence: the test set is spent the instant you look at it. Any number that influenced a choice cannot also be an unbiased estimate of the result of that choice. Keep one slice of data genuinely untouched until there are no decisions left to make.

8. Practical concerns

A handful of things that separate cross-validation that works from cross-validation that quietly lies.

Stratify for imbalance. With the rare-positive data of Chapter 10, a random fold might contain almost no positives — or, with bad luck, none at all, making a metric undefined. Stratified k-fold preserves each class's proportion within every fold, so all k estimates are measuring the same problem. For classification, stratified folds should be the default.

Preprocess inside the fold. This is the most common silent leak. Any step that learns from the data — standardising features by their mean and variance, selecting the most predictive features, imputing missing values — must be fit on the training fold only, then applied to the test fold. Fit it on the whole dataset before splitting and the test fold has influenced the preprocessing, which means information has leaked from test to train and your CV score is optimistic. Problem 5 makes this concrete, and the result is alarming.

Respect structure in the data. If rows are grouped — multiple readings from the same patient, several days from the same user — a plain random split can put one patient's data in both train and test, and the model scores well by recognising the patient rather than learning the task. Grouped splitting keeps each group wholly within one fold. The same care applies to time series, where the test set must come strictly after the training set, never shuffled across time.

9. Complexity

Cross-validation's cost is a simple multiplier: k-fold CV trains the model k times, so it costs k× a single fit. With the $O(np)$ -per-step logistic regression of Chapter 7, five-fold CV is five training runs — pennies.

Hyperparameter search multiplies again. Grid search over g candidate values costs $g \cdot k$ fits; a two-dimensional grid of g values each costs $g^2 k$ . Nested CV, with an outer loop of $k_\text{out}$ folds, multiplies the whole inner search by $k_\text{out}$ . The numbers compound quickly — a modest $10 \times 10$ grid under nested 5-fold CV is 2,500 fits — which is why random search, early stopping of unpromising candidates, and warm-starting each fit from the previous one (as the widgets here do across λ) all matter in practice.

The honest estimate is never free. But the alternative — shipping a model whose true performance you have never measured — is the most expensive mistake of all, and it usually shows up in production rather than in the paper.

10. Implementing it yourself

A k-fold split, a cross_val_score, and a grid search for λ — reusing the L2 logistic regression from Chapter 9, in about thirty lines.

Python · runs in browser

import numpy as np

# Three signal features among eight noise features.
rng = np.random.default_rng(0)
n, n_signal, n_noise = 200, 3, 8
p = n_signal + n_noise
X = rng.normal(size=(n, p))
true_w = np.zeros(p); true_w[:n_signal] = [1.5, -1.1, 0.8]
y = ((X @ true_w + rng.normal(0, 0.5, n)) > 0).astype(float)

def sigmoid(z):
  return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def fit_l2(X, y, lam, lr=0.3, steps=400):
  n, p = X.shape
  w = np.zeros(p); b = 0.0
  for _ in range(steps):
      pr = sigmoid(X @ w + b)
      w -= lr * (X.T @ (pr - y) / n + 2 * lam * w)
      b -= lr * np.mean(pr - y)
  return w, b

def accuracy(w, b, X, y):
  return np.mean((sigmoid(X @ w + b) >= 0.5) == y)

def k_fold_indices(n, k, seed=0):
  idx = np.random.default_rng(seed).permutation(n)
  return list(np.array_split(idx, k))   # k contiguous, disjoint folds

def cross_val_score(X, y, lam, k=5, seed=0):
  folds = k_fold_indices(len(y), k, seed)
  scores = []
  for i in range(k):
      test = folds[i]
      train = np.concatenate([folds[j] for j in range(k) if j != i])
      w, b = fit_l2(X[train], y[train], lam)
      scores.append(accuracy(w, b, X[test], y[test]))
  return np.array(scores)

# Grid search: cross-validate each candidate lambda, pick the best.
print(f"{'lambda':>8s}   CV accuracy")
best_lam, best_cv = None, -1.0
for lam in [0.0, 0.01, 0.03, 0.1, 0.3, 1.0]:
  s = cross_val_score(X, y, lam)
  print(f"{lam:8.2f}   {s.mean():.3f} +/- {s.std():.3f}")
  if s.mean() > best_cv:
      best_cv, best_lam = s.mean(), lam
print(f"\nbest lambda by CV: {best_lam}  (CV accuracy {best_cv:.3f})")

That is the whole apparatus. k_fold_indices shuffles once and cuts into contiguous folds; cross_val_score rotates the held-out fold and averages; the grid search is a loop. Everything else in this chapter is discipline about which data you are allowed to feed these functions.

11. Problems

Problem 1 — Why two different kinds of wrong?

A conceptual question. Training error and a single train/test split are both imperfect estimates of how a model will perform on new data — but they are imperfect in different ways. Describe each one's flaw precisely: which is biased and in which direction, and which is high-variance? Then explain what k-fold cross-validation does about each.

Show solution

Training error is biased — optimistically. It is computed on the very data the model was fit to, so it systematically understates the error on new data: the model has absorbed some of the training set's noise as if it were signal. The bias has a definite direction (training error is, on average, too low) and it grows with model flexibility.

A single train/test split is (nearly) unbiased but high-variance. Because the test set was never used in fitting, its error is an honest estimate of generalisation — there is no systematic direction to its error. But it depends entirely on which points happened to fall in the test set; resplit with a different seed and the number jumps around, especially on small data.

k-fold cross-validation attacks the variance. It keeps the unbiasedness of held-out evaluation (every score is measured on data the model did not train on) but averages over k different held-out sets, which cancels much of the which-points-landed-where luck. It also recovers the wasted data: across the k fits, every point trains the model k−1 times. What it cannot fix is bias from looking at the held-out data to make a decision — that is the leakage of Problem 5.

Problem 2 — Implement a k-fold split

Write k_fold_indices(n, k, seed) that returns a list of k arrays of indices. Then verify the two properties every k-fold split must have: the folds are disjoint, and together they cover all n indices exactly once.

Python · runs in browser

import numpy as np

def k_fold_indices(n, k, seed=0):
  # Your code:
  #   1. Make a seeded random permutation of 0..n-1.
  #   2. Split it into k roughly-equal contiguous folds.
  #   3. Return them as a list of arrays.
  pass

# Verify: 23 points, 5 folds.
folds = k_fold_indices(23, 5, seed=0)
print("fold sizes:", [len(f) for f in folds])
allidx = np.concatenate(folds)
print("covers 0..22 exactly once:", sorted(allidx.tolist()) == list(range(23)))

Show solution

Python · runs in browser

import numpy as np

def k_fold_indices(n, k, seed=0):
  idx = np.random.default_rng(seed).permutation(n)
  return list(np.array_split(idx, k))

folds = k_fold_indices(23, 5, seed=0)
print("fold sizes:", [len(f) for f in folds])
allidx = np.concatenate(folds)
print("covers 0..22 exactly once:", sorted(allidx.tolist()) == list(range(23)))

# disjoint check
from itertools import combinations
disjoint = all(set(a).isdisjoint(b) for a, b in combinations(folds, 2))
print("folds pairwise disjoint:", disjoint)

np.array_split handles the case where n is not divisible by k by making the first few folds one element larger — here, sizes [5, 5, 5, 4, 4] summing to 23. Shuffling first is essential: without it, fold boundaries would align with any ordering in the data (all the positives at the end, say), and each fold would be unrepresentative.

Problem 3 — Implement cross_val_score

Using the k-fold split from Problem 2, implement cross_val_score for a logistic-regression model and report the mean and standard deviation of the per-fold accuracies.

Python · runs in browser

import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.normal(size=(n, p))
true_w = np.array([1.4, -1.0, 0.7, 0, 0])
y = ((X @ true_w + rng.normal(0, 0.4, n)) > 0).astype(float)

def sigmoid(z):
  return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def fit(X, y, lr=0.3, steps=400):
  w = np.zeros(X.shape[1]); b = 0.0
  for _ in range(steps):
      pr = sigmoid(X @ w + b)
      w -= lr * (X.T @ (pr - y) / len(y)); b -= lr * np.mean(pr - y)
  return w, b

def k_fold_indices(n, k, seed=0):
  return list(np.array_split(np.random.default_rng(seed).permutation(n), k))

def cross_val_score(X, y, k=5, seed=0):
  # Your code: rotate the held-out fold, fit on the rest, score accuracy
  # on the held-out fold, and return the array of k scores.
  pass

s = cross_val_score(X, y)
print(f"per-fold accuracy: {np.round(s, 3)}")
print(f"CV estimate: {s.mean():.3f} +/- {s.std():.3f}")

Show solution

Python · runs in browser

import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.normal(size=(n, p))
true_w = np.array([1.4, -1.0, 0.7, 0, 0])
y = ((X @ true_w + rng.normal(0, 0.4, n)) > 0).astype(float)

def sigmoid(z):
  return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def fit(X, y, lr=0.3, steps=400):
  w = np.zeros(X.shape[1]); b = 0.0
  for _ in range(steps):
      pr = sigmoid(X @ w + b)
      w -= lr * (X.T @ (pr - y) / len(y)); b -= lr * np.mean(pr - y)
  return w, b

def k_fold_indices(n, k, seed=0):
  return list(np.array_split(np.random.default_rng(seed).permutation(n), k))

def cross_val_score(X, y, k=5, seed=0):
  folds = k_fold_indices(len(y), k, seed)
  scores = []
  for i in range(k):
      test = folds[i]
      train = np.concatenate([folds[j] for j in range(k) if j != i])
      w, b = fit(X[train], y[train])
      scores.append(np.mean((sigmoid(X[test] @ w + b) >= 0.5) == y[test]))
  return np.array(scores)

s = cross_val_score(X, y)
print(f"per-fold accuracy: {np.round(s, 3)}")
print(f"CV estimate: {s.mean():.3f} +/- {s.std():.3f}")

The per-fold accuracies vary by several points around their mean — that spread is the variance a single split would have been at the mercy of. The mean is the estimate you report; the standard deviation tells you how much to trust it.

Problem 4 — Tune λ, and watch the test set lie

Choose λ two ways on the same data: honestly, by cross-validation on a training set; and dishonestly, by picking whichever λ scores best on the test set. Then reveal both models' true performance on a fresh hold-out. The dishonest estimate should look better than it deserves.

Python · runs in browser

import numpy as np

rng = np.random.default_rng(2)
n, p = 160, 15          # 15 features, only 3 carry signal
X = rng.normal(size=(n, p))
true_w = np.zeros(p); true_w[:3] = [1.6, -1.2, 0.9]
y = ((X @ true_w + rng.normal(0, 0.5, n)) > 0).astype(float)

def sigmoid(z):
  return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def fit_l2(X, y, lam, lr=0.3, steps=400):
  w = np.zeros(X.shape[1]); b = 0.0
  for _ in range(steps):
      pr = sigmoid(X @ w + b)
      w -= lr * (X.T @ (pr - y) / len(y) + 2 * lam * w); b -= lr * np.mean(pr - y)
  return w, b

def acc(w, b, X, y):
  return np.mean((sigmoid(X @ w + b) >= 0.5) == y)

def k_fold_indices(m, k, seed=0):
  return list(np.array_split(np.random.default_rng(seed).permutation(m), k))

# Split off a final test set we will only touch at the very end.
perm = np.random.default_rng(0).permutation(n)
test = perm[:50]; pool = perm[50:]
Xp, yp, Xt, yt = X[pool], y[pool], X[test], y[test]
lambdas = [0.0, 0.01, 0.03, 0.1, 0.3, 1.0]

# Your code:
#   A) HONEST: choose lambda by 5-fold CV on (Xp, yp). Report its test acc.
#   B) LEAKY:  choose lambda by whichever maximises acc on (Xt, yt) directly.
#              Report that same test acc. Compare.

Show solution

Python · runs in browser

import numpy as np

rng = np.random.default_rng(2)
n, p = 160, 15
X = rng.normal(size=(n, p))
true_w = np.zeros(p); true_w[:3] = [1.6, -1.2, 0.9]
y = ((X @ true_w + rng.normal(0, 0.5, n)) > 0).astype(float)

def sigmoid(z):
  return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def fit_l2(X, y, lam, lr=0.3, steps=400):
  w = np.zeros(X.shape[1]); b = 0.0
  for _ in range(steps):
      pr = sigmoid(X @ w + b)
      w -= lr * (X.T @ (pr - y) / len(y) + 2 * lam * w); b -= lr * np.mean(pr - y)
  return w, b

def acc(w, b, X, y):
  return np.mean((sigmoid(X @ w + b) >= 0.5) == y)

def k_fold_indices(m, k, seed=0):
  return list(np.array_split(np.random.default_rng(seed).permutation(m), k))

perm = np.random.default_rng(0).permutation(n)
test = perm[:50]; pool = perm[50:]
Xp, yp, Xt, yt = X[pool], y[pool], X[test], y[test]
lambdas = [0.0, 0.01, 0.03, 0.1, 0.3, 1.0]

# A) HONEST — choose lambda by CV on the pool, never touching the test set.
def cv(X, y, lam, k=5):
  folds = k_fold_indices(len(y), k, 0); s = []
  for i in range(k):
      te = folds[i]; tr = np.concatenate([folds[j] for j in range(k) if j != i])
      w, b = fit_l2(X[tr], y[tr], lam); s.append(acc(w, b, X[te], y[te]))
  return np.mean(s)

cv_scores = {lam: cv(Xp, yp, lam) for lam in lambdas}
lam_honest = max(cv_scores, key=cv_scores.get)
w, b = fit_l2(Xp, yp, lam_honest)
print(f"HONEST: lambda={lam_honest} chosen by CV (CV acc {cv_scores[lam_honest]:.3f}); true test acc = {acc(w, b, Xt, yt):.3f}")

# B) LEAKY — choose lambda by the test set itself, then report that score.
test_scores = {}
for lam in lambdas:
  w, b = fit_l2(Xp, yp, lam); test_scores[lam] = acc(w, b, Xt, yt)
lam_leaky = max(test_scores, key=test_scores.get)
print(f"LEAKY:  lambda={lam_leaky} chosen on the test set; reported test acc = {test_scores[lam_leaky]:.3f}")
print(f"\nThe leaky number is the best of {len(lambdas)} tries on one 50-point test set —")
print("inflated by chance. The honest number is what you'd actually get in production.")

The leaky procedure reports the maximum test score over all the λ it tried, so it cannot help but equal or beat the honest pick on that specific test set — it selected for exactly that. The gap between them is pure optimism, manufactured by letting the test set influence a choice. With more candidate values or a smaller test set the inflation grows. The honest pick used cross-validation entirely within the pool and saved the test set for a single, final, untainted measurement.

Problem 5 — Leakage: 77% accuracy on pure noise

The subtlest leak is preprocessing on the whole dataset before splitting. Here is the alarming version: a dataset of pure noise with random labels — there is nothing to learn, so honest accuracy must be about 50%. Select the most label-correlated features on the full dataset first, then cross-validate, and watch CV accuracy sail past chance. Then do the selection inside each fold and watch it collapse back to 50%.

Python · runs in browser

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 500                 # 500 pure-noise features
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, n).astype(float)   # labels unrelated to X

def sigmoid(z):
  return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def fit(X, y, lr=0.3, steps=300):
  w = np.zeros(X.shape[1]); b = 0.0
  for _ in range(steps):
      pr = sigmoid(X @ w + b)
      w -= lr * (X.T @ (pr - y) / len(y)); b -= lr * np.mean(pr - y)
  return w, b

def acc(w, b, X, y):
  return np.mean((sigmoid(X @ w + b) >= 0.5) == y)

def k_fold_indices(m, k, seed=0):
  return list(np.array_split(np.random.default_rng(seed).permutation(m), k))

def top_features(X, y, m=15):
  corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
  return np.argsort(-corr)[:m]

# Your code:
#   WRONG: pick top_features(X, y) on the WHOLE dataset, then 5-fold CV
#          using only those columns.
#   RIGHT: inside each fold, pick top_features on the TRAINING part only,
#          fit, and score the test part with those same columns.
#   Print both mean CV accuracies.

Show solution

Python · runs in browser

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, n).astype(float)

def sigmoid(z):
  return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def fit(X, y, lr=0.3, steps=300):
  w = np.zeros(X.shape[1]); b = 0.0
  for _ in range(steps):
      pr = sigmoid(X @ w + b)
      w -= lr * (X.T @ (pr - y) / len(y)); b -= lr * np.mean(pr - y)
  return w, b

def acc(w, b, X, y):
  return np.mean((sigmoid(X @ w + b) >= 0.5) == y)

def k_fold_indices(m, k, seed=0):
  return list(np.array_split(np.random.default_rng(seed).permutation(m), k))

def top_features(X, y, m=15):
  corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
  return np.argsort(-corr)[:m]

folds = k_fold_indices(n, 5, 0)

# WRONG: feature selection on the whole dataset, before CV.
cols = top_features(X, y, 15)
leak = []
for i in range(5):
  te = folds[i]; tr = np.concatenate([folds[j] for j in range(5) if j != i])
  w, b = fit(X[tr][:, cols], y[tr]); leak.append(acc(w, b, X[te][:, cols], y[te]))

# RIGHT: feature selection inside each fold, on the training part only.
honest = []
for i in range(5):
  te = folds[i]; tr = np.concatenate([folds[j] for j in range(5) if j != i])
  cols_tr = top_features(X[tr], y[tr], 15)
  w, b = fit(X[tr][:, cols_tr], y[tr]); honest.append(acc(w, b, X[te][:, cols_tr], y[te]))

print(f"select on FULL data, then CV:  {np.mean(leak):.3f}   <- pure noise!")
print(f"select INSIDE each fold:       {np.mean(honest):.3f}   <- the truth (~0.5)")

There is no signal anywhere in this data — the labels are random coin flips. Yet selecting features on the full dataset reports about 77% accuracy, because among 500 noise features a handful will, by pure chance, correlate with the labels across all 100 points — including the points that later sit in each test fold. Choosing those features lets information leak from test to train, and the model "predicts" the very coincidence it was handed. Move the selection inside each fold, where it only ever sees training labels, and the illusion vanishes: accuracy falls to chance, exactly as it must. Any data-dependent step done before the split is a leak; the rule is that the test fold must be untouched by every choice, preprocessing included.

Next: Chapter 12 — Decision trees. So far every model has drawn a straight line, or a straight line in some transformed space; a decision tree carves the input space into rectangles by asking one yes-or-no question at a time — and the cross-validation you just learned is exactly the tool to catch the vivid new way it can overfit.