Lab 08: Kernel Approximation Methods — Tips & Reference Guide
Background: The Kernel Scalability Problem
A kernel function \(K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) measures similarity between two points. The most popular is the Gaussian (RBF) kernel:
For a dataset of \(n\) points in \(\mathbb{R}^d\):
| Resource | Cost |
|---|---|
| Memory (Gram matrix) | \(O(n^2)\) floats |
| Kernel matrix construction | \(O(n^2 d)\) |
| Kernel SVM training | \(O(n^2)\) to \(O(n^3)\) |
At \(n = 100{,}000\) the Gram matrix alone requires ~80 GB of RAM. This is the "kernel wall".
Exercise 1: Exact RBF Kernel
Vectorised distance computation
The naive way to build the Gram matrix is a Python double loop — fine for n=10, catastrophic for n=10,000. The key insight is to expand the squared distance:
This lets you compute all \(n^2\) pairwise distances at once with three numpy operations: row norms, row norms transposed, and a matrix product.
Implementation hints
TODO 1 — rbf_kernel(x, y, sigma)
Compute the difference vector diff = x - y, then get its squared length as np.dot(diff, diff) — this is \(\|x-y\|^2\). Plug that directly into the formula:
The result is a Python float. Note that rbf_kernel(x, x, sigma) must return exactly 1.0 since the distance is zero.
TODO 2 — rbf_kernel_matrix(X, sigma)
A Python double loop (for i ... for j ...) works but is far too slow for large \(n\). Use the squared-distance identity instead:
In numpy this translates to three steps:
-
Row norms: compute a shape
This squares every element of(n,)vector where entry \(i\) is \(\|x_i\|^2\).Xand then sums alongaxis=1(across columns), leaving one value per row — exactly \(\sum_j X_{ij}^2 = \|x_i\|^2\). -
Distance matrix: combine with broadcasting and a matrix product:
sq_norms[:, None]is shape(n, 1)andsq_norms[None, :]is(1, n)— numpy broadcasts them into an(n, n)matrix.X @ X.Tis the Gram matrix of inner products, also(n, n). -
Clamp and exponentiate: floating point arithmetic can produce small negative values (e.g.
-1e-14) on the diagonal. Clamp first:D2 = np.maximum(D2, 0.0), thenK = np.exp(-D2 / (2 * sigma**2)).
For the memory warning, use warnings.warn("...", ResourceWarning, stacklevel=2) so the warning points at the caller's line, not inside the function.
Exercise 2: Random Fourier Features
Theory (Rahimi & Recht, NeurIPS 2007)
By Bochner's theorem, any shift-invariant kernel \(K(x, y) = k(x - y)\) is the Fourier transform of a non-negative spectral measure \(p(\omega)\):
For the RBF kernel, the spectral density is \(p(\omega) = \mathcal{N}(0,\, I/\sigma^2)\).
Drawing \(D\) frequencies \(\omega_j \sim p(\omega)\) and phases \(b_j \sim \text{Uniform}[0, 2\pi]\), the feature map is:
This satisfies \(\mathbb{E}[z(x)^\top z(y)] = K(x, y)\) — a dot product replaces an expensive kernel evaluation. The approximation error converges as \(O(1/\sqrt{D})\).
Memory comparison
| Method | Matrix shape | Memory |
|---|---|---|
| Exact | \(n \times n\) | \(O(n^2)\) |
| RFF feature map | \(n \times D\) | \(O(nD)\) |
For \(n = 10{,}000\) and \(D = 1{,}000\): exact = 800 MB, RFF = 80 MB.
Implementation hints
TODO 3 — sample_rff_weights(D, d, sigma, rng)
You need to return two arrays:
omega, shape(D, d): each of the \(D\) rows is a frequency vector drawn from \(\mathcal{N}(0, I/\sigma^2)\). Start withrng.standard_normal((D, d))which gives you \(\mathcal{N}(0, 1)\) entries, then divide the whole matrix bysigmato rescale the variance. Dividing bysigma(notsigma**2) is correct because the standard deviation of \(\mathcal{N}(0, 1/\sigma^2)\) is \(1/\sigma\).b, shape(D,): each phase shift is drawn uniformly from \([0, 2\pi]\). Userng.uniform(0, 2 * np.pi, size=D).
Return them as a tuple (omega, b).
TODO 4 — rff_features(X, omega, b)
Build the feature matrix \(Z\) of shape (n, D) in three sub-steps:
- Compute the projection
X @ omega.T—Xis(n, d)andomega.Tis(d, D), so the result is(n, D). Entry[i, j]is \(\omega_j^\top x_i\). - Add
b(shape(D,)). NumPy broadcasts it across allnrows, adding \(b_j\) to every entry in columnj. The result is still(n, D). - Apply
np.coselement-wise, then multiply by \(\sqrt{2/D}\). GetDfromomega.shape[0].
The \(\sqrt{2/D}\) factor is essential — without it the dot product \(Z_X Z_Y^\top\) would overestimate the kernel by a factor of \(D/2\).
TODO 5 — rff_kernel_approx(X, Y, omega, b)
Call rff_features twice — once for X and once for Y — using the same omega and b both times. This is important: both datasets must be projected into the same random feature space for the dot product to approximate \(K(X, Y)\).
Z_X = rff_features(X, omega, b)→ shape(n, D)Z_Y = rff_features(Y, omega, b)→ shape(m, D)- Return
Z_X @ Z_Y.T→ shape(n, m), where entry[i, j]approximates \(K(x_i, y_j)\).
Exercise 3: Orthogonal Random Features
Theory (Yu et al., ICML 2016)
With independent RFF samples, two frequency vectors \(\omega_i, \omega_j\) may nearly align, wasting representational capacity. ORF forces the weight rows to be mutually orthogonal, reducing approximation variance without changing the mean.
The construction works in blocks of size \(d\):
- Draw \(G \in \mathbb{R}^{d \times d}\) with i.i.d. \(\mathcal{N}(0,1)\) entries.
- QR-decompose \(G = QR\). The rows of \(Q^\top\) are orthonormal.
- Orthonormal rows have unit norm, but we need the correct \(\chi_d\)-distributed norms. Fix this by multiplying each row by an independent \(\|g_j\|\) where \(g_j \sim \mathcal{N}(0, I_d)\).
- Divide by \(\sigma\).
- Stack \(\lceil D/d \rceil\) such blocks and trim to \(D\) rows.
Result: ORF has the same expected value as RFF but strictly lower variance.
Implementation hints
TODO 6 — sample_orf_weights(D, d, sigma, rng)
The goal is to build omega of shape (D, d) where the rows are orthogonal to each other (unlike RFF where they are independent). Because you can only have at most d mutually orthogonal vectors in \(\mathbb{R}^d\), you build the matrix in blocks of d rows, each block independently orthogonal within itself.
Step 1 — decide how many blocks you need.
If D=10 and d=4, you need 3 blocks (3×4=12 rows, trimmed to 10 later).
Step 2 — build one block (repeat this in a loop n_blocks times).
2a. Draw a square Gaussian matrix:
2b. QR-decompose it:
After the transpose, each row ofQ is a unit vector and all rows are mutually perpendicular. This is our set of orthogonal directions.
2c. Rescale rows to have the right length distribution. Orthonormal rows all have norm 1, but in RFF each frequency row has a random norm drawn from the \(\chi_d\) distribution (the distribution of \(\|g\|\) where \(g \sim \mathcal{N}(0, I_d)\)). To match this, draw a fresh Gaussian matrix and take its row norms:
S = rng.standard_normal((d, d)) # independent fresh draw
norms = np.linalg.norm(S, axis=1) # shape (d,) — one chi_d sample per row
Q = Q * norms[:, None] # scale each row; [:, None] makes norms a column vector
norms[:, None] has shape (d, 1), so multiplying by Q (shape (d, d)) scales row i of Q by norms[i].
2d. Append to the list:
Step 3 — assemble, trim, and scale.
omega = np.vstack(blocks) # shape (n_blocks * d, d)
omega = omega[:D] # trim to exactly (D, d)
omega = omega / sigma # rescale to N(0, I/σ²)
Step 4 — sample the phase shifts (identical to RFF):
Return (omega, b).
TODO 7 — orf_features(X, omega, b)
The feature formula is exactly the same as RFF. Call rff_features(X, omega, b) and return the result. The lower variance comes from the orthogonal omega, not from a different formula.
Exercise 4: Nyström Approximation
Theory
Given \(m \ll n\) landmark points \(\{u_1, \ldots, u_m\}\), define:
- \(K_{mm} \in \mathbb{R}^{m \times m}\): kernel matrix between landmarks
- \(K_{nm} \in \mathbb{R}^{n \times m}\): kernel between all \(n\) points and the \(m\) landmarks
The Nyström approximation:
Why it works: If the kernel's eigenvalues decay quickly (common with smooth kernels), \(m\) landmarks capture most of the variance. Unlike RFF/ORF, Nyström adapts to the actual data geometry.
Stable computation of \(K_{mm}^{-1/2}\)
Direct inversion is numerically fragile. Instead, eigendecompose \(K_{mm} = V \Lambda V^\top\) and compute:
Use scipy.linalg.eigh (not np.linalg.eig) — it exploits symmetry and returns real eigenvalues in ascending order.
Implementation hints
TODO 8 — select_landmarks(X, m, strategy, rng)
Get n = X.shape[0] first, then branch on strategy:
"random": pickmunique row indices withrng.choice(n, m, replace=False), then returnX[idx]. Thereplace=Falseensures no row is selected twice."stride": pickmevenly spaced indices withnp.linspace(0, n-1, m, dtype=int), then returnX[idx]. This spreads landmarks uniformly regardless of how the data is ordered.- Anything else:
raise ValueError(f"Unknown strategy '{strategy}'...").
TODO 9 — nystrom_features(X, landmarks, sigma)
The goal is to return Z = K_nm @ K_mm^{-1/2} with shape (n, m). Here is each step with the exact numpy calls:
Step 1 — landmark kernel matrix, shape (m, m):
Step 2 — cross kernel matrix between X and landmarks, shape (n, m):
Use the same distance identity as TODO 2, but now between two different matrices:
sq_X = (X ** 2).sum(axis=1) # shape (n,)
sq_L = (landmarks ** 2).sum(axis=1) # shape (m,)
D2 = sq_X[:, None] + sq_L[None, :] - 2 * (X @ landmarks.T) # shape (n, m)
K_nm = np.exp(-np.maximum(D2, 0.0) / (2 * sigma ** 2))
X @ landmarks.T is (n, d) @ (d, m) = (n, m) — all inner products at once.
Step 3 — eigendecompose K_mm:
eigh is for symmetric matrices and always returns real eigenvalues in ascending order.
Step 4 — clip eigenvalues and form K_mm^{-1/2}:
lam = np.maximum(lam, 1e-10) # avoid division by zero
K_mm_inv_sqrt = V @ np.diag(1.0 / np.sqrt(lam)) @ V.T # shape (m, m)
np.diag(1.0 / np.sqrt(lam)) builds a diagonal matrix from the reciprocal square roots of the eigenvalues.
Step 5 — return the feature matrix:
TODO 10 — nystrom_kernel_approx(X, Y, landmarks, sigma)
Identical pattern to rff_kernel_approx (TODO 5): compute Nyström features for both X and Y using the same landmarks, then return their dot product:
Z_X = nystrom_features(X, landmarks, sigma)→ shape(n, m)Z_Y = nystrom_features(Y, landmarks, sigma)→ shape(m_pts, m)- Return
Z_X @ Z_Y.T→ shape(n, m_pts)
Using the same landmarks for both is essential — it puts X and Y into the same feature space so the dot product approximates \(K(X, Y)\).
Exercise 5: Kernel Ridge Regression
The dual formulation
Kernel ridge regression solves in the dual space of coefficients \(\alpha \in \mathbb{R}^n\):
Predictions for new points: \(\hat{y}(x^*) = k_*^\top \alpha^*\) where \(k_* = [K(x^*, x_i)]_i\).
The same solver works whether K is exact or approximate — that's the power of the dual formulation.
Why not invert the matrix?
np.linalg.solve(A, b) solves \(Ax = b\) without computing \(A^{-1}\) explicitly. It is faster (\(O(n^3/3)\) vs \(O(n^3)\)) and more numerically stable. Always prefer solve over inv @ b.
Implementation hints
TODO 11 — kernel_ridge_fit(K_train, y, lam)
You need to find alpha such that (K_train + lam * I) @ alpha = y. Two steps:
- Build the regularised matrix — get
n = K_train.shape[0], then addlamto every diagonal entry: - Solve the linear system with
np.linalg.solve(A, y). This findsalphawithout explicitly invertingA, which is both faster and numerically more stable thannp.linalg.inv(A) @ y.
Return alpha, shape (n,).
TODO 12 — kernel_ridge_predict(K_test_train, alpha)
Each test prediction is a weighted sum over the training kernel values:
Written as a matrix-vector product this is K_test_train @ alpha. K_test_train has shape (n_test, n_train) and alpha has shape (n_train,), so the result has shape (n_test,) — one prediction per test point.
TODO 13 — approximation_error(K_exact, K_approx)
The Frobenius norm \(\|M\|_F = \sqrt{\sum_{i,j} M_{ij}^2}\) — use np.linalg.norm(M, "fro"). Three lines:
numerator = np.linalg.norm(K_exact - K_approx, "fro")— how far apart the matrices are.denominator = np.linalg.norm(K_exact, "fro")— the scale of the exact matrix.return float(numerator / denominator)— the ratio, as a Python float.
A result of 0.0 means perfect approximation; 1.0 means the approximation is as bad as predicting all zeros; anything below 0.1 is generally considered good.
Exercise 6: Benchmarking
What to measure
For each method and each n, you need:
- Time: record
time.perf_counter()before and after building the kernel/feature matrix. - Memory: record
psutil.Process().memory_info().rssbefore and after, in MB. RSS can be noisy — take the max of the delta and the matrix's.nbytes / 1024**2.
import psutil
mem_before = psutil.Process().memory_info().rss / 1024**2
# ... build matrix ...
mem_after = psutil.Process().memory_info().rss / 1024**2
Implementation hints
TODO 14 — benchmark_methods(n_values, D_values, d, sigma)
The outer loop pairs each dataset size with its feature budget:
rng = np.random.default_rng(0)
results = []
for n, D in zip(n_values, D_values):
X = rng.standard_normal((n, d))
...
For each method, the timing and memory pattern is always the same three lines:
mem_before = psutil.Process().memory_info().rss / 1024**2
t0 = time.perf_counter()
# ... build kernel matrix or feature matrix Z here ...
time_s = time.perf_counter() - t0
mem_after = psutil.Process().memory_info().rss / 1024**2
Then append a dict to results:
results.append({
"method": "RFF", # or "Exact", "ORF", "Nyström"
"n": n,
"D": D,
"time_s": time_s,
"memory_mb": max(mem_after - mem_before, Z.nbytes / 1024**2),
"approx_error": approximation_error(K_exact, Z @ Z.T) if K_exact is not None else None,
})
The max(...) for memory handles cases where RSS doesn't increase (already-allocated memory reuse); .nbytes / 1024**2 gives the theoretical minimum.
Run the four methods in this order for each n:
- Exact (only if
n <= 3_000): callrbf_kernel_matrix(X, sigma). Setapprox_error = 0.0and keepK_exactfor the others. Ifn > 3_000, setK_exact = None. - RFF:
sample_rff_weights(D, d, sigma, rng)→rff_features(X, omega, b). - ORF:
sample_orf_weights(D, d, sigma, rng)→orf_features(X, omega, b). - Nyström:
select_landmarks(X, m=D, strategy="random", rng=rng)→nystrom_features(X, landmarks, sigma).
Return results after the loop.
plot_results(results) — already implemented for you. Call it with the output of benchmark_methods to produce the 3-panel figure.
Choosing σ: The Median Heuristic
A common practical choice is the median heuristic: set \(\sigma = \sqrt{\text{median}(\|x_i - x_j\|^2) / 2}\).
For \(n\) points drawn from \(\mathcal{N}(0, I_d)\), the expected squared distance is \(\mathbb{E}[\|x-y\|^2] = 2d\), so a reasonable default is \(\sigma \approx \sqrt{d}\).
With \(\sigma = 1\) and \(d = 10\), most off-diagonal kernel values are \(\approx e^{-10} \approx 0\) — the kernel is trivially diagonal and no approximation can do anything interesting. Always match \(\sigma\) to your data's scale.
Common Pitfalls
| Problem | Cause | Fix |
|---|---|---|
| Kernel matrix has negative eigenvalues | Floating point error in \(\|x-y\|^2\) | np.maximum(D2, 0) before exp |
| Nyström features are NaN | Near-zero eigenvalue in \(K_{mm}\) | Clip eigenvalues to max(λ, 1e-10) |
| RFF approximation has high variance | D too small | Increase D; try ORF |
np.linalg.solve raises LinAlgError |
\(K + \lambda I\) near-singular | Increase λ |
| Broadcasting error in feature map | Wrong transpose | Verify X @ omega.T gives (n, D) |
| Approximation worse than expected | σ too small for data dimensionality | Use median heuristic: \(\sigma \approx \sqrt{d}\) |
Complexity Summary
| Method | Feature map cost | Memory |
|---|---|---|
| Exact kernel | \(O(n^2 d)\) | \(O(n^2)\) |
| RFF / ORF | \(O(n D d)\) | \(O(nD)\) |
| Nyström | \(O(nm d + m^3)\) | \(O(nm)\) |
All three approximate methods break the \(O(n^2)\) wall. The choice between them:
- RFF: simplest, well-understood, scales to very large \(D\).
- ORF: same cost as RFF, lower variance — preferred when \(D \lesssim d\).
- Nyström: adapts to data geometry; great when the kernel has fast eigenvalue decay; requires an \(O(m^3)\) factorisation of \(K_{mm}\).
Key References
- Rahimi, A. & Recht, B. (2007). Random Features for Large-Scale Kernel Machines. NeurIPS.
- Yu, F. X., Suresh, A. T., Choromanski, K., Holtmann-Rice, D., & Kumar, S. (2016). Orthogonal Random Features. NeurIPS.
- Williams, C. & Seeger, M. (2001). Using the Nyström Method to Speed Up Kernel Machines. NeurIPS.