Gradient Descent Dynamics in Kernel Space

To understand how neural networks learn in the infinite-width regime, you need to analyze the dynamics of gradient descent in function space. Recall from the previous chapter that the neural tangent kernel (NTK), denoted $\Theta(x, x')$, captures how small changes in parameters affect the output function of the network. In the NTK regime, the kernel remains constant throughout training, which allows you to study the evolution of predictions analytically.

Consider a supervised learning problem with input data $X = [x_1, \ldots, x_n]$ and targets $y = [y_1, \ldots, y_n]$. Let $f_t(x)$ denote the output of the neural network at time $t$ during training. The squared loss is given by $L = \tfrac{1}{2} \sum_i (f_t(x_i) - y_i)^2$. Gradient descent updates the network parameters, which in turn updates the output function. In the NTK regime, the evolution of the output vector $f_t = [f_t(x_1), \ldots, f_t(x_n)]$ can be described by a differential equation:

$$\frac{df_t}{dt} = -\Theta\,(f_t - y)$$

where $\Theta$ is the empirical NTK matrix with entries $\Theta_{ij} = \Theta(x_i, x_j)$. This equation shows that the change in predictions at each step is a linear combination of the current errors, weighted by the NTK.
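
To see where this equation comes from, here is a short derivation sketch assuming gradient flow, that is, gradient descent with an infinitesimally small learning rate, so that $d\theta/dt = -\nabla_\theta L$. Applying the chain rule to the prediction at a single training point $x_i$ gives

$$\frac{d f_t(x_i)}{dt} = \nabla_\theta f_t(x_i)^\top \frac{d\theta}{dt} = -\sum_{j=1}^{n} \underbrace{\nabla_\theta f_t(x_i)^\top \nabla_\theta f_t(x_j)}_{\Theta(x_i,\,x_j)}\,(f_t(x_j) - y_j),$$

and stacking these equations over all $n$ training points recovers $df_t/dt = -\Theta\,(f_t - y)$.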

Solving this linear ordinary differential equation yields:

$$f_t = y + \exp(-\Theta t)\,(f_0 - y)$$

where $f_0$ is the vector of initial predictions and $\exp(-\Theta t)$ is the matrix exponential. This derivation shows that, in the NTK regime, the output evolves under linear dynamics, converging exponentially toward the targets at rates determined by the structure of the kernel.
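
As a numerical sanity check of this closed-form solution, the sketch below uses a small RBF kernel matrix as a stand-in for the empirical NTK (an assumption purely for illustration; the real NTK would come from a network's parameter gradients) and compares the closed-form expression with an explicit Euler integration of the gradient-flow equation.

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential

# Toy setup: an RBF kernel stands in for the empirical NTK (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(8, 1))              # 8 one-dimensional training inputs
y = np.sin(3 * X[:, 0])                          # targets
Theta = np.exp(-0.5 * (X - X.T) ** 2 / 0.3**2)   # kernel matrix, entries Theta(x_i, x_j)
f0 = rng.normal(scale=0.1, size=8)               # initial predictions f_0

def closed_form(t):
    """f_t = y + exp(-Theta t) (f_0 - y)."""
    return y + expm(-Theta * t) @ (f0 - y)

def euler_gradient_flow(t, n_steps=10_000):
    """Integrate df/dt = -Theta (f - y) with small explicit Euler steps."""
    f, dt = f0, t / n_steps
    for _ in range(n_steps):
        f = f - dt * Theta @ (f - y)
    return f

t = 5.0
print(np.max(np.abs(closed_form(t) - euler_gradient_flow(t))))  # small: the two agree
```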

The most profound consequence of this analysis is the equivalence between gradient descent training of a neural network and kernel regression with the NTK in the infinite-width limit. As the network width tends to infinity, the NTK becomes deterministic and fixed, and the network's training trajectory is entirely governed by kernel dynamics. In this setting, training a neural network with gradient descent is mathematically equivalent to performing kernel ridge regression with the NTK as the kernel, and zero regularization. This means that the learned function is the minimum-norm interpolant in the RKHS (reproducing kernel Hilbert space) defined by the NTK, perfectly matching the training data in the infinite time limit.
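
To make the equivalence concrete, here is a minimal sketch of the infinite-time prediction at new inputs, again using an RBF kernel as a stand-in for the NTK and taking $f_0 = 0$ for simplicity, so the limit reduces to plain (ridgeless) kernel interpolation of the training data.

```python
import numpy as np

def kernel(A, B, lengthscale=0.3):
    """RBF kernel used as a stand-in for the NTK (illustrative assumption)."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * sq_dists / lengthscale**2)

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(20, 1))
y_train = np.sin(3 * X_train[:, 0])
X_test = np.linspace(-1, 1, 5).reshape(-1, 1)

# Infinite-time limit of NTK gradient flow with f_0 = 0:
#   f_inf(x*) = Theta(x*, X) Theta(X, X)^{-1} y   (kernel regression, zero regularization)
K_train = kernel(X_train, X_train)
K_test = kernel(X_test, X_train)
alpha = np.linalg.solve(K_train + 1e-10 * np.eye(len(X_train)), y_train)  # tiny jitter for numerical stability
f_inf = K_test @ alpha

print(f_inf)                        # predictions at the test inputs
print(K_train @ alpha - y_train)    # ~0: the training data is interpolated exactly
```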

To visualize how predictions evolve under kernel dynamics, imagine a diagram where the predicted outputs for each training point move smoothly from their initial values to the targets, following trajectories dictated by the NTK. Each prediction is influenced not only by its own error but also by the errors at all other points, with the influence strength determined by the kernel matrix. The collective motion of all predictions forms a flow in function space, converging to the kernel regression solution as training proceeds.

The NTK imposes a specific inductive bias on learning: it restricts the set of functions that can be efficiently learned to those that are "simple" in the sense of the RKHS norm induced by the kernel. Functions that align well with the principal components of the kernel are learned quickly, while more complex functions — those with large RKHS norm — are learned slowly or may not be learned at all. This has direct implications for generalization: the NTK determines which patterns in the data are favored during training, biasing the network toward solutions that are smooth with respect to the kernel. Understanding this bias is crucial for interpreting the successes and limitations of neural networks in the infinite-width regime, as it highlights the role of the kernel structure in shaping what the network can and cannot learn.
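
This spectral picture can be made explicit by diagonalizing the kernel as $\Theta = \sum_k \lambda_k v_k v_k^\top$: projecting the error onto the eigenvectors $v_k$ shows that the $k$-th component decays independently as $\exp(-\lambda_k t)$, so directions with large eigenvalues are fit quickly while directions with small eigenvalues barely move. The sketch below illustrates this, once more with a toy RBF kernel standing in for the NTK.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0])
Theta = np.exp(-0.5 * (X - X.T) ** 2 / 0.3**2)  # toy kernel as NTK stand-in (assumption)

# Symmetric eigendecomposition Theta = V diag(lam) V^T, eigenvalues in ascending order.
lam, V = np.linalg.eigh(Theta)

# With f_0 = 0 the initial error is -y; its k-th coefficient in the eigenbasis
# decays as exp(-lam_k * t) under df/dt = -Theta (f - y).
coeffs0 = V.T @ (-y)
for t in [0.1, 1.0, 10.0]:
    coeffs_t = np.exp(-lam * t) * coeffs0
    print(f"t = {t:5.1f}   remaining error norm = {np.linalg.norm(coeffs_t):.4f}   "
          f"fastest mode remaining = {np.exp(-lam[-1] * t):.2e}   "
          f"slowest mode remaining = {np.exp(-lam[0] * t):.3f}")
```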

