Mean Field Theory for Neural Networks
Distributional and Dynamical Perspectives

Neural Tangent Kernel and Linearization

The neural tangent kernel (NTK) is a fundamental concept in the study of wide neural networks. It describes how the output of a neural network changes in response to small adjustments in its parameters, and becomes especially meaningful in the infinite-width limit. In this regime, the NTK captures the interplay between the architecture, the initialization, and the learning dynamics of the network. The NTK is formally defined as the matrix of inner products between the gradients of the network's output with respect to its parameters, evaluated at initialization. For a neural network function f(x; θ), where x is the input and θ represents the parameters, the NTK is given by:

\Theta(x, x') = \nabla_\theta f(x; \theta) \cdot \nabla_\theta f(x'; \theta)

In the infinite-width limit, the NTK becomes deterministic and constant during training, which dramatically simplifies the analysis of learning dynamics. This kernel encapsulates the way information is shared across data points during gradient descent, and serves as a bridge between neural networks and kernel methods.
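
As a concrete illustration, the sketch below computes one entry of the empirical NTK for a small two-layer network using JAX. The architecture, the width, and the example inputs are illustrative assumptions rather than anything fixed by the text; jax.grad supplies the parameter gradients whose inner product defines Θ(x, x').

```python
import jax
import jax.numpy as jnp

def init_params(key, d_in=3, width=512):
    """Random initialization of a two-layer, scalar-output network."""
    k1, k2 = jax.random.split(key)
    return {
        "W1": jax.random.normal(k1, (width, d_in)),
        "w2": jax.random.normal(k2, (width,)),
    }

def f(params, x):
    # NTK-style parameterization: weights are O(1), the width scaling is explicit.
    h = jnp.tanh(params["W1"] @ x / jnp.sqrt(x.shape[0]))
    return params["w2"] @ h / jnp.sqrt(h.shape[0])

def ntk(params, x1, x2):
    # Theta(x, x') = <grad_theta f(x), grad_theta f(x')>, summed over all parameters.
    g1 = jax.grad(f)(params, x1)
    g2 = jax.grad(f)(params, x2)
    return sum(jnp.vdot(a, b)
               for a, b in zip(jax.tree_util.tree_leaves(g1),
                               jax.tree_util.tree_leaves(g2)))

key = jax.random.PRNGKey(0)
params = init_params(key)
x, x_prime = jnp.ones(3), jnp.arange(3.0)
print(ntk(params, x, x_prime))  # one entry of the empirical NTK at initialization
```

As the width grows, this empirical kernel entry fluctuates less and less across random initializations, which is the concentration behaviour described above.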

To derive the neural tangent kernel (NTK) in the mean field limit, consider a fully-connected neural network with a large number of neurons per layer. As the width of each layer tends to infinity, the pre-activations at each layer become Gaussian-distributed by the central limit theorem, and the empirical kernels built from them concentrate around their deterministic means by the law of large numbers. The network output f(x; θ) can then be approximated by linearizing it around the random initialization θ₀:

f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0) \cdot (\theta - \theta_0)
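
A quick numerical way to see this expansion at work is to compare the exact network output at slightly displaced parameters with the first-order prediction. The sketch below reuses f, params, and x from the previous sketch (all illustrative assumptions); the particular displacement is likewise arbitrary.

```python
def f_lin(params0, delta, x):
    # First-order Taylor expansion of f around the initialization params0.
    g = jax.grad(f)(params0, x)
    return f(params0, x) + sum(
        jnp.vdot(gi, di)
        for gi, di in zip(jax.tree_util.tree_leaves(g),
                          jax.tree_util.tree_leaves(delta)))

# A small displacement standing in for (theta - theta_0) accumulated during training.
delta = jax.tree_util.tree_map(lambda p: 1e-2 * jnp.ones_like(p), params)
shifted = jax.tree_util.tree_map(lambda p, d: p + d, params, delta)

print(f(shifted, x))            # exact output at the displaced parameters
print(f_lin(params, delta, x))  # linearized prediction; close for small displacements
```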

This linearization is valid throughout training in the infinite-width regime, since the parameter updates remain small and the NTK stays constant. The training dynamics under gradient descent are then governed by the NTK:

\frac{d}{dt} f(x; \theta_t) = - \sum_{i} \Theta(x, x_i) \frac{\partial \mathcal{L}}{\partial f(x_i; \theta_t)}

where 𝓛 is the loss function and xᵢ are the training inputs. This equation shows that the evolution of the network output is completely determined by the NTK, and the learning process can be viewed as kernel regression with the NTK as the kernel.
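
For the squared loss 𝓛 = ½ Σᵢ (f(xᵢ; θ) − yᵢ)², the derivative ∂𝓛/∂f(xᵢ) is simply the residual f(xᵢ) − yᵢ, so on the training inputs the dynamics above reduce to the linear ODE df/dt = −Θ(f − y). The sketch below integrates this ODE with plain Euler steps, reusing f, ntk, and params from the earlier sketches; the toy data, step count, and step size are illustrative assumptions.

```python
# Reuses f, ntk, and params from the sketches above.
X_train = jax.random.normal(jax.random.PRNGKey(1), (8, 3))  # toy training inputs
y_train = jnp.sin(X_train.sum(axis=1))                      # toy regression targets

# Train-train NTK matrix, computed once at initialization and then frozen.
Theta = jnp.array([[ntk(params, xi, xj) for xj in X_train] for xi in X_train])

f_t = jnp.array([f(params, xi) for xi in X_train])  # network outputs at t = 0
dt = 0.1
for _ in range(500):
    # Euler step of df/dt = -Theta (f - y): kernel gradient flow for squared loss.
    f_t = f_t - dt * Theta @ (f_t - y_train)

print(jnp.abs(f_t - y_train).max())  # the training residual shrinks along the flow
```

Because Θ stays fixed throughout, these are exactly the dynamics of kernel regression with the NTK as the kernel, which is the correspondence described above.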

The relationship between the NTK and the evolution of network outputs can be visualized as follows:

[Diagram: the neural tangent kernel arises from the gradients of the network outputs with respect to the parameters, and it governs the trajectory of the outputs as training progresses.]

The linearization of training dynamics through the NTK has profound implications for both expressivity and learning in wide neural networks. When the NTK is constant, the network's behavior during training is essentially linear, meaning that the function space explored is limited to the span of the gradients at initialization. This restricts the network's ability to learn highly non-linear transformations during training, but it also enables precise theoretical predictions about generalization and convergence rates. The NTK framework thus reveals a trade-off:

  • Infinite-width networks become analytically tractable and behave like kernel machines;
  • They may lose some of the powerful representational capabilities associated with finite-width, highly non-linear neural networks.
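
Concretely, the restriction to the span of the gradients follows directly from the linearization above: every function reachable during training differs from the initial function by

f_{\mathrm{lin}}(x; \theta) - f(x; \theta_0) = \sum_{p} \frac{\partial f(x; \theta_0)}{\partial \theta_p} \, (\theta_p - \theta_{0,p})

which is a fixed linear combination of the feature functions x \mapsto \partial f(x; \theta_0) / \partial \theta_p determined at initialization.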

Understanding this trade-off is crucial for interpreting the strengths and limitations of mean field theory in deep learning.

