Linearization of Neural Network Training

To understand how neural tangent kernel (NTK) theory describes the training dynamics of neural networks, you first need to examine how the network function evolves under gradient descent. Consider a neural network parameterized by a vector θ, with its function denoted as f(x, θ). When you train the network, the parameters θ are updated from their initial values θ₀ in order to minimize a loss function. The central question is: how does the output f(x, θ) change as θ evolves during training?

The key insight comes from linearizing the network function around its initialization. Using a first-order Taylor expansion, you can approximate the function at a parameter value θ close to θ₀ as follows:

Formal derivation of the first-order Taylor expansion of the neural network function around initialization:

The Taylor expansion of f(x, θ) around θ₀ gives:

f(x, θ) ≈ f(x, θ₀) + ∇_θ f(x, θ₀) · (θ − θ₀)

Here, ∇_θ f(x, θ₀) is the gradient (Jacobian) of the network output with respect to the parameters, evaluated at initialization. The term (θ − θ₀) represents the change in the parameters during training. This linear approximation is valid when the parameter updates are small, meaning that the network remains close to its initial state.

This expansion shows that, to first order, the change in the network output during training is governed by the gradient of the output with respect to the parameters at initialization, multiplied by the parameter update.
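As a concrete illustration, here is a minimal sketch in Python with JAX that builds the linearized model f(x, θ₀) + ∇_θ f(x, θ₀) · (θ − θ₀) using a Jacobian-vector product. The small two-layer tanh network and the helper names (f, init_params, linearize, f_lin) are illustrative choices for this page, not part of any particular library:

```python
import jax
import jax.numpy as jnp

def f(params, x):
    """A small two-layer tanh network with scalar output (illustrative architecture)."""
    W1, b1, W2, b2 = params
    h = jnp.tanh(x @ W1 + b1)
    return (h @ W2 + b2).squeeze(-1)

def init_params(key, d_in, width):
    k1, k2 = jax.random.split(key)
    W1 = jax.random.normal(k1, (d_in, width)) / jnp.sqrt(d_in)
    W2 = jax.random.normal(k2, (width, 1)) / jnp.sqrt(width)
    return (W1, jnp.zeros(width), W2, jnp.zeros(1))

def linearize(f, params0):
    """Return f_lin with f_lin(params, x) = f(params0, x) + ∇_θ f(x, θ0) · (params − params0)."""
    def f_lin(params, x):
        delta = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
        # jax.jvp evaluates f at params0 and the directional derivative along delta in one pass.
        y0, jvp_term = jax.jvp(lambda p: f(p, x), (params0,), (delta,))
        return y0 + jvp_term
    return f_lin

key = jax.random.PRNGKey(0)
kp, kx = jax.random.split(key)
params0 = init_params(kp, d_in=4, width=256)
x = jax.random.normal(kx, (8, 4))
f_lin = linearize(f, params0)

# At θ = θ0 the linearized and exact networks coincide by construction.
print(jnp.allclose(f(params0, x), f_lin(params0, x)))  # True
```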

This formalism leads to an intuitive picture for wide neural networks. As the width of the network increases, the updates to individual parameters during training become vanishingly small, even though their collective effect on the output remains significant. This phenomenon is sometimes called lazy training. In the infinite-width limit, the parameters stay so close to their initial values that the network output changes almost entirely through the linear term in the Taylor expansion, rather than by moving to a fundamentally different function.
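One hedged way to see lazy training numerically is to run gradient descent at two different widths and track the relative distance ‖θ − θ₀‖ / ‖θ₀‖ from initialization. The sketch below uses a toy regression problem of my own choosing, with an explicit 1/√width output scaling assumed for the network (an NTK-style parameterization), so treat the exact numbers as illustrative only:

```python
import jax
import jax.numpy as jnp

def f(params, x):
    W1, W2 = params
    d_in, width = W1.shape
    h = jnp.tanh(x @ W1 / jnp.sqrt(d_in))        # explicit 1/sqrt(fan_in) scaling (NTK-style parameterization)
    return (h @ W2).squeeze(-1) / jnp.sqrt(width)

def relative_movement(width, steps=200, lr=0.1, seed=0):
    key = jax.random.PRNGKey(seed)
    kx, k1, k2 = jax.random.split(key, 3)
    x = jax.random.normal(kx, (32, 3))
    y = jnp.sin(x[:, 0])                          # toy regression target
    params0 = (jax.random.normal(k1, (3, width)), jax.random.normal(k2, (width, 1)))

    loss = lambda p: jnp.mean((f(p, x) - y) ** 2)
    grad_fn = jax.jit(jax.grad(loss))
    params = params0
    for _ in range(steps):
        params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grad_fn(params))

    moved = jnp.sqrt(sum(jnp.sum((p - p0) ** 2) for p, p0 in zip(params, params0)))
    norm0 = jnp.sqrt(sum(jnp.sum(p0 ** 2) for p0 in params0))
    return float(moved / norm0)

for width in (64, 4096):
    print(width, relative_movement(width))        # the ratio is expected to shrink as the width grows
```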

The reason for this behavior is that, in wide networks, the gradients ∇_θ f(x, θ₀) concentrate around a deterministic limit at initialization and remain nearly constant throughout training. As a result, the network essentially learns by shifting its output in the direction prescribed by these initial gradients, rather than by changing the features it computes. The training trajectory stays in a small neighborhood of the initial parameters, making the linear approximation accurate throughout the optimization process.
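The object that concentrates is the Gram matrix of these initial gradients, the empirical neural tangent kernel K(x_i, x_j) = ∇_θ f(x_i, θ₀) · ∇_θ f(x_j, θ₀). A sketch of how one might probe this, using the same hypothetical two-layer setup as above (the name empirical_ntk is ours): compute the kernel for two independent random initializations and compare; at large width the two matrices should nearly agree.

```python
import jax
import jax.numpy as jnp

def f(params, x):
    W1, W2 = params
    d_in, width = W1.shape
    h = jnp.tanh(x @ W1 / jnp.sqrt(d_in))
    return (h @ W2).squeeze(-1) / jnp.sqrt(width)

def empirical_ntk(params, x):
    """K[i, j] = ∇_θ f(x[i], θ) · ∇_θ f(x[j], θ), with all parameter gradients flattened."""
    def grad_flat(xi):
        g = jax.grad(lambda p: f(p, xi[None, :])[0])(params)
        return jnp.concatenate([leaf.ravel() for leaf in jax.tree_util.tree_leaves(g)])
    G = jax.vmap(grad_flat)(x)                    # (n, num_params) matrix of per-example gradients
    return G @ G.T

key = jax.random.PRNGKey(0)
kx, ka, kb = jax.random.split(key, 3)
x = jax.random.normal(kx, (6, 3))
width = 4096

def init(k):
    k1, k2 = jax.random.split(k)
    return (jax.random.normal(k1, (3, width)), jax.random.normal(k2, (width, 1)))

K_a = empirical_ntk(init(ka), x)
K_b = empirical_ntk(init(kb), x)
# For large width the kernel depends only weakly on the random draw of θ0.
print(float(jnp.linalg.norm(K_a - K_b) / jnp.linalg.norm(K_a)))
```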

However, for this linearization to faithfully describe training dynamics, several assumptions must hold:

  • The neural network must be sufficiently wide so that parameter updates remain small during training;
  • The learning rate must be chosen appropriately to avoid large jumps in parameter space;
  • The loss landscape near initialization should be smooth enough for the Taylor expansion to be valid;
  • The data distribution and input dimension should not cause the gradients to become degenerate or unstable.

When these conditions are satisfied, the linearized dynamics provide a powerful and tractable description of training, forming the basis for NTK theory. If any of these assumptions are violated, the network may undergo feature learning or deviate significantly from its initialization, and the linear approximation may break down.
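One rough diagnostic for whether the linear approximation holds in a given setting is to train the exact network and its linearization side by side on the same data and compare their outputs afterwards; under the assumptions above the gap should be small, and it should shrink further as the width grows. The sketch below reuses the hypothetical two-layer setup from the earlier examples:

```python
import jax
import jax.numpy as jnp

def f(params, x):
    W1, W2 = params
    d_in, width = W1.shape
    h = jnp.tanh(x @ W1 / jnp.sqrt(d_in))
    return (h @ W2).squeeze(-1) / jnp.sqrt(width)

def f_lin(params, params0, x):
    """First-order Taylor expansion of f around params0."""
    delta = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
    y0, jvp_term = jax.jvp(lambda p: f(p, x), (params0,), (delta,))
    return y0 + jvp_term

def train(loss, params, steps=300, lr=0.1):
    grad_fn = jax.jit(jax.grad(loss))
    for _ in range(steps):
        params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grad_fn(params))
    return params

key = jax.random.PRNGKey(0)
kx, k1, k2 = jax.random.split(key, 3)
x = jax.random.normal(kx, (32, 3))
y = jnp.sin(x[:, 0])                              # toy regression target

for width in (64, 4096):
    params0 = (jax.random.normal(k1, (3, width)), jax.random.normal(k2, (width, 1)))
    p_full = train(lambda p: jnp.mean((f(p, x) - y) ** 2), params0)
    p_lin = train(lambda p: jnp.mean((f_lin(p, params0, x) - y) ** 2), params0)
    gap = jnp.max(jnp.abs(f(p_full, x) - f_lin(p_lin, params0, x)))
    print(width, float(gap))                      # the gap is expected to shrink with width
```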

Review question: According to NTK theory and the linearization of neural network training, what primarily governs the change in network output during training for very wide networks?
