Linearization of Neural Network Training
To understand how neural tangent kernel (NTK) theory describes the training dynamics of neural networks, you first need to examine how the network function evolves under gradient descent. Consider a neural network parameterized by a vector θ, with its function denoted as f(x,θ). When you train the network, the parameters θ are updated from their initial values θ0 in order to minimize a loss function. The central question is: how does the output f(x,θ) change as θ evolves during training?
The key insight comes from linearizing the network function around its initialization. Using a first-order Taylor expansion, you can approximate the function at a parameter value θ close to θ0 as follows:
The first-order Taylor expansion of f(x,θ) around θ0 gives:

f(x,θ) ≈ f(x,θ0) + ∇θf(x,θ0) ⋅ (θ − θ0)

Here, ∇θf(x,θ0) is the gradient (Jacobian) of the network output with respect to the parameters, evaluated at initialization, and (θ − θ0) is the change in the parameters during training. This linear approximation is valid when the parameter updates are small, meaning that the network remains close to its initial state.
This expansion shows that, to first order, the change in the network output during training is governed by the gradient of the output with respect to the parameters at initialization, multiplied by the parameter update.
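To make this concrete, here is a minimal sketch (not from this lesson) that linearizes a toy two-layer network around its initialization in JAX and compares the linearized prediction with the true output after a small parameter perturbation. The architecture, width, input, and perturbation size are all illustrative assumptions; jax.jvp evaluates exactly the directional derivative ∇θf(x,θ0)⋅(θ−θ0) without materializing the full Jacobian.

```python
# A minimal sketch (illustrative assumptions throughout): linearizing a toy
# two-layer network around its initialization with JAX.
import jax
import jax.numpy as jnp

def init_params(key, d_in=4, width=256, d_out=1):
    k1, k2 = jax.random.split(key)
    # NTK-style parameterization: weights are standard normal at initialization,
    # and the forward pass rescales each layer by 1/sqrt(fan_in).
    return {"W1": jax.random.normal(k1, (width, d_in)),
            "W2": jax.random.normal(k2, (d_out, width))}

def f(params, x):
    h = jnp.tanh(params["W1"] @ x / jnp.sqrt(x.shape[0]))
    return (params["W2"] @ h / jnp.sqrt(h.shape[0]))[0]

theta0 = init_params(jax.random.PRNGKey(0))
x = jnp.ones(4)

# Pretend training moved every parameter by a small amount delta.
delta = jax.tree_util.tree_map(lambda p: 0.01 * jnp.ones_like(p), theta0)
theta = jax.tree_util.tree_map(lambda p, d: p + d, theta0, delta)

# First-order Taylor expansion: f(x, theta0) + grad_theta f(x, theta0) . (theta - theta0).
# jax.jvp returns both f(x, theta0) and the directional derivative in one pass.
f0, linear_term = jax.jvp(lambda p: f(p, x), (theta0,), (delta,))
f_lin = f0 + linear_term

print("true output f(x, theta):", float(f(theta, x)))
print("linearized prediction:  ", float(f_lin))
```

For a perturbation this small the two numbers agree closely; how long that agreement persists over an entire training run is exactly what the rest of this section addresses.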
This formalism leads to an intuitive picture for wide neural networks. As the width of the network increases, the change in the parameters needed to fit the data becomes vanishingly small relative to their overall scale. This phenomenon is sometimes called lazy training. In the infinite-width limit, each individual parameter barely moves from its initial value; essentially all of the learning happens through the linear term in the Taylor expansion, rather than by moving to a fundamentally different function.
The reason for this behavior is that, in wide networks, the gradients of the output with respect to the parameters concentrate around a deterministic value at initialization and remain nearly constant throughout training. As a result, the network essentially learns by shifting its output along the directions prescribed by the initial gradients, rather than by changing the features it computes. The training trajectory stays in a small neighborhood of the initial parameters, making the linear approximation accurate throughout the optimization process.
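The following sketch illustrates this empirically: it runs a fixed number of plain gradient-descent steps on a toy regression problem and reports how far the parameters travel, relative to their initial norm, for several widths. The widths, data, step size, and number of steps are arbitrary illustrative choices; under the NTK-style scaling used here, the relative displacement should shrink as the width grows.

```python
# A minimal sketch (illustrative assumptions throughout): after a fixed number of
# gradient-descent steps on a toy regression task, measure how far the parameters
# move relative to their initial norm for several widths.
import jax
import jax.numpy as jnp

def init_params(key, d_in, width):
    k1, k2 = jax.random.split(key)
    return {"W1": jax.random.normal(k1, (width, d_in)),
            "W2": jax.random.normal(k2, (1, width))}

def f(params, x):  # NTK-style parameterization, as in the previous sketch
    h = jnp.tanh(params["W1"] @ x / jnp.sqrt(x.shape[0]))
    return (params["W2"] @ h / jnp.sqrt(h.shape[0]))[0]

def loss(params, X, y):
    preds = jax.vmap(lambda x: f(params, x))(X)
    return jnp.mean((preds - y) ** 2)

key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (32, 8))
y = jnp.sin(X[:, 0])  # arbitrary smooth target

for width in [16, 256, 4096]:
    theta0 = init_params(jax.random.PRNGKey(1), d_in=8, width=width)
    theta = theta0
    for _ in range(100):  # a few plain gradient-descent steps
        grads = jax.grad(loss)(theta, X, y)
        theta = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, theta, grads)
    # Relative distance traveled in parameter space.
    moved = jnp.sqrt(sum(jnp.sum((theta[k] - theta0[k]) ** 2) for k in theta))
    scale = jnp.sqrt(sum(jnp.sum(theta0[k] ** 2) for k in theta0))
    print(f"width={width:5d}  ||theta - theta0|| / ||theta0|| = {float(moved / scale):.4f}")
```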
However, for this linearization to faithfully describe training dynamics, several assumptions must hold:
- The neural network must be sufficiently wide so that parameter updates remain small during training;
- The learning rate must be chosen appropriately to avoid large jumps in parameter space;
- The loss landscape near initialization should be smooth enough for the Taylor expansion to be valid;
- The data distribution and input dimension should not cause the gradients to become degenerate or unstable.
When these conditions are satisfied, the linearized dynamics provide a powerful and tractable description of training, forming the basis for NTK theory. If any of these assumptions are violated, the network may undergo feature learning or deviate significantly from its initialization, and the linear approximation may break down.
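One practical way to probe whether these conditions hold is to monitor, during training, the gap between the true network output and its first-order Taylor expansion around θ0. The sketch below does this for the same kind of toy setup as above (all concrete choices are again illustrative); a gap that stays small indicates the lazy-training regime, while a growing gap signals feature learning and a breakdown of the linear approximation.

```python
# A minimal sketch (same toy setup and illustrative choices as above): track the
# gap between the trained network's output and its linearization around theta0
# at a probe input while training proceeds.
import jax
import jax.numpy as jnp

def init_params(key, d_in, width):
    k1, k2 = jax.random.split(key)
    return {"W1": jax.random.normal(k1, (width, d_in)),
            "W2": jax.random.normal(k2, (1, width))}

def f(params, x):
    h = jnp.tanh(params["W1"] @ x / jnp.sqrt(x.shape[0]))
    return (params["W2"] @ h / jnp.sqrt(h.shape[0]))[0]

def loss(params, X, y):
    return jnp.mean((jax.vmap(lambda x: f(params, x))(X) - y) ** 2)

key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (32, 8))
y = jnp.sin(X[:, 0])
x_test = jnp.ones(8)  # arbitrary probe point

theta0 = init_params(jax.random.PRNGKey(1), d_in=8, width=1024)
# jax.linearize gives f(x_test, theta0) and a function applying the Jacobian at
# theta0, so the linearized prediction is f0 + jvp_at_theta0(theta - theta0).
f0, jvp_at_theta0 = jax.linearize(lambda p: f(p, x_test), theta0)

theta = theta0
for step in range(201):
    if step % 50 == 0:
        delta = jax.tree_util.tree_map(lambda a, b: a - b, theta, theta0)
        gap = jnp.abs(f(theta, x_test) - (f0 + jvp_at_theta0(delta)))
        print(f"step {step:3d}  |f - f_lin| at the probe point = {float(gap):.2e}")
    grads = jax.grad(loss)(theta, X, y)
    theta = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, theta, grads)
```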