Linearization of Neural Network Training

To understand how neural tangent kernel (NTK) theory describes the training dynamics of neural networks, you first need to examine how the network function evolves under gradient descent. Consider a neural network parameterized by a vector θ, with its function denoted as f(x, θ). When you train the network, the parameters θ are updated from their initial values θ₀ in order to minimize a loss function. The central question is: how does the output f(x, θ) change as θ evolves during training?

The key insight comes from linearizing the network function around its initialization. Using a first-order Taylor expansion, you can approximate the function at a parameter value θ close to θ₀ as follows:

Formal derivation of the first-order Taylor expansion of the neural network function around initialization:

The Taylor expansion of f(x, θ) around θ₀ gives:

f(x, θ) ≈ f(x, θ₀) + ∇_θ f(x, θ₀) · (θ − θ₀)

Here, ∇_θ f(x, θ₀) is the gradient (Jacobian) of the network output with respect to the parameters, evaluated at initialization. The term (θ − θ₀) represents the change in the parameters during training. This linear approximation is valid when the parameter updates are small, meaning that the network remains close to its initial state.

This expansion shows that, to first order, the change in the network output during training is governed by the gradient of the output with respect to the parameters at initialization, multiplied by the parameter update.
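As a concrete illustration, here is a minimal sketch in Python with JAX that builds the linearized model f(x, θ₀) + ∇_θ f(x, θ₀) · (θ − θ₀) using a Jacobian-vector product. The small two-layer tanh network and the helper names (f, init_params, linearize, f_lin) are illustrative choices for this page, not part of any particular library:

```python
import jax
import jax.numpy as jnp

def f(params, x):
    """A small two-layer tanh network with scalar output (illustrative architecture)."""
    W1, b1, W2, b2 = params
    h = jnp.tanh(x @ W1 + b1)
    return (h @ W2 + b2).squeeze(-1)

def init_params(key, d_in, width):
    k1, k2 = jax.random.split(key)
    W1 = jax.random.normal(k1, (d_in, width)) / jnp.sqrt(d_in)
    W2 = jax.random.normal(k2, (width, 1)) / jnp.sqrt(width)
    return (W1, jnp.zeros(width), W2, jnp.zeros(1))

def linearize(f, params0):
    """Return f_lin with f_lin(params, x) = f(params0, x) + ∇_θ f(x, θ0) · (params − params0)."""
    def f_lin(params, x):
        delta = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
        # jax.jvp evaluates f at params0 and the directional derivative along delta in one pass.
        y0, jvp_term = jax.jvp(lambda p: f(p, x), (params0,), (delta,))
        return y0 + jvp_term
    return f_lin

key = jax.random.PRNGKey(0)
kp, kx = jax.random.split(key)
params0 = init_params(kp, d_in=4, width=256)
x = jax.random.normal(kx, (8, 4))
f_lin = linearize(f, params0)

# At θ = θ0 the linearized and exact networks coincide by construction.
print(jnp.allclose(f(params0, x), f_lin(params0, x)))  # True
```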

This formalism leads to an intuitive picture for wide neural networks. As the width of the network increases, the updates to individual parameters during training become vanishingly small, even though their collective effect on the output remains significant. This phenomenon is sometimes called lazy training. In the infinite-width limit, the parameters stay so close to their initial values that the network output changes almost entirely through the linear term in the Taylor expansion, rather than by moving to a fundamentally different function.
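One hedged way to see lazy training numerically is to run gradient descent at two different widths and track the relative distance ‖θ − θ₀‖ / ‖θ₀‖ from initialization. The sketch below uses a toy regression problem of my own choosing, with an explicit 1/√width output scaling assumed for the network (an NTK-style parameterization), so treat the exact numbers as illustrative only:

```python
import jax
import jax.numpy as jnp

def f(params, x):
    W1, W2 = params
    d_in, width = W1.shape
    h = jnp.tanh(x @ W1 / jnp.sqrt(d_in))        # explicit 1/sqrt(fan_in) scaling (NTK-style parameterization)
    return (h @ W2).squeeze(-1) / jnp.sqrt(width)

def relative_movement(width, steps=200, lr=0.1, seed=0):
    key = jax.random.PRNGKey(seed)
    kx, k1, k2 = jax.random.split(key, 3)
    x = jax.random.normal(kx, (32, 3))
    y = jnp.sin(x[:, 0])                          # toy regression target
    params0 = (jax.random.normal(k1, (3, width)), jax.random.normal(k2, (width, 1)))

    loss = lambda p: jnp.mean((f(p, x) - y) ** 2)
    grad_fn = jax.jit(jax.grad(loss))
    params = params0
    for _ in range(steps):
        params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grad_fn(params))

    moved = jnp.sqrt(sum(jnp.sum((p - p0) ** 2) for p, p0 in zip(params, params0)))
    norm0 = jnp.sqrt(sum(jnp.sum(p0 ** 2) for p0 in params0))
    return float(moved / norm0)

for width in (64, 4096):
    print(width, relative_movement(width))        # the ratio is expected to shrink as the width grows
```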

The reason for this behavior is that, in wide networks, the gradients ∇_θ f(x, θ₀) concentrate around a deterministic limit at initialization and remain nearly constant throughout training. As a result, the network essentially learns by shifting its output in the direction prescribed by these initial gradients, rather than by changing the features it computes. The training trajectory stays in a small neighborhood of the initial parameters, making the linear approximation accurate throughout the optimization process.
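The object that concentrates is the Gram matrix of these initial gradients, the empirical neural tangent kernel K(x_i, x_j) = ∇_θ f(x_i, θ₀) · ∇_θ f(x_j, θ₀). A sketch of how one might probe this, using the same hypothetical two-layer setup as above (the name empirical_ntk is ours): compute the kernel for two independent random initializations and compare; at large width the two matrices should nearly agree.

```python
import jax
import jax.numpy as jnp

def f(params, x):
    W1, W2 = params
    d_in, width = W1.shape
    h = jnp.tanh(x @ W1 / jnp.sqrt(d_in))
    return (h @ W2).squeeze(-1) / jnp.sqrt(width)

def empirical_ntk(params, x):
    """K[i, j] = ∇_θ f(x[i], θ) · ∇_θ f(x[j], θ), with all parameter gradients flattened."""
    def grad_flat(xi):
        g = jax.grad(lambda p: f(p, xi[None, :])[0])(params)
        return jnp.concatenate([leaf.ravel() for leaf in jax.tree_util.tree_leaves(g)])
    G = jax.vmap(grad_flat)(x)                    # (n, num_params) matrix of per-example gradients
    return G @ G.T

key = jax.random.PRNGKey(0)
kx, ka, kb = jax.random.split(key, 3)
x = jax.random.normal(kx, (6, 3))
width = 4096

def init(k):
    k1, k2 = jax.random.split(k)
    return (jax.random.normal(k1, (3, width)), jax.random.normal(k2, (width, 1)))

K_a = empirical_ntk(init(ka), x)
K_b = empirical_ntk(init(kb), x)
# For large width the kernel depends only weakly on the random draw of θ0.
print(float(jnp.linalg.norm(K_a - K_b) / jnp.linalg.norm(K_a)))
```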

However, for this linearization to faithfully describe training dynamics, several assumptions must hold:

  • The neural network must be sufficiently wide so that parameter updates remain small during training;
  • The learning rate must be chosen appropriately to avoid large jumps in parameter space;
  • The loss landscape near initialization should be smooth enough for the Taylor expansion to be valid;
  • The data distribution and input dimension should not cause the gradients to become degenerate or unstable.

When these conditions are satisfied, the linearized dynamics provide a powerful and tractable description of training, forming the basis for NTK theory. If any of these assumptions are violated, the network may undergo feature learning or deviate significantly from its initialization, and the linear approximation may break down.
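One rough diagnostic for whether the linear approximation holds in a given setting is to train the exact network and its linearization side by side on the same data and compare their outputs afterwards; under the assumptions above the gap should be small, and it should shrink further as the width grows. The sketch below reuses the hypothetical two-layer setup from the earlier examples:

```python
import jax
import jax.numpy as jnp

def f(params, x):
    W1, W2 = params
    d_in, width = W1.shape
    h = jnp.tanh(x @ W1 / jnp.sqrt(d_in))
    return (h @ W2).squeeze(-1) / jnp.sqrt(width)

def f_lin(params, params0, x):
    """First-order Taylor expansion of f around params0."""
    delta = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
    y0, jvp_term = jax.jvp(lambda p: f(p, x), (params0,), (delta,))
    return y0 + jvp_term

def train(loss, params, steps=300, lr=0.1):
    grad_fn = jax.jit(jax.grad(loss))
    for _ in range(steps):
        params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grad_fn(params))
    return params

key = jax.random.PRNGKey(0)
kx, k1, k2 = jax.random.split(key, 3)
x = jax.random.normal(kx, (32, 3))
y = jnp.sin(x[:, 0])                              # toy regression target

for width in (64, 4096):
    params0 = (jax.random.normal(k1, (3, width)), jax.random.normal(k2, (width, 1)))
    p_full = train(lambda p: jnp.mean((f(p, x) - y) ** 2), params0)
    p_lin = train(lambda p: jnp.mean((f_lin(p, params0, x) - y) ** 2), params0)
    gap = jnp.max(jnp.abs(f(p_full, x) - f_lin(p_lin, params0, x)))
    print(width, float(gap))                      # the gap is expected to shrink with width
```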

Review question: According to NTK theory and the linearization of neural network training, what primarily governs the change in network output during training for very wide networks?
