Neural Tangent Kernel Theory

Linearization of Neural Network Training

To understand how neural tangent kernel (NTK) theory describes the training dynamics of neural networks, you first need to examine how the network function evolves under gradient descent. Consider a neural network parameterized by a vector θ, with its function denoted as f(x, θ). When you train the network, the parameters θ are updated from their initial values θ₀ in order to minimize a loss function. The central question is: how does the output f(x, θ) change as θ evolves during training?

The key insight comes from linearizing the network function around its initialization. Using a first-order Taylor expansion, you can approximate the function at a parameter value θ close to θ₀ as follows:

Formal derivation of the first-order Taylor expansion of the neural network function around initialization:

The Taylor expansion of f(x, θ) around θ₀ gives:

f(x, θ) ≈ f(x, θ₀) + ∇_θ f(x, θ₀) · (θ − θ₀)

Here, ∇_θ f(x, θ₀) is the gradient (Jacobian) of the network output with respect to the parameters, evaluated at initialization. The term (θ − θ₀) represents the change in the parameters during training. This linear approximation is valid when the parameter updates are small, meaning that the network remains close to its initial state.

This expansion shows that, to first order, the change in the network output during training is governed by the gradient of the output with respect to the parameters at initialization, multiplied by the parameter update.
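As a concrete illustration, here is a minimal JAX sketch of the linearized function f_lin(x, θ) = f(x, θ₀) + ∇_θ f(x, θ₀) · (θ − θ₀); the two-layer architecture, its width, and the size of the perturbation are illustrative assumptions, not part of the lesson:

```python
import jax
import jax.numpy as jnp

d_in, width = 10, 512
k1, k2, kx = jax.random.split(jax.random.PRNGKey(0), 3)
params0 = {"W1": jax.random.normal(k1, (d_in, width)),
           "W2": jax.random.normal(k2, (width, 1))}
x = jax.random.normal(kx, (5, d_in))  # a small batch of inputs

def f(p, x):
    # Two-layer network f(x, θ) with 1/sqrt(fan_in) scaling in the forward pass
    h = jnp.tanh(x @ p["W1"] / jnp.sqrt(d_in))
    return (h @ p["W2"] / jnp.sqrt(width)).squeeze(-1)

def f_lin(p, p0, x):
    # First-order Taylor expansion around θ₀:
    # f_lin(x, θ) = f(x, θ₀) + ∇_θ f(x, θ₀) · (θ − θ₀)
    dp = jax.tree_util.tree_map(lambda a, b: a - b, p, p0)
    f0, df = jax.jvp(lambda q: f(q, x), (p0,), (dp,))
    return f0 + df

# A small perturbation of θ₀ stands in for a few gradient-descent updates
params = jax.tree_util.tree_map(lambda w: w + 1e-3, params0)
print(float(jnp.max(jnp.abs(f(params, x) - f_lin(params, params0, x)))))  # tiny gap
```

Because the Jacobian is only ever applied to the displacement θ − θ₀, `jax.jvp` never materializes the full Jacobian, which would be far too large for realistic networks.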

This formalism leads to an intuitive picture for wide neural networks. As the width of the network increases, the parameters move only a tiny distance during training relative to the scale of their initialization. This phenomenon is sometimes called lazy training. In the infinite-width limit, the parameters stay so close to their initial values that most of the learning happens by adjusting the output through the linear term in the Taylor expansion, rather than by moving to a fundamentally different function.

The reason for this behavior is that, in wide networks, the gradients of the output with respect to the parameters concentrate at initialization and stay essentially constant throughout training. As a result, the network learns by shifting its output along the directions prescribed by the initial gradients, rather than by changing the features it computes. The training trajectory stays in a small neighborhood of the initial parameters, making the linear approximation accurate throughout the optimization process.
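A rough empirical way to see this lazy-training effect is to train the same kind of toy network at several widths and measure how far the parameters travel relative to their initialization. The sketch below is only an illustration under assumed settings (architecture, data, learning rate, and step count are all made up for the example), not a definitive experiment:

```python
import jax
import jax.numpy as jnp

d_in, n = 10, 20
x = jax.random.normal(jax.random.PRNGKey(1), (n, d_in))
y = jnp.sin(x[:, 0])  # arbitrary regression targets

def run(width, steps=300, lr=0.2):
    k1, k2 = jax.random.split(jax.random.PRNGKey(0))
    params0 = {"W1": jax.random.normal(k1, (d_in, width)),
               "W2": jax.random.normal(k2, (width, 1))}

    def f(p, x):
        # NTK parameterization: 1/sqrt(fan_in) factors live in the forward pass
        h = jnp.tanh(x @ p["W1"] / jnp.sqrt(d_in))
        return (h @ p["W2"] / jnp.sqrt(width)).squeeze(-1)

    grad_fn = jax.jit(jax.grad(lambda p: jnp.mean((f(p, x) - y) ** 2)))
    params = params0
    for _ in range(steps):  # plain gradient descent
        params = jax.tree_util.tree_map(lambda w, g: w - lr * g,
                                        params, grad_fn(params))

    # Relative distance travelled in parameter space: ||θ_T − θ₀|| / ||θ₀||
    num = sum(jnp.sum((params[k] - params0[k]) ** 2) for k in params)
    den = sum(jnp.sum(params0[k] ** 2) for k in params)
    return float(jnp.sqrt(num / den))

for width in (64, 512, 4096):
    print(width, run(width))  # the ratio shrinks as the width grows
```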

However, for this linearization to faithfully describe training dynamics, several assumptions must hold:

  • The neural network must be sufficiently wide so that parameter updates remain small during training;
  • The learning rate must be chosen appropriately to avoid large jumps in parameter space;
  • The loss landscape near initialization should be smooth enough for the Taylor expansion to be valid;
  • The data distribution and input dimension should not cause the gradients to become degenerate or unstable.

When these conditions are satisfied, the linearized dynamics provide a powerful and tractable description of training, forming the basis for NTK theory. If any of these assumptions are violated, the network may undergo feature learning or deviate significantly from its initialization, and the linear approximation may break down.
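One hedged way to check these assumptions in practice is to track the empirical neural tangent kernel, Θ(x, x′) = ∇_θ f(x, θ) · ∇_θ f(x′, θ), over training: if the linearized description holds, this kernel barely drifts from its value at initialization. The sketch below (network, data, and optimizer settings are assumed for illustration) computes that relative drift:

```python
import jax
import jax.numpy as jnp

width, d_in, n = 2048, 10, 20
k1, k2, kx = jax.random.split(jax.random.PRNGKey(0), 3)
params = {"W1": jax.random.normal(k1, (d_in, width)),
          "W2": jax.random.normal(k2, (width, 1))}
x = jax.random.normal(kx, (n, d_in))
y = jnp.sin(x[:, 0])  # arbitrary regression targets

def f(p, x):
    # NTK parameterization: 1/sqrt(fan_in) factors live in the forward pass
    h = jnp.tanh(x @ p["W1"] / jnp.sqrt(d_in))
    return (h @ p["W2"] / jnp.sqrt(width)).squeeze(-1)

def empirical_ntk(p, x):
    # Gram matrix of per-example parameter gradients: Θ_ij = ∇f(x_i) · ∇f(x_j)
    jac = jax.jacobian(f)(p, x)  # pytree with leaves of shape (n, *param_shape)
    J = jnp.concatenate([j.reshape(n, -1)
                         for j in jax.tree_util.tree_leaves(jac)], axis=1)
    return J @ J.T

K0 = empirical_ntk(params, x)
grad_fn = jax.jit(jax.grad(lambda p: jnp.mean((f(p, x) - y) ** 2)))
for _ in range(300):  # plain gradient descent
    params = jax.tree_util.tree_map(lambda w, g: w - 0.2 * g,
                                    params, grad_fn(params))
KT = empirical_ntk(params, x)
print(float(jnp.linalg.norm(KT - K0) / jnp.linalg.norm(K0)))  # small in the lazy regime
```

A drift close to zero indicates the lazy regime; a large drift signals feature learning, where the linear approximation around θ₀ is no longer trustworthy.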

According to NTK theory and the linearization of neural network training, what primarily governs the change in network output during training for very wide networks?
