Definition and Properties of the Neural Tangent Kernel
The Neural Tangent Kernel (NTK) is a fundamental concept in understanding the training dynamics of wide neural networks. Formally, given a neural network function f(θ,x) with parameters θ and input x, the NTK is defined as the inner product of the Jacobians of the network output with respect to its parameters, evaluated at possibly different inputs. Specifically, for inputs x and x′, the NTK is given by:
$$\Theta(x, x') = \nabla_\theta f(\theta, x) \cdot \nabla_\theta f(\theta, x')^{\top}$$

where ∇_θ f(θ, x) denotes the gradient (Jacobian) of the network output with respect to its parameters at input x. The NTK captures how changes in the parameters affect the outputs at different inputs, and thus encodes the geometry of function space induced by the network architecture and initialization.
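To make the definition concrete, here is a minimal sketch of computing the empirical NTK with automatic differentiation in JAX. The one-hidden-layer tanh architecture, the 1/√width output scaling, the layer sizes, and the random initialization are illustrative assumptions, not details fixed by the text.

```python
# Minimal sketch: empirical NTK of a scalar-output network via autodiff.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def f(params, x):
    """Scalar network output f(theta, x): one hidden tanh layer."""
    W, v = params
    return v @ jnp.tanh(W @ x) / jnp.sqrt(v.shape[0])  # 1/sqrt(width) output scaling

def ntk(params, x1, x2):
    """Theta(x, x') = <grad_theta f(theta, x), grad_theta f(theta, x')>."""
    g1, _ = ravel_pytree(jax.grad(f)(params, x1))  # flatten all parameter gradients
    g2, _ = ravel_pytree(jax.grad(f)(params, x2))
    return g1 @ g2

kW, kv, kx = jax.random.split(jax.random.PRNGKey(0), 3)
d, width = 3, 1024
params = (jax.random.normal(kW, (width, d)),  # hidden-layer weights W
          jax.random.normal(kv, (width,)))    # output weights v
x1, x2 = jax.random.normal(kx, (2, d))
print(ntk(params, x1, x2))  # a single kernel entry Theta(x1, x2)
```

The kernel value is literally the dot product of two flattened parameter gradients, matching the displayed definition.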
To see how the NTK arises in practice, consider a simple fully connected neural network with a single hidden layer. Let the network output be f(θ, x) = v^⊤ φ(Wx), where W is the weight matrix of the hidden layer, v is the output weight vector, and φ is a pointwise nonlinearity. From the linearization discussed previously, the network function can be approximated near initialization by its first-order Taylor expansion in θ:
$$f(\theta, x) \approx f(\theta_0, x) + \nabla_\theta f(\theta_0, x) \cdot (\theta - \theta_0)$$

The NTK for this network, at initialization, is thus:
$$\Theta(x, x') = \nabla_\theta f(\theta_0, x) \cdot \nabla_\theta f(\theta_0, x')^{\top}$$

Expanding this, the NTK can be written as the sum of contributions from the gradients with respect to both W and v. For large hidden layer width, the NTK converges to a deterministic kernel that depends only on the input statistics and the choice of nonlinearity φ.
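To spell out that expansion, write wᵢ for the i-th row of W, so that ∂f/∂vᵢ = φ(wᵢ·x) and ∂f/∂Wᵢⱼ = vᵢ φ′(wᵢ·x) xⱼ. Summing the products of these gradients at x and x′ (and omitting any width-dependent scaling factors) gives

$$\Theta(x, x') = \underbrace{\varphi(Wx)^{\top}\varphi(Wx')}_{\text{gradients w.r.t. } v} \;+\; \underbrace{(x^{\top}x')\sum_{i} v_i^{2}\,\varphi'(w_i^{\top}x)\,\varphi'(w_i^{\top}x')}_{\text{gradients w.r.t. } W}$$

Both terms are sums over hidden units, so as the width grows they concentrate (after appropriate scaling) around their expectations under the random initialization, which is why the kernel becomes deterministic in the wide limit.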
A useful way to think of the NTK is as a kernel over inputs induced by the network's parameterization: it maps pairs of inputs to real numbers that quantify how a change in the parameters jointly influences the outputs at those inputs.
This kernel structure is central to understanding how neural networks behave in the infinite-width regime, where training dynamics can be described entirely in terms of the NTK.
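As a sketch of what "described entirely in terms of the NTK" means: for the squared loss ½‖f_t(X) − y‖² on training inputs X with targets y, gradient flow on the parameters induces the following dynamics on the function values, where Θ denotes the kernel matrix on the training inputs and η the learning rate:

$$\frac{d f_t(X)}{dt} = -\,\eta\,\Theta\,\big(f_t(X) - y\big), \qquad f_t(X) = y + e^{-\eta \Theta t}\big(f_0(X) - y\big)$$

The closed-form solution on the right uses the infinite-width property that Θ stays fixed during training; at finite width it holds only approximately.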
Several important properties characterize the NTK and its implications for training:
- Invariance: for certain architectures and choices of nonlinearity, the NTK is invariant to input transformations such as permutations or orthogonal rotations, provided the network weights are initialized with appropriate symmetries;
- Stationarity: in translation-invariant architectures (like convolutional networks), the NTK may become a stationary kernel, depending only on relative positions of inputs rather than their absolute coordinates;
- Constancy in the infinite-width limit: as the width of the network increases, the NTK converges to a fixed kernel that does not change during training, leading to linearized training dynamics (a numerical sketch of this appears after the list);
- Role in training: the NTK determines how fast and in what directions the network function changes during gradient descent, fully characterizing training dynamics in the infinite-width regime.
These properties highlight the NTK's central role in connecting neural network architectures, their symmetries, and the resulting learning behavior.
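The constancy property can be probed numerically. The sketch below, which reuses the one-hidden-layer tanh network from the earlier snippet, compares the empirical NTK matrix on a small synthetic training set before and after a short run of full-batch gradient descent, at two widths. The widths, learning rate, step count, and data are arbitrary choices for illustration; the expectation from the theory is only that the relative change shrinks as the width grows.

```python
# Rough numerical probe of NTK constancy: kernel drift during training
# should shrink as the hidden width grows.
import jax
import jax.numpy as jnp

def f(params, x):
    W, v = params
    return v @ jnp.tanh(W @ x) / jnp.sqrt(v.shape[0])  # 1/sqrt(width) scaling

def flat_grad(params, x):
    gW, gv = jax.grad(f)(params, x)                  # per-example parameter gradients
    return jnp.concatenate([gW.ravel(), gv.ravel()])

def ntk_matrix(params, X):
    G = jax.vmap(lambda x: flat_grad(params, x))(X)  # shape (n, num_params)
    return G @ G.T                                   # Theta[i, j] = <grad_i, grad_j>

def loss(params, X, y):
    preds = jax.vmap(lambda x: f(params, x))(X)
    return 0.5 * jnp.sum((preds - y) ** 2)

kX, ky = jax.random.split(jax.random.PRNGKey(0))
X = jax.random.normal(kX, (10, 3))                   # small synthetic dataset
y = jax.random.normal(ky, (10,))

for width in (64, 4096):
    kW, kv = jax.random.split(jax.random.PRNGKey(width))
    params = (jax.random.normal(kW, (width, 3)), jax.random.normal(kv, (width,)))
    theta_init = ntk_matrix(params, X)
    for _ in range(100):                             # full-batch gradient descent
        grads = jax.grad(loss)(params, X, y)
        params = jax.tree_util.tree_map(lambda p, g: p - 0.05 * g, params, grads)
    drift = (jnp.linalg.norm(ntk_matrix(params, X) - theta_init)
             / jnp.linalg.norm(theta_init))
    print(f"width {width:5d}: relative NTK change = {float(drift):.4f}")
```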