
Gaussian Process Correspondence at Initialization

At initialization, a fully connected neural network with a large number of hidden units in each layer exhibits a remarkable property: as the width of each layer tends to infinity ($n \to \infty$), the distribution over functions computed by the network converges to a Gaussian process (GP). This correspondence is foundational for understanding the statistical behavior of neural networks in the infinite-width regime.

To state this result precisely, consider a neural network with $L$ layers, where each layer has $n$ neurons and $n \to \infty$. The weights and biases are initialized independently from zero-mean Gaussian distributions, and the activation function $\phi$ is applied elementwise. The output $f(x)$ of the network for input $x$ is therefore a random variable determined by the random initialization of the weights and biases.

The Gaussian process correspondence asserts that, under these conditions and for any finite set of inputs $\{x_1, \dots, x_m\}$, the joint distribution of the outputs $\{f(x_1), \dots, f(x_m)\}$ converges to a multivariate Gaussian as $n \to \infty$. The mean is zero (assuming zero-mean initialization), and the covariance between $f(x)$ and $f(x')$ is determined recursively by the architecture and the activation function.
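
The claim can be probed numerically. Below is a minimal NumPy sketch (not part of the lesson itself) that samples many independent single-hidden-layer networks of large width, using tanh activation, fan-in scaled weight variance, and illustrative values of $\sigma_w$ and $\sigma_b$, and then inspects the empirical mean and covariance of the joint outputs at a few fixed inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def wide_net_outputs(X, width, sigma_w=1.0, sigma_b=0.1):
    """Sample one random single-hidden-layer network and return f(x) for each row of X."""
    d = X.shape[1]
    # Zero-mean Gaussian initialization; weight variance scaled by 1/fan-in.
    W1 = rng.normal(0.0, sigma_w / np.sqrt(d), size=(d, width))
    b1 = rng.normal(0.0, sigma_b, size=width)
    W2 = rng.normal(0.0, sigma_w / np.sqrt(width), size=(width, 1))
    b2 = rng.normal(0.0, sigma_b)
    h = np.tanh(X @ W1 + b1)           # hidden activations, shape (m, width)
    return (h @ W2).ravel() + b2       # outputs f(x_1), ..., f(x_m)

# A small fixed set of inputs {x_1, ..., x_m}.
X = rng.normal(size=(3, 5))

# Many independent initializations give samples of the joint output vector.
samples = np.stack([wide_net_outputs(X, width=4096) for _ in range(2000)])

print("empirical mean:\n", samples.mean(axis=0))      # close to zero
print("empirical covariance:\n", np.cov(samples.T))   # approaches the limiting GP kernel
```

As the width and the number of draws grow, the empirical joint distribution of the outputs is increasingly well described by a zero-mean multivariate Gaussian.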

The key assumptions for this correspondence are:

  • All weights and biases are initialized independently from zero-mean Gaussian distributions, with variances chosen to prevent signal explosion or decay (the standard scaling is written out just after this list);
  • The activation function $\phi$ is measurable and satisfies mild growth conditions (such as bounded moments);
  • The width of each hidden layer tends to infinity, while the depth $L$ is fixed.
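
Concretely, the first assumption is typically met by scaling the weight variance with the layer's fan-in; a standard way to write this choice is

W^{(l)}_{ij} \sim \mathcal{N}\!\left(0, \tfrac{\sigma_w^2}{n_{l-1}}\right), \qquad b^{(l)}_i \sim \mathcal{N}\!\left(0, \sigma_b^2\right),

where $n_{l-1}$ is the width of the previous layer. Dividing the weight variance by the fan-in keeps the variance of each pre-activation of order one as the width grows.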

The derivation proceeds by noting that, due to the Central Limit Theorem, the pre-activations at each hidden layer become jointly Gaussian as the width increases, provided the previous layer's activations are independent and identically distributed across neurons. This holds in the infinite-width limit, allowing you to recursively compute the covariance structure across layers.
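
To see where the Central Limit Theorem enters, write the pre-activation of neuron $i$ in layer $l+1$ as a sum over the $n$ neurons of the previous layer (the indexed notation below is introduced only for this illustration):

z_i^{(l+1)}(x) = b_i^{(l+1)} + \sum_{j=1}^{n} W_{ij}^{(l+1)}\, \phi\big(z_j^{(l)}(x)\big).

For any fixed input, each summand has mean zero and variance of order $\sigma_w^2 / n$, and the summands are (asymptotically) i.i.d. across $j$. Summing $n$ of them and letting $n \to \infty$ therefore yields a Gaussian limit for $z_i^{(l+1)}(x)$, jointly over any finite collection of inputs.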

The covariance structure of the limiting Gaussian process is deeply influenced by both the neural network architecture and the choice of activation function. For a simple fully connected network with one hidden layer, the covariance between outputs $f(x)$ and $f(x')$ at initialization is given by

K^{(1)}(x, x') = \sigma_w^2\, \mathbb{E}_{z \sim \mathcal{N}(0, \Sigma^{(0)})}\big[\phi(z_x)\, \phi(z_{x'})\big] + \sigma_b^2,

where $\sigma_w^2$ and $\sigma_b^2$ are the variances of the weights and biases, respectively, and $\Sigma^{(0)}$ is the input covariance matrix:

\Sigma^{(0)} = \begin{pmatrix} x^\top x & x^\top x' \\ x'^\top x & x'^\top x' \end{pmatrix}.
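
As a quick illustration of how this expectation can be evaluated, the sketch below estimates $K^{(1)}(x, x')$ by Monte Carlo, drawing $z = (z_x, z_{x'})$ from $\mathcal{N}(0, \Sigma^{(0)})$; the ReLU activation and the values of $\sigma_w$ and $\sigma_b$ are arbitrary choices for the example:

```python
import numpy as np

def k1(x, xp, phi=lambda z: np.maximum(z, 0.0), sigma_w=1.0, sigma_b=0.1,
       n_samples=200_000, seed=0):
    """Monte Carlo estimate of K^{(1)}(x, x') for a one-hidden-layer network."""
    rng = np.random.default_rng(seed)
    # Sigma^{(0)}: the 2x2 Gram matrix of the two inputs.
    sigma0 = np.array([[x @ x,  x @ xp],
                       [xp @ x, xp @ xp]])
    z = rng.multivariate_normal(np.zeros(2), sigma0, size=n_samples)
    # E[phi(z_x) phi(z_x')] under z ~ N(0, Sigma^{(0)}).
    expectation = np.mean(phi(z[:, 0]) * phi(z[:, 1]))
    return sigma_w**2 * expectation + sigma_b**2

x, xp = np.array([1.0, 0.5]), np.array([0.2, -1.0])
print(k1(x, xp))   # covariance between f(x) and f(x')
print(k1(x, x))    # variance of f(x)
```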

For deeper networks, the covariance is computed recursively:

K^{(l+1)}(x, x') = \sigma_w^2\, \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Sigma^{(l)})}\big[\phi(u)\, \phi(v)\big] + \sigma_b^2,

with $\Sigma^{(l)}$ defined analogously to $\Sigma^{(0)}$, using the entries $K^{(l)}(x, x)$, $K^{(l)}(x, x')$, and $K^{(l)}(x', x')$.
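
Under the same illustrative assumptions as before (Monte Carlo expectations, tanh activation, arbitrary $\sigma_w$ and $\sigma_b$), the recursion can be sketched as follows; in practice one would use closed-form expectations where they exist:

```python
import numpy as np

def nngp_kernel(x, xp, depth=3, phi=np.tanh, sigma_w=1.5, sigma_b=0.1,
                n_samples=200_000, seed=0):
    """Iterate the layer-wise update to obtain K^{(depth)}(x, x')."""
    rng = np.random.default_rng(seed)
    # Entries of Sigma^{(0)}, built from the raw inputs.
    kxx, kxxp, kxpxp = x @ x, x @ xp, xp @ xp
    for _ in range(depth):
        sigma = np.array([[kxx,  kxxp],
                          [kxxp, kxpxp]])
        u, v = rng.multivariate_normal(np.zeros(2), sigma, size=n_samples).T
        # K^{(l+1)} entries from E[phi(u) phi(v)] under N(0, Sigma^{(l)}).
        kxx   = sigma_w**2 * np.mean(phi(u) * phi(u)) + sigma_b**2
        kxpxp = sigma_w**2 * np.mean(phi(v) * phi(v)) + sigma_b**2
        kxxp  = sigma_w**2 * np.mean(phi(u) * phi(v)) + sigma_b**2
    return kxxp

x, xp = np.array([1.0, 0.5]), np.array([0.2, -1.0])
print(nngp_kernel(x, xp, depth=1))   # one hidden layer
print(nngp_kernel(x, xp, depth=3))   # deeper network: the same update, iterated
```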

The activation function $\phi$ determines how the covariance evolves layer by layer. Using $\phi(z) = \mathrm{ReLU}(z)$ or $\phi(z) = \tanh(z)$ leads to different forms of covariance propagation, resulting in distinct function-space priors. The architecture, such as the presence of convolutional layers or skip connections, also alters the recursive structure and the resulting Gaussian process kernel.
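
For example, with $\phi(z) = \mathrm{ReLU}(z)$ the Gaussian expectation in the recursion has a well-known closed form (often called the arc-cosine kernel), so no numerical integration is needed:

\mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Sigma)}\big[\mathrm{ReLU}(u)\,\mathrm{ReLU}(v)\big] = \frac{\sqrt{\Sigma_{11}\Sigma_{22}}}{2\pi}\big(\sin\theta + (\pi - \theta)\cos\theta\big), \qquad \cos\theta = \frac{\Sigma_{12}}{\sqrt{\Sigma_{11}\Sigma_{22}}}.

For $\tanh$ no comparably simple closed form is available, so the expectation is usually computed numerically (or the closely related $\mathrm{erf}$ activation, which does admit a closed form, is used instead). The resulting kernels, and hence the GP priors, differ noticeably between the two activations.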

The mapping from random initial weights to a distribution over functions can be visualized as a process where, for each random draw of weights and biases, the network defines a function $f(x)$ from input space to output space. As the width increases, the randomness in the weights induces a distribution over possible functions, which becomes a Gaussian process in the infinite-width limit.
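
This picture is easy to reproduce: the sketch below (illustrative hyperparameters, matplotlib for plotting) draws a handful of wide random networks on a one-dimensional grid of inputs, so each plotted curve is one sample from the induced prior over functions:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
xs = np.linspace(-3, 3, 200).reshape(-1, 1)   # 1-D inputs on a grid

width, sigma_w, sigma_b = 2048, 1.5, 0.1
for _ in range(5):                            # five independent initializations
    W1 = rng.normal(0.0, sigma_w, size=(1, width))           # fan-in is 1 here
    b1 = rng.normal(0.0, sigma_b, size=width)
    W2 = rng.normal(0.0, sigma_w / np.sqrt(width), size=(width, 1))
    b2 = rng.normal(0.0, sigma_b)
    f = np.tanh(xs @ W1 + b1) @ W2 + b2       # one function drawn from the prior
    plt.plot(xs, f)

plt.title("Function draws from wide random networks (approximate GP prior samples)")
plt.show()
```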
