
Gaussian Process Correspondence at Initialization

At initialization, a fully connected neural network with a large number of hidden units in each layer exhibits a remarkable property: as the width of each layer tends to infinity ($n \to \infty$), the distribution over functions computed by the network converges to a Gaussian process (GP). This correspondence is foundational for understanding the statistical behavior of neural networks in the infinite-width regime.

To state this result precisely, consider a neural network with $L$ layers, where each layer has $n$ neurons and $n \to \infty$. The weights and biases are initialized independently from zero-mean Gaussian distributions, and the activation function $\phi$ is applied elementwise. The output $f(x)$ of the network for input $x$ is therefore a random variable determined by the random initialization of the weights and biases.

The Gaussian process correspondence asserts that, under these conditions and for any finite set of inputs $\{x_1, \dots, x_m\}$, the joint distribution of the outputs $\{f(x_1), \dots, f(x_m)\}$ converges to a multivariate Gaussian as $n \to \infty$. The mean is zero (assuming zero-mean initialization), and the covariance between $f(x)$ and $f(x')$ is determined recursively by the architecture and the activation function.
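
The claim can be probed numerically. Below is a minimal NumPy sketch (not part of the lesson itself) that samples many independent single-hidden-layer networks of large width, using tanh activation, fan-in scaled weight variance, and illustrative values of $\sigma_w$ and $\sigma_b$, and then inspects the empirical mean and covariance of the joint outputs at a few fixed inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def wide_net_outputs(X, width, sigma_w=1.0, sigma_b=0.1):
    """Sample one random single-hidden-layer network and return f(x) for each row of X."""
    d = X.shape[1]
    # Zero-mean Gaussian initialization; weight variance scaled by 1/fan-in.
    W1 = rng.normal(0.0, sigma_w / np.sqrt(d), size=(d, width))
    b1 = rng.normal(0.0, sigma_b, size=width)
    W2 = rng.normal(0.0, sigma_w / np.sqrt(width), size=(width, 1))
    b2 = rng.normal(0.0, sigma_b)
    h = np.tanh(X @ W1 + b1)           # hidden activations, shape (m, width)
    return (h @ W2).ravel() + b2       # outputs f(x_1), ..., f(x_m)

# A small fixed set of inputs {x_1, ..., x_m}.
X = rng.normal(size=(3, 5))

# Many independent initializations give samples of the joint output vector.
samples = np.stack([wide_net_outputs(X, width=4096) for _ in range(2000)])

print("empirical mean:\n", samples.mean(axis=0))      # close to zero
print("empirical covariance:\n", np.cov(samples.T))   # approaches the limiting GP kernel
```

As the width and the number of draws grow, the empirical joint distribution of the outputs is increasingly well described by a zero-mean multivariate Gaussian.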

The key assumptions for this correspondence are:

  • All weights and biases are initialized independently from zero-mean Gaussian distributions, with variances chosen to prevent signal explosion or decay (the standard scaling is written out just after this list);
  • The activation function $\phi$ is measurable and satisfies mild growth conditions (such as bounded moments);
  • The width of each hidden layer tends to infinity, while the depth $L$ is fixed.
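
Concretely, the first assumption is typically met by scaling the weight variance with the layer's fan-in; a standard way to write this choice is

W^{(l)}_{ij} \sim \mathcal{N}\!\left(0, \tfrac{\sigma_w^2}{n_{l-1}}\right), \qquad b^{(l)}_i \sim \mathcal{N}\!\left(0, \sigma_b^2\right),

where $n_{l-1}$ is the width of the previous layer. Dividing the weight variance by the fan-in keeps the variance of each pre-activation of order one as the width grows.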

The derivation proceeds by noting that, due to the Central Limit Theorem, the pre-activations at each hidden layer become jointly Gaussian as the width increases, provided the previous layer's activations are independent and identically distributed across neurons. This holds in the infinite-width limit, allowing you to recursively compute the covariance structure across layers.
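
To see where the Central Limit Theorem enters, write the pre-activation of neuron $i$ in layer $l+1$ as a sum over the $n$ neurons of the previous layer (the indexed notation below is introduced only for this illustration):

z_i^{(l+1)}(x) = b_i^{(l+1)} + \sum_{j=1}^{n} W_{ij}^{(l+1)}\, \phi\big(z_j^{(l)}(x)\big).

For any fixed input, each summand has mean zero and variance of order $\sigma_w^2 / n$, and the summands are (asymptotically) i.i.d. across $j$. Summing $n$ of them and letting $n \to \infty$ therefore yields a Gaussian limit for $z_i^{(l+1)}(x)$, jointly over any finite collection of inputs.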

The covariance structure of the limiting Gaussian process is deeply influenced by both the neural network architecture and the choice of activation function. For a simple fully connected network with one hidden layer, the covariance between outputs $f(x)$ and $f(x')$ at initialization is given by

K^{(1)}(x, x') = \sigma_w^2\, \mathbb{E}_{z \sim \mathcal{N}(0, \Sigma^{(0)})}\big[\phi(z_x)\, \phi(z_{x'})\big] + \sigma_b^2,

where $\sigma_w^2$ and $\sigma_b^2$ are the variances of the weights and biases, respectively, and $\Sigma^{(0)}$ is the input covariance matrix:

\Sigma^{(0)} = \begin{pmatrix} x^\top x & x^\top x' \\ x'^\top x & x'^\top x' \end{pmatrix}.
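
As a quick illustration of how this expectation can be evaluated, the sketch below estimates $K^{(1)}(x, x')$ by Monte Carlo, drawing $z = (z_x, z_{x'})$ from $\mathcal{N}(0, \Sigma^{(0)})$; the ReLU activation and the values of $\sigma_w$ and $\sigma_b$ are arbitrary choices for the example:

```python
import numpy as np

def k1(x, xp, phi=lambda z: np.maximum(z, 0.0), sigma_w=1.0, sigma_b=0.1,
       n_samples=200_000, seed=0):
    """Monte Carlo estimate of K^{(1)}(x, x') for a one-hidden-layer network."""
    rng = np.random.default_rng(seed)
    # Sigma^{(0)}: the 2x2 Gram matrix of the two inputs.
    sigma0 = np.array([[x @ x,  x @ xp],
                       [xp @ x, xp @ xp]])
    z = rng.multivariate_normal(np.zeros(2), sigma0, size=n_samples)
    # E[phi(z_x) phi(z_x')] under z ~ N(0, Sigma^{(0)}).
    expectation = np.mean(phi(z[:, 0]) * phi(z[:, 1]))
    return sigma_w**2 * expectation + sigma_b**2

x, xp = np.array([1.0, 0.5]), np.array([0.2, -1.0])
print(k1(x, xp))   # covariance between f(x) and f(x')
print(k1(x, x))    # variance of f(x)
```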

For deeper networks, the covariance is computed recursively:

K^{(l+1)}(x, x') = \sigma_w^2\, \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Sigma^{(l)})}\big[\phi(u)\, \phi(v)\big] + \sigma_b^2,

with $\Sigma^{(l)}$ defined analogously to $\Sigma^{(0)}$, using the entries $K^{(l)}(x, x)$, $K^{(l)}(x, x')$, and $K^{(l)}(x', x')$.
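
Under the same illustrative assumptions as before (Monte Carlo expectations, tanh activation, arbitrary $\sigma_w$ and $\sigma_b$), the recursion can be sketched as follows; in practice one would use closed-form expectations where they exist:

```python
import numpy as np

def nngp_kernel(x, xp, depth=3, phi=np.tanh, sigma_w=1.5, sigma_b=0.1,
                n_samples=200_000, seed=0):
    """Iterate the layer-wise update to obtain K^{(depth)}(x, x')."""
    rng = np.random.default_rng(seed)
    # Entries of Sigma^{(0)}, built from the raw inputs.
    kxx, kxxp, kxpxp = x @ x, x @ xp, xp @ xp
    for _ in range(depth):
        sigma = np.array([[kxx,  kxxp],
                          [kxxp, kxpxp]])
        u, v = rng.multivariate_normal(np.zeros(2), sigma, size=n_samples).T
        # K^{(l+1)} entries from E[phi(u) phi(v)] under N(0, Sigma^{(l)}).
        kxx   = sigma_w**2 * np.mean(phi(u) * phi(u)) + sigma_b**2
        kxpxp = sigma_w**2 * np.mean(phi(v) * phi(v)) + sigma_b**2
        kxxp  = sigma_w**2 * np.mean(phi(u) * phi(v)) + sigma_b**2
    return kxxp

x, xp = np.array([1.0, 0.5]), np.array([0.2, -1.0])
print(nngp_kernel(x, xp, depth=1))   # one hidden layer
print(nngp_kernel(x, xp, depth=3))   # deeper network: the same update, iterated
```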

The activation function $\phi$ determines how the covariance evolves layer by layer. Using $\phi(z) = \mathrm{ReLU}(z)$ or $\phi(z) = \tanh(z)$ leads to different forms of covariance propagation, resulting in distinct function-space priors. The architecture, such as the presence of convolutional layers or skip connections, also alters the recursive structure and the resulting Gaussian process kernel.
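
For example, with $\phi(z) = \mathrm{ReLU}(z)$ the Gaussian expectation in the recursion has a well-known closed form (often called the arc-cosine kernel), so no numerical integration is needed:

\mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Sigma)}\big[\mathrm{ReLU}(u)\,\mathrm{ReLU}(v)\big] = \frac{\sqrt{\Sigma_{11}\Sigma_{22}}}{2\pi}\big(\sin\theta + (\pi - \theta)\cos\theta\big), \qquad \cos\theta = \frac{\Sigma_{12}}{\sqrt{\Sigma_{11}\Sigma_{22}}}.

For $\tanh$ no comparably simple closed form is available, so the expectation is usually computed numerically (or the closely related $\mathrm{erf}$ activation, which does admit a closed form, is used instead). The resulting kernels, and hence the GP priors, differ noticeably between the two activations.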

The mapping from random initial weights to a distribution over functions can be visualized as a process where, for each random draw of weights and biases, the network defines a function $f(x)$ from input space to output space. As the width increases, the randomness in the weights induces a distribution over possible functions, which becomes a Gaussian process in the infinite-width limit.
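
This picture is easy to reproduce: the sketch below (illustrative hyperparameters, matplotlib for plotting) draws a handful of wide random networks on a one-dimensional grid of inputs, so each plotted curve is one sample from the induced prior over functions:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
xs = np.linspace(-3, 3, 200).reshape(-1, 1)   # 1-D inputs on a grid

width, sigma_w, sigma_b = 2048, 1.5, 0.1
for _ in range(5):                            # five independent initializations
    W1 = rng.normal(0.0, sigma_w, size=(1, width))           # fan-in is 1 here
    b1 = rng.normal(0.0, sigma_b, size=width)
    W2 = rng.normal(0.0, sigma_w / np.sqrt(width), size=(width, 1))
    b2 = rng.normal(0.0, sigma_b)
    f = np.tanh(xs @ W1 + b1) @ W2 + b2       # one function drawn from the prior
    plt.plot(xs, f)

plt.title("Function draws from wide random networks (approximate GP prior samples)")
plt.show()
```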
