Neural Tangent Kernel Theory

Infinite-Width Limit: Formalization and Consequences

To understand the infinite-width limit of neural networks, focus on fully connected architectures where each hidden layer contains an increasingly large number of neurons. The formal definition of the infinite-width limit considers a sequence of neural networks indexed by width, where the number of neurons per layer tends to infinity. The standard setup assumes all weights and biases are initialized independently from a zero-mean Gaussian distribution, with variances carefully scaled according to the layer size. Specifically, if $W_{ij}^{(l)}$ denotes the weight connecting neuron $j$ in layer $l-1$ to neuron $i$ in layer $l$, the weights are initialized as $W_{ij}^{(l)} \sim N(0, \sigma_w^2 / n_{l-1})$, where $n_{l-1}$ is the number of neurons in the previous layer and $\sigma_w^2$ is a constant. Biases are typically initialized as $b_i^{(l)} \sim N(0, \sigma_b^2)$, with $\sigma_b^2$ another constant. This scaling ensures that the variance of pre-activations remains stable as the width grows.
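For concreteness, here is a minimal NumPy sketch of this initialization scheme. The specific values $\sigma_w = 1.0$ and $\sigma_b = 0.1$, the layer sizes, and the tanh nonlinearity are illustrative assumptions, not choices fixed by the text above.

```python
import numpy as np

def init_params(layer_sizes, sigma_w=1.0, sigma_b=0.1, seed=0):
    """Width-scaled Gaussian initialization: W^{(l)} ~ N(0, sigma_w^2 / n_{l-1}),
    b^{(l)} ~ N(0, sigma_b^2), as described above."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # rng.normal takes a standard deviation, so sigma_w / sqrt(n_in)
        # gives the variance sigma_w^2 / n_in used in the text.
        W = rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in))
        b = rng.normal(0.0, sigma_b, size=n_out)
        params.append((W, b))
    return params

def forward(params, x, phi=np.tanh):
    """Forward pass; returns the final-layer pre-activation (the network output)."""
    h = x
    for W, b in params[:-1]:
        h = phi(W @ h + b)   # hidden layers: affine map followed by nonlinearity
    W, b = params[-1]
    return W @ h + b
```

Doubling the width of a hidden layer halves the variance of each incoming weight, so the variance of every pre-activation stays of order one no matter how wide the layers are made.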

The infinite-width limit is formally defined as the limiting behavior of the network's outputs as the widths of all hidden layers tend to infinity, under these initialization and scaling assumptions. In this regime, the distribution of outputs and internal activations is described by deterministic objects rather than by the random weights themselves: at initialization the network behaves as a draw from a Gaussian process whose covariance kernel depends only on the architecture, the activation function, and the constants $\sigma_w^2$ and $\sigma_b^2$.
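To make this Gaussian-process description concrete, the limiting covariance kernel can be written as a layer-wise recursion. The form below is the standard one for a pointwise nonlinearity $\phi$ under the initialization above; since this section does not fix a particular activation, treat it as a template rather than a closed-form result for a specific $\phi$:

$$K^{(1)}(x, x') = \sigma_b^2 + \frac{\sigma_w^2}{n_0}\, x^\top x', \qquad K^{(l+1)}(x, x') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}_{(u, v) \sim N(0,\, \Lambda^{(l)})}\big[\phi(u)\,\phi(v)\big],$$

where $n_0$ is the input dimension and $\Lambda^{(l)}$ is the $2 \times 2$ covariance matrix with entries $K^{(l)}(x, x)$, $K^{(l)}(x, x')$, and $K^{(l)}(x', x')$. In the infinite-width limit, the network output at initialization is a draw from a Gaussian process with this kernel.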

The intuition behind this phenomenon rests on two classical limit theorems. In a wide neural network, each pre-activation (the sum of weighted inputs plus bias, before the nonlinearity is applied) is a sum of many independent terms. Because each weight carries variance $\sigma_w^2 / n_{l-1}$, the total variance of this sum stays of order one as the width grows, and the central limit theorem ensures that the pre-activation's distribution approaches a Gaussian. The randomness of individual pre-activations therefore does not vanish in the infinite-width limit; what becomes deterministic are averages taken over the many neurons of a layer, such as the empirical covariance of activations, which by the law of large numbers concentrates around its expected value as the width grows.
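The sketch below illustrates the central-limit effect for a one-hidden-layer network: across many independent initializations, the output pre-activation for a fixed input keeps an order-one spread, but its distribution stabilizes as the hidden width grows. The widths, sample count, input, and tanh nonlinearity are illustrative assumptions.

```python
import numpy as np

def output_samples(width, x, n_nets=2000, sigma_w=1.0, sigma_b=0.1, seed=1):
    """Sample the scalar output pre-activation of n_nets independently
    initialized one-hidden-layer tanh networks of the given hidden width."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    out = np.empty(n_nets)
    for k in range(n_nets):
        W1 = rng.normal(0.0, sigma_w / np.sqrt(d), size=(width, d))
        b1 = rng.normal(0.0, sigma_b, size=width)
        W2 = rng.normal(0.0, sigma_w / np.sqrt(width), size=width)
        b2 = rng.normal(0.0, sigma_b)
        out[k] = W2 @ np.tanh(W1 @ x + b1) + b2
    return out

x = np.ones(4) / 2.0  # a fixed unit-norm input
for width in (10, 100, 1000):
    z = output_samples(width, x)
    print(width, round(z.mean(), 3), round(z.var(), 3))
# The variance settles to a fixed nonzero value rather than shrinking to zero,
# and the histogram of z approaches a Gaussian as the width grows.
```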

The law of large numbers thus plays a critical role in this high-dimensional setting. Because each neuron in a layer aggregates contributions from an enormous number of neurons in the previous layer, layer-wise averages of the random weights and activations wash out their fluctuations, so quantities built from them, most importantly the covariance kernel of the pre-activations, become deterministic at initialization. This concentration of measure is a hallmark of infinite-width neural networks and underpins many of the theoretical simplifications that arise in this limit, such as the correspondence with Gaussian processes and the tractability of kernel-based analyses.
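By contrast with individual pre-activations, averages over the neurons of a wide layer do concentrate. The sketch below estimates the next layer's covariance for two inputs from a single random one-hidden-layer network and shows the spread of that estimate across random seeds shrinking with width; the inputs, widths, and tanh nonlinearity are again illustrative assumptions.

```python
import numpy as np

def empirical_kernel(width, x1, x2, sigma_w=1.0, sigma_b=0.1, seed=0):
    """One-network estimate of the next layer's covariance,
    sigma_b^2 + sigma_w^2 * mean_i[ phi(z_i(x1)) * phi(z_i(x2)) ],
    where z_i are the hidden pre-activations of a single random network."""
    rng = np.random.default_rng(seed)
    d = x1.shape[0]
    W1 = rng.normal(0.0, sigma_w / np.sqrt(d), size=(width, d))
    b1 = rng.normal(0.0, sigma_b, size=width)
    h1 = np.tanh(W1 @ x1 + b1)
    h2 = np.tanh(W1 @ x2 + b1)
    return sigma_b**2 + sigma_w**2 * np.mean(h1 * h2)

x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([0.6, 0.8, 0.0])
for width in (10, 100, 10000):
    estimates = [empirical_kernel(width, x1, x2, seed=s) for s in range(5)]
    print(width, np.round(estimates, 3))
# The spread across random networks shrinks as the width grows:
# by the law of large numbers the kernel concentrates on a deterministic limit.
```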


