Infinite-Width Neural Networks: Formalization and Assumptions
To understand the infinite-width limit of neural networks, consider a fully connected feedforward network with $L$ layers. Let $N_\ell$ denote the number of neurons in layer $\ell$, where $N_0$ is the input dimension and $N_L$ is the output dimension. For each layer $\ell = 1, \dots, L$, let $W^{(\ell)} \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ be the weight matrix, $b^{(\ell)} \in \mathbb{R}^{N_\ell}$ the bias vector, and $h^{(\ell)} \in \mathbb{R}^{N_\ell}$ the pre-activation vector.
The activations $x^{(\ell)} \in \mathbb{R}^{N_\ell}$ are defined by applying a nonlinearity $\phi$ elementwise:
$$x^{(\ell)} = \phi\!\left(h^{(\ell)}\right),$$
where $\phi$ may be ReLU, tanh, or another activation function.
Given an input $x^{(0)}$, the forward pass is defined recursively by
$$h^{(\ell)} = W^{(\ell)} x^{(\ell-1)} + b^{(\ell)}, \qquad x^{(\ell)} = \phi\!\left(h^{(\ell)}\right), \qquad \ell = 1, \dots, L.$$
The infinite-width limit corresponds to the regime in which the widths of all hidden layers,
$$N_1, N_2, \dots, N_{L-1},$$
tend to infinity. In this limit, the network's behavior is governed by statistical regularities arising from the law of large numbers. As a result, the contribution of any individual neuron becomes negligible, and the network can be described through a mean field formulation that captures the evolution of distributions rather than finite-dimensional parameter vectors.
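Before taking this limit, it helps to see the finite-width recursion as code. The following is a minimal NumPy sketch; the function name `forward_pass`, the choice of `np.tanh` as the nonlinearity $\phi$, and the list-of-arrays parameter layout are illustrative assumptions rather than part of the formal setup above.

```python
import numpy as np

def forward_pass(x0, weights, biases, phi=np.tanh):
    """Apply h^(l) = W^(l) x^(l-1) + b^(l), x^(l) = phi(h^(l)) for l = 1, ..., L."""
    x = x0
    pre_activations = []
    for W, b in zip(weights, biases):
        h = W @ x + b              # pre-activation of layer l
        x = phi(h)                 # elementwise nonlinearity
        pre_activations.append(h)
    return x, pre_activations
```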
Key Mathematical Assumptions for Mean Field Analysis
- Weights and Biases are Initialized Independently: the weights $W_{ij}^{(\ell)}$ and biases $b_i^{(\ell)}$ are assumed to be initialized independently and identically distributed (i.i.d.) across all layers and neurons;
- Initialization Scaling: weights are initialized as $W_{ij}^{(\ell)} \sim \mathcal{N}\!\left(0, \sigma_w^2 / N_{\ell-1}\right)$ and biases as $b_i^{(\ell)} \sim \mathcal{N}\!\left(0, \sigma_b^2\right)$, where $\sigma_w^2$ and $\sigma_b^2$ are fixed variances; this scaling ensures that the pre-activations $h^{(\ell)}$ remain of order one as the width grows (see the sketch after this list);
- Activation Function Regularity: the activation function $\phi$ is chosen so that the moments of the activations exist and remain well-behaved under the induced distributions;
- Input Independence: the input $x^{(0)}$ has a fixed distribution independent of the network's width.
These assumptions ensure that as the width of each hidden layer increases, the network's behavior can be described using mean field theory.
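As a quick check of the scaling assumption, the sketch below samples parameters exactly as described in the list above and prints the empirical variance of the last hidden layer's pre-activations for increasing widths. The specific values $\sigma_w = 1.5$ and $\sigma_b = 0.1$, the tanh nonlinearity, the input dimension 10, and the chosen widths are arbitrary illustrative assumptions.

```python
import numpy as np

def init_params(layer_sizes, sigma_w=1.5, sigma_b=0.1, rng=None):
    """Sample W_ij^(l) ~ N(0, sigma_w^2 / N_{l-1}) and b_i^(l) ~ N(0, sigma_b^2)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    weights, biases = [], []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights.append(rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in)))
        biases.append(rng.normal(0.0, sigma_b, size=n_out))
    return weights, biases

x0 = np.random.default_rng(1).normal(size=10)   # fixed input, independent of width
for width in (64, 256, 1024, 4096):
    weights, biases = init_params([10, width, width])   # two hidden layers of equal width
    x = x0
    for W, b in zip(weights, biases):
        h = W @ x + b
        x = np.tanh(h)
    print(width, h.var())   # roughly constant across widths
```

With the $1/N_{\ell-1}$ scaling, the printed variances fluctuate around a fixed value as the width grows; dropping the scaling would instead make them grow linearly with the width.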
The infinite-width limit is analytically tractable because, as the number of neurons in each layer grows, the central limit theorem implies that the pre-activations at each layer become Gaussian distributed, provided the independence and scaling assumptions hold. This allows the complex, high-dimensional behavior of the network to be described by the evolution of distributional quantities β such as means and variances β rather than by tracking every individual parameter. As a result, the network's behavior can be captured by deterministic equations in the limit of infinite width, greatly simplifying analysis. This tractability enables you to derive precise predictions about signal propagation, training dynamics, and generalization properties, and it forms the foundation for mean field theory in neural networks.
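As one concrete instance of such a deterministic description, the variance $q^{(\ell)} = \mathbb{E}\big[(h_i^{(\ell)})^2\big]$ of a single pre-activation obeys, under the zero-mean i.i.d. initialization above, a one-dimensional recursion in the infinite-width limit. This is the standard mean field variance map, stated here as a sketch rather than derived in full:
$$q^{(1)} = \sigma_w^2\,\frac{\lVert x^{(0)} \rVert^2}{N_0} + \sigma_b^2, \qquad q^{(\ell)} = \sigma_w^2\,\mathbb{E}_{z \sim \mathcal{N}(0,\,q^{(\ell-1)})}\!\left[\phi(z)^2\right] + \sigma_b^2, \quad \ell = 2, \dots, L.$$
Tracking the single scalar $q^{(\ell)}$, rather than all $N_\ell$ coordinates of $h^{(\ell)}$, is exactly the reduction from high-dimensional randomness to deterministic distributional quantities described above.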