Infinite-Width Neural Networks: Formalization and Assumptions

To understand the infinite-width limit of neural networks, consider a fully connected feedforward network with $L$ layers. Let $N_\ell$ denote the number of neurons in layer $\ell$, where $N_0$ is the input dimension and $N_L$ is the output dimension. For each layer $\ell = 1, \ldots, L$, let $W^{(\ell)} \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ be the weight matrix, $b^{(\ell)} \in \mathbb{R}^{N_\ell}$ the bias vector, and $h^{(\ell)} \in \mathbb{R}^{N_\ell}$ the pre-activation vector.

The activations $x^{(\ell)} \in \mathbb{R}^{N_\ell}$ are defined by applying a nonlinearity $\phi$ elementwise:

$$x^{(\ell)} = \phi\!\left(h^{(\ell)}\right),$$

where $\phi$ may be ReLU, tanh, or another activation function.

Given an input $x^{(0)}$, the forward pass is defined recursively by

$$h^{(\ell)} = W^{(\ell)} x^{(\ell-1)} + b^{(\ell)}, \qquad x^{(\ell)} = \phi\!\left(h^{(\ell)}\right), \quad \ell = 1, \ldots, L.$$
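
To make the recursion concrete, here is a minimal NumPy sketch of the forward pass. The choice of tanh for $\phi$, the layer widths, and the randomly drawn parameters are illustrative assumptions, not part of the formal setup.

```python
import numpy as np

def forward(x0, weights, biases, phi=np.tanh):
    """Forward pass: h^(l) = W^(l) x^(l-1) + b^(l),  x^(l) = phi(h^(l))."""
    x = x0
    pre_activations = []
    for W, b in zip(weights, biases):
        h = W @ x + b              # pre-activation of layer l
        x = phi(h)                 # elementwise nonlinearity
        pre_activations.append(h)
    return x, pre_activations

# Illustrative widths N_0, ..., N_3: input, two hidden layers, output.
widths = [3, 100, 100, 1]
rng = np.random.default_rng(0)

# Parameters drawn at random just to exercise the recursion; the
# initialization scaling that keeps this well-behaved is discussed below.
weights = [rng.normal(0.0, 1.0 / np.sqrt(widths[l - 1]),
                      size=(widths[l], widths[l - 1]))
           for l in range(1, len(widths))]
biases = [rng.normal(0.0, 0.1, size=widths[l]) for l in range(1, len(widths))]

x_out, hs = forward(rng.normal(size=widths[0]), weights, biases)
print(x_out.shape)                 # (1,) -- matches the output dimension N_L
```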

The infinite-width limit corresponds to the regime in which the widths of all hidden layers,

$$N_1, N_2, \ldots, N_{L-1},$$

tend to infinity. In this limit, the network's behavior is governed by statistical regularities arising from the law of large numbers. As a result, the contribution of any individual neuron becomes negligible, and the network can be described through a mean field formulation that captures the evolution of distributions rather than finite-dimensional parameter vectors.

Key Mathematical Assumptions for Mean Field Analysis

  • Independent Initialization: the weights $W^{(\ell)}_{ij}$ and biases $b^{(\ell)}_i$ are initialized i.i.d. (independently and identically distributed) across all layers and neurons;
  • Initialization Scaling: weights are initialized as $W^{(\ell)}_{ij} \sim \mathcal{N}(0, \sigma_w^2 / N_{\ell-1})$ and biases as $b^{(\ell)}_i \sim \mathcal{N}(0, \sigma_b^2)$, where $\sigma_w^2$ and $\sigma_b^2$ are fixed variances; this scaling ensures the pre-activations $h^{(\ell)}$ remain of order one as the width grows;
  • Activation Function Regularity: the activation function $\phi$ is chosen so that the moments of the activations exist and remain well-behaved under the induced distributions;
  • Input Independence: the input $x^{(0)}$ has a fixed distribution independent of the network's width.

These assumptions ensure that as the width of each hidden layer increases, the network's behavior can be described using mean field theory.
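
The initialization scaling in particular is easy to state in code. The sketch below samples parameters under these assumptions; the specific values $\sigma_w^2 = 2.0$ and $\sigma_b^2 = 0.05$ are arbitrary demonstration choices, not values prescribed by the theory.

```python
import numpy as np

def init_params(widths, sigma_w2=2.0, sigma_b2=0.05, rng=None):
    """Sample W^(l)_ij ~ N(0, sigma_w^2 / N_{l-1}) and b^(l)_i ~ N(0, sigma_b^2)."""
    if rng is None:
        rng = np.random.default_rng()
    weights, biases = [], []
    for l in range(1, len(widths)):
        fan_in = widths[l - 1]                       # N_{l-1}
        weights.append(rng.normal(0.0, np.sqrt(sigma_w2 / fan_in),
                                  size=(widths[l], fan_in)))
        biases.append(rng.normal(0.0, np.sqrt(sigma_b2), size=widths[l]))
    return weights, biases

# The wider the previous layer, the smaller each weight's variance.
weights, biases = init_params([3, 100, 100, 1])
print([W.var() for W in weights])   # roughly [2/3, 2/100, 2/100]
```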

The infinite-width limit is analytically tractable because, as the number of neurons in each layer grows, the central limit theorem implies that the pre-activations at each layer become Gaussian distributed, provided the independence and scaling assumptions hold. This allows the complex, high-dimensional behavior of the network to be described by the evolution of distributional quantities — such as means and variances — rather than by tracking every individual parameter. As a result, the network's behavior can be captured by deterministic equations in the limit of infinite width, greatly simplifying analysis. This tractability enables you to derive precise predictions about signal propagation, training dynamics, and generalization properties, and it forms the foundation for mean field theory in neural networks.
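
This Gaussian behavior can be probed empirically. The rough Monte Carlo sketch below (an illustration, not part of the formal argument) resamples a small two-layer network many times at several hidden widths and reports the variance and excess kurtosis of one second-layer pre-activation; under the scaling above, the variance stays of order one across widths and the excess kurtosis stays near zero, consistent with an approximately Gaussian distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
x0 = rng.normal(size=10)            # fixed input, N_0 = 10 (arbitrary)
sigma_w2, sigma_b2 = 2.0, 0.05      # same illustrative variances as above

for width in (10, 100, 1000):       # hidden-layer width N_1
    samples = []
    for _ in range(2000):           # 2000 independent initializations
        W1 = rng.normal(0.0, np.sqrt(sigma_w2 / x0.size), size=(width, x0.size))
        b1 = rng.normal(0.0, np.sqrt(sigma_b2), size=width)
        W2 = rng.normal(0.0, np.sqrt(sigma_w2 / width), size=width)
        b2 = rng.normal(0.0, np.sqrt(sigma_b2))
        h2 = W2 @ np.tanh(W1 @ x0 + b1) + b2   # layer-2 pre-activation (scalar)
        samples.append(h2)
    samples = np.asarray(samples)
    # Variance should be roughly width-independent; excess kurtosis near 0
    # indicates approximately Gaussian behavior.
    kurt = np.mean((samples - samples.mean()) ** 4) / samples.var() ** 2 - 3
    print(f"N_1={width:5d}  var={samples.var():.3f}  excess_kurtosis={kurt:+.3f}")
```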
