Mean Field Theory for Neural Networks

Infinite-Width Neural Networks: Formalization and Assumptions

To understand the infinite-width limit of neural networks, consider a fully connected feedforward network with $L$ layers. Let $N_\ell$ denote the number of neurons in layer $\ell$, where $N_0$ is the input dimension and $N_L$ is the output dimension. For each layer $\ell = 1, \ldots, L$, let $W^{(\ell)} \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ be the weight matrix, $b^{(\ell)} \in \mathbb{R}^{N_\ell}$ the bias vector, and $h^{(\ell)} \in \mathbb{R}^{N_\ell}$ the pre-activation vector.

The activations $x^{(\ell)} \in \mathbb{R}^{N_\ell}$ are defined by applying a nonlinearity $\phi$ elementwise:

$$x^{(\ell)} = \phi\left(h^{(\ell)}\right),$$

where $\phi$ may be ReLU, tanh, or another activation function.

Given an input $x^{(0)}$, the forward pass is defined recursively by

$$h^{(\ell)} = W^{(\ell)} x^{(\ell-1)} + b^{(\ell)}, \qquad x^{(\ell)} = \phi\left(h^{(\ell)}\right), \quad \ell = 1, \ldots, L.$$
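To make the recursion concrete, here is a minimal NumPy sketch of the forward pass. The tanh nonlinearity, layer widths, and placeholder weights are illustrative choices, not prescribed by the text; the proper initialization scaling is introduced in the assumptions below.

```python
import numpy as np

def forward(x0, weights, biases, phi=np.tanh):
    """Recursive forward pass: h^(l) = W^(l) x^(l-1) + b^(l), x^(l) = phi(h^(l))."""
    x = x0
    for W, b in zip(weights, biases):
        h = W @ x + b   # pre-activation h^(l)
        x = phi(h)      # activation x^(l), applied elementwise
    return x

# Toy example with N_0 = 3, N_1 = 4, N_2 = 2; the weights here are arbitrary
# placeholders, not the mean field initialization discussed below.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(forward(rng.normal(size=3), weights, biases))
```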

The infinite-width limit corresponds to the regime in which the widths of all hidden layers,

$$N_1, N_2, \ldots, N_{L-1},$$

tend to infinity. In this limit, the network's behavior is governed by statistical regularities arising from the law of large numbers. As a result, the contribution of any individual neuron becomes negligible, and the network can be described through a mean field formulation that captures the evolution of distributions rather than finite-dimensional parameter vectors.

Key Mathematical Assumptions for Mean Field Analysis

  • Weights and Biases Are Initialized Independently: the weights $W^{(\ell)}_{ij}$ and biases $b^{(\ell)}_i$ are drawn independently across all layers and neurons, and identically distributed within each layer;
  • Initialization Scaling: weights are initialized as $W^{(\ell)}_{ij} \sim \mathcal{N}(0, \sigma_w^2 / N_{\ell-1})$ and biases as $b^{(\ell)}_i \sim \mathcal{N}(0, \sigma_b^2)$, where $\sigma_w^2$ and $\sigma_b^2$ are fixed variances; this scaling keeps the pre-activations $h^{(\ell)}$ of order one as the width grows (see the sketch just below this list);
  • Activation Function Regularity: the activation function $\phi$ is chosen so that the moments of the activations exist and remain well-behaved under the induced distributions;
  • Input Independence: the input $x^{(0)}$ has a fixed distribution that does not depend on the network's width.

These assumptions ensure that as the width of each hidden layer increases, the network's behavior can be described using mean field theory.
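As a minimal sketch of the first two assumptions, the snippet below samples weights with variance $\sigma_w^2 / N_{\ell-1}$ and biases with variance $\sigma_b^2$; the particular values of $\sigma_w$, $\sigma_b$, and the widths are arbitrary choices for illustration.

```python
import numpy as np

def mean_field_init(widths, sigma_w=2.0, sigma_b=0.1, seed=0):
    """Sample W^(l)_ij ~ N(0, sigma_w^2 / N_{l-1}) and b^(l)_i ~ N(0, sigma_b^2)."""
    rng = np.random.default_rng(seed)
    weights, biases = [], []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        # Standard deviation sigma_w / sqrt(N_{l-1}) gives variance sigma_w^2 / N_{l-1}.
        weights.append(rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in)))
        biases.append(rng.normal(0.0, sigma_b, size=n_out))
    return weights, biases

# The 1/N_{l-1} scaling keeps second-layer pre-activations of order one as the width grows.
x0 = np.random.default_rng(1).normal(size=100)          # input, N_0 = 100
for width in (64, 256, 1024, 4096):
    weights, biases = mean_field_init([100, width, width], seed=2)
    h1 = weights[0] @ x0 + biases[0]
    h2 = weights[1] @ np.tanh(h1) + biases[1]
    print(f"width {width}: Var(h2) ~ {np.var(h2):.3f}")
```

The printed variances should stay roughly constant across widths; without the $1/N_{\ell-1}$ factor they would grow linearly with the width.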

The infinite-width limit is analytically tractable because, as the number of neurons in each layer grows, the central limit theorem implies that the pre-activations at each layer become Gaussian distributed, provided the independence and scaling assumptions hold. This allows the complex, high-dimensional behavior of the network to be described by the evolution of distributional quantities β€” such as means and variances β€” rather than by tracking every individual parameter. As a result, the network's behavior can be captured by deterministic equations in the limit of infinite width, greatly simplifying analysis. This tractability enables you to derive precise predictions about signal propagation, training dynamics, and generalization properties, and it forms the foundation for mean field theory in neural networks.
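As a rough numerical check (a sketch, not part of the formal development), the snippet below compares the empirical layer-wise pre-activation variance of one wide random tanh network against the standard deterministic mean field variance recursion written in the comments. The width, depth, and variances are arbitrary illustrative choices.

```python
import numpy as np

sigma_w, sigma_b = 2.0, 0.5      # illustrative variances, not fixed by the text
width, depth = 2000, 5
phi = np.tanh
rng = np.random.default_rng(0)

# Empirical pre-activation variances in one wide random network.
x0 = rng.normal(size=width)                       # input x^(0), with N_0 = width for simplicity
x, q_emp = x0, []
for _ in range(depth):
    W = rng.normal(0.0, sigma_w / np.sqrt(x.size), size=(width, x.size))
    b = rng.normal(0.0, sigma_b, size=width)
    h = W @ x + b                                 # approximately Gaussian at large width
    q_emp.append(float(np.var(h)))
    x = phi(h)

# Deterministic mean field recursion: q_l = sigma_w^2 * m_{l-1} + sigma_b^2,
# where m_0 = E[(x^(0))^2] and m_l = E[phi(h)^2] for h ~ N(0, q_l).
z = rng.normal(size=200_000)                      # Monte Carlo samples for the Gaussian expectation
m, q_mf = float(np.mean(x0 ** 2)), []
for _ in range(depth):
    q = sigma_w ** 2 * m + sigma_b ** 2
    q_mf.append(q)
    m = float(np.mean(phi(np.sqrt(q) * z) ** 2))

for layer, (qe, qt) in enumerate(zip(q_emp, q_mf), start=1):
    print(f"layer {layer}: empirical {qe:.3f} vs mean field {qt:.3f}")
```

At this width the two columns should agree to within a few percent, and a histogram of any single layer's pre-activations looks approximately Gaussian.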

