Infinite-Width Neural Networks: Formalization and Assumptions
To understand the infinite-width limit of neural networks, consider a fully connected feedforward network with $L$ layers. Let $N_\ell$ denote the number of neurons in layer $\ell$, where $N_0$ is the input dimension and $N_L$ is the output dimension. For each layer $\ell = 1, \dots, L$, let $W^{(\ell)} \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ be the weight matrix, $b^{(\ell)} \in \mathbb{R}^{N_\ell}$ the bias vector, and $h^{(\ell)} \in \mathbb{R}^{N_\ell}$ the pre-activation vector.
The activations $x^{(\ell)} \in \mathbb{R}^{N_\ell}$ are defined by applying a nonlinearity $\phi$ elementwise:
$$x^{(\ell)} = \phi\!\left(h^{(\ell)}\right),$$
where $\phi$ may be ReLU, tanh, or another activation function.
Given an input $x^{(0)}$, the forward pass is defined recursively by
$$h^{(\ell)} = W^{(\ell)} x^{(\ell-1)} + b^{(\ell)}, \qquad x^{(\ell)} = \phi\!\left(h^{(\ell)}\right), \qquad \ell = 1, \dots, L.$$
The infinite-width limit corresponds to the regime in which the widths of all hidden layers,
$$N_1, N_2, \dots, N_{L-1},$$
tend to infinity. In this limit, the network's behavior is governed by statistical regularities arising from the law of large numbers. As a result, the contribution of any individual neuron becomes negligible, and the network can be described through a mean field formulation that captures the evolution of distributions rather than finite-dimensional parameter vectors.
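Before taking this limit, it helps to see the finite-width recursion as code. The following is a minimal NumPy sketch; the function name `forward_pass`, the choice of `np.tanh` as the nonlinearity $\phi$, and the list-of-arrays parameter layout are illustrative assumptions rather than part of the formal setup above.

```python
import numpy as np

def forward_pass(x0, weights, biases, phi=np.tanh):
    """Apply h^(l) = W^(l) x^(l-1) + b^(l), x^(l) = phi(h^(l)) for l = 1, ..., L."""
    x = x0
    pre_activations = []
    for W, b in zip(weights, biases):
        h = W @ x + b              # pre-activation of layer l
        x = phi(h)                 # elementwise nonlinearity
        pre_activations.append(h)
    return x, pre_activations
```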
Key Mathematical Assumptions for Mean Field Analysis
- Weights and Biases are Initialized Independently: the weights $W_{ij}^{(\ell)}$ and biases $b_i^{(\ell)}$ are assumed to be initialized independently and identically distributed (i.i.d.) across all layers and neurons;
- Initialization Scaling: weights are initialized as $W_{ij}^{(\ell)} \sim \mathcal{N}\!\left(0, \sigma_w^2 / N_{\ell-1}\right)$ and biases as $b_i^{(\ell)} \sim \mathcal{N}\!\left(0, \sigma_b^2\right)$, where $\sigma_w^2$ and $\sigma_b^2$ are fixed variances; this scaling ensures that the pre-activations $h^{(\ell)}$ remain of order one as the width grows (see the sketch after this list);
- Activation Function Regularity: the activation function $\phi$ is chosen so that the moments of the activations exist and remain well-behaved under the induced distributions;
- Input Independence: the input $x^{(0)}$ has a fixed distribution independent of the network's width.
These assumptions ensure that as the width of each hidden layer increases, the network's behavior can be described using mean field theory.
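As a quick check of the scaling assumption, the sketch below samples parameters exactly as described in the list above and prints the empirical variance of the last hidden layer's pre-activations for increasing widths. The specific values $\sigma_w = 1.5$ and $\sigma_b = 0.1$, the tanh nonlinearity, the input dimension 10, and the chosen widths are arbitrary illustrative assumptions.

```python
import numpy as np

def init_params(layer_sizes, sigma_w=1.5, sigma_b=0.1, rng=None):
    """Sample W_ij^(l) ~ N(0, sigma_w^2 / N_{l-1}) and b_i^(l) ~ N(0, sigma_b^2)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    weights, biases = [], []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights.append(rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in)))
        biases.append(rng.normal(0.0, sigma_b, size=n_out))
    return weights, biases

x0 = np.random.default_rng(1).normal(size=10)   # fixed input, independent of width
for width in (64, 256, 1024, 4096):
    weights, biases = init_params([10, width, width])   # two hidden layers of equal width
    x = x0
    for W, b in zip(weights, biases):
        h = W @ x + b
        x = np.tanh(h)
    print(width, h.var())   # roughly constant across widths
```

With the $1/N_{\ell-1}$ scaling, the printed variances fluctuate around a fixed value as the width grows; dropping the scaling would instead make them grow linearly with the width.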
The infinite-width limit is analytically tractable because, as the number of neurons in each layer grows, the central limit theorem implies that the pre-activations at each layer become Gaussian distributed, provided the independence and scaling assumptions hold. This allows the complex, high-dimensional behavior of the network to be described by the evolution of distributional quantities β such as means and variances β rather than by tracking every individual parameter. As a result, the network's behavior can be captured by deterministic equations in the limit of infinite width, greatly simplifying analysis. This tractability enables you to derive precise predictions about signal propagation, training dynamics, and generalization properties, and it forms the foundation for mean field theory in neural networks.
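As one concrete instance of such a deterministic description, the variance $q^{(\ell)} = \mathbb{E}\big[(h_i^{(\ell)})^2\big]$ of a single pre-activation obeys, under the zero-mean i.i.d. initialization above, a one-dimensional recursion in the infinite-width limit. This is the standard mean field variance map, stated here as a sketch rather than derived in full:
$$q^{(1)} = \sigma_w^2\,\frac{\lVert x^{(0)} \rVert^2}{N_0} + \sigma_b^2, \qquad q^{(\ell)} = \sigma_w^2\,\mathbb{E}_{z \sim \mathcal{N}(0,\,q^{(\ell-1)})}\!\left[\phi(z)^2\right] + \sigma_b^2, \quad \ell = 2, \dots, L.$$
Tracking the single scalar $q^{(\ell)}$, rather than all $N_\ell$ coordinates of $h^{(\ell)}$, is exactly the reduction from high-dimensional randomness to deterministic distributional quantities described above.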