Neural Tangent Kernel Theory

Infinite-Width Limit: Formalization and Consequences

To understand the infinite-width limit of neural networks, focus on fully connected architectures where each hidden layer contains an increasingly large number of neurons. The formal definition of the infinite-width limit considers a sequence of neural networks indexed by width, where the number of neurons per layer tends to infinity. The standard setup assumes all weights and biases are initialized independently from a zero-mean Gaussian distribution, with variances carefully scaled according to the layer size. Specifically, if $W_{ij}^{(l)}$ denotes the weight connecting neuron $j$ in layer $l-1$ to neuron $i$ in layer $l$, the weights are initialized as $W_{ij}^{(l)} \sim N(0, \sigma_w^2 / n_{l-1})$, where $n_{l-1}$ is the number of neurons in the previous layer and $\sigma_w^2$ is a constant. Biases are typically initialized as $b_i^{(l)} \sim N(0, \sigma_b^2)$, with $\sigma_b^2$ another constant. This scaling ensures that the variance of pre-activations remains stable as the width grows.
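For concreteness, here is a minimal NumPy sketch of this initialization scheme. The specific values $\sigma_w = 1.0$ and $\sigma_b = 0.1$, the layer sizes, and the tanh nonlinearity are illustrative assumptions, not choices fixed by the text above.

```python
import numpy as np

def init_params(layer_sizes, sigma_w=1.0, sigma_b=0.1, seed=0):
    """Width-scaled Gaussian initialization: W^{(l)} ~ N(0, sigma_w^2 / n_{l-1}),
    b^{(l)} ~ N(0, sigma_b^2), as described above."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # rng.normal takes a standard deviation, so sigma_w / sqrt(n_in)
        # gives the variance sigma_w^2 / n_in used in the text.
        W = rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in))
        b = rng.normal(0.0, sigma_b, size=n_out)
        params.append((W, b))
    return params

def forward(params, x, phi=np.tanh):
    """Forward pass; returns the final-layer pre-activation (the network output)."""
    h = x
    for W, b in params[:-1]:
        h = phi(W @ h + b)   # hidden layers: affine map followed by nonlinearity
    W, b = params[-1]
    return W @ h + b
```

Doubling the width of a hidden layer halves the variance of each incoming weight, so the variance of every pre-activation stays of order one no matter how wide the layers are made.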

The infinite-width limit is formally defined as the limiting behavior of the network's outputs as the widths of all hidden layers tend to infinity, under these initialization and scaling assumptions. In this regime, the distribution of outputs and internal activations is described by deterministic objects rather than by the random weights themselves: at initialization the network behaves as a draw from a Gaussian process whose covariance kernel depends only on the architecture, the activation function, and the constants $\sigma_w^2$ and $\sigma_b^2$.
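To make this Gaussian-process description concrete, the limiting covariance kernel can be written as a layer-wise recursion. The form below is the standard one for a pointwise nonlinearity $\phi$ under the initialization above; since this section does not fix a particular activation, treat it as a template rather than a closed-form result for a specific $\phi$:

$$K^{(1)}(x, x') = \sigma_b^2 + \frac{\sigma_w^2}{n_0}\, x^\top x', \qquad K^{(l+1)}(x, x') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}_{(u, v) \sim N(0,\, \Lambda^{(l)})}\big[\phi(u)\,\phi(v)\big],$$

where $n_0$ is the input dimension and $\Lambda^{(l)}$ is the $2 \times 2$ covariance matrix with entries $K^{(l)}(x, x)$, $K^{(l)}(x, x')$, and $K^{(l)}(x', x')$. In the infinite-width limit, the network output at initialization is a draw from a Gaussian process with this kernel.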

The intuition behind this phenomenon rests on two classical limit theorems. In a wide neural network, each pre-activation (the sum of weighted inputs plus bias, before the nonlinearity is applied) is a sum of many independent terms. Because each weight carries variance $\sigma_w^2 / n_{l-1}$, the total variance of this sum stays of order one as the width grows, and the central limit theorem ensures that the pre-activation's distribution approaches a Gaussian. The randomness of individual pre-activations therefore does not vanish in the infinite-width limit; what becomes deterministic are averages taken over the many neurons of a layer, such as the empirical covariance of activations, which by the law of large numbers concentrates around its expected value as the width grows.
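The sketch below illustrates the central-limit effect for a one-hidden-layer network: across many independent initializations, the output pre-activation for a fixed input keeps an order-one spread, but its distribution stabilizes as the hidden width grows. The widths, sample count, input, and tanh nonlinearity are illustrative assumptions.

```python
import numpy as np

def output_samples(width, x, n_nets=2000, sigma_w=1.0, sigma_b=0.1, seed=1):
    """Sample the scalar output pre-activation of n_nets independently
    initialized one-hidden-layer tanh networks of the given hidden width."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    out = np.empty(n_nets)
    for k in range(n_nets):
        W1 = rng.normal(0.0, sigma_w / np.sqrt(d), size=(width, d))
        b1 = rng.normal(0.0, sigma_b, size=width)
        W2 = rng.normal(0.0, sigma_w / np.sqrt(width), size=width)
        b2 = rng.normal(0.0, sigma_b)
        out[k] = W2 @ np.tanh(W1 @ x + b1) + b2
    return out

x = np.ones(4) / 2.0  # a fixed unit-norm input
for width in (10, 100, 1000):
    z = output_samples(width, x)
    print(width, round(z.mean(), 3), round(z.var(), 3))
# The variance settles to a fixed nonzero value rather than shrinking to zero,
# and the histogram of z approaches a Gaussian as the width grows.
```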

The law of large numbers thus plays a critical role in this high-dimensional setting. Because each neuron in a layer aggregates contributions from an enormous number of neurons in the previous layer, layer-wise averages of the random weights and activations wash out their fluctuations, so quantities built from them, most importantly the covariance kernel of the pre-activations, become deterministic at initialization. This concentration of measure is a hallmark of infinite-width neural networks and underpins many of the theoretical simplifications that arise in this limit, such as the correspondence with Gaussian processes and the tractability of kernel-based analyses.
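By contrast with individual pre-activations, averages over the neurons of a wide layer do concentrate. The sketch below estimates the next layer's covariance for two inputs from a single random one-hidden-layer network and shows the spread of that estimate across random seeds shrinking with width; the inputs, widths, and tanh nonlinearity are again illustrative assumptions.

```python
import numpy as np

def empirical_kernel(width, x1, x2, sigma_w=1.0, sigma_b=0.1, seed=0):
    """One-network estimate of the next layer's covariance,
    sigma_b^2 + sigma_w^2 * mean_i[ phi(z_i(x1)) * phi(z_i(x2)) ],
    where z_i are the hidden pre-activations of a single random network."""
    rng = np.random.default_rng(seed)
    d = x1.shape[0]
    W1 = rng.normal(0.0, sigma_w / np.sqrt(d), size=(width, d))
    b1 = rng.normal(0.0, sigma_b, size=width)
    h1 = np.tanh(W1 @ x1 + b1)
    h2 = np.tanh(W1 @ x2 + b1)
    return sigma_b**2 + sigma_w**2 * np.mean(h1 * h2)

x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([0.6, 0.8, 0.0])
for width in (10, 100, 10000):
    estimates = [empirical_kernel(width, x1, x2, seed=s) for s in range(5)]
    print(width, np.round(estimates, 3))
# The spread across random networks shrinks as the width grows:
# by the law of large numbers the kernel concentrates on a deterministic limit.
```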


