Absence of Feature Learning in NTK Theory
To understand the absence of feature learning in Neural Tangent Kernel (NTK) theory, recall that in the infinite-width limit, a neural network's training dynamics can be described by a fixed, deterministic kernel. This kernel, known as the NTK, is determined entirely by the network's architecture and initialization statistics, not by data-driven adaptation during training. Formally, in the NTK regime, the kernel function Θ(x, x′) remains constant throughout training, so the network's predictions evolve linearly in the space defined by this kernel. This linearization means that the underlying representation, the mapping from inputs to features in hidden layers, does not change as the network learns. All learning occurs in the output layer, with the rest of the network acting as a static feature extractor determined at initialization. As a result, the possibility of adapting or discovering new features based on the data is precluded, and the network cannot perform feature learning in the sense of evolving its internal representations.
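For concreteness, here is a minimal NumPy sketch (a toy setup, not code from this lesson) of what the NTK regime formalizes: the empirical kernel Θ(x, x′) is the inner product of the network's parameter gradients evaluated at initialization, and the linearized model f(x; θ) ≈ f(x; θ₀) + ∇θf(x; θ₀)·(θ − θ₀) only ever learns within that fixed feature space. The one-hidden-layer architecture, width, and inputs below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
width = 512
W1 = rng.normal(0.0, 1.0, size=(width, 1))                   # input dimension 1
W2 = rng.normal(0.0, 1.0 / np.sqrt(width), size=(1, width))  # rough NTK-style scaling

def forward(x, W1, W2):
    """Scalar output of the toy one-hidden-layer network."""
    return W2 @ np.tanh(W1 @ x)

def jacobian(x, W1, W2):
    """Gradient of the scalar output w.r.t. all parameters, flattened."""
    h = np.tanh(W1 @ x)                      # hidden activations, shape (width, 1)
    dW2 = h.T                                # d out / d W2, shape (1, width)
    dW1 = (W2.T * (1.0 - h**2)) @ x.T        # d out / d W1, shape (width, 1)
    return np.concatenate([dW1.ravel(), dW2.ravel()])

# The empirical NTK between two inputs is the inner product of their parameter
# gradients at the initial parameters. In the NTK regime this quantity stays
# (approximately) constant for the whole of training.
x, x_prime = np.array([[0.3]]), np.array([[-1.2]])
theta_xx = jacobian(x, W1, W2) @ jacobian(x_prime, W1, W2)
print("f(x) at init:", forward(x, W1, W2).item())
print("Empirical NTK  Theta(x, x') at init:", theta_xx)
```

Because the Jacobian is frozen at its initial value, training the linearized model amounts to kernel regression with this fixed Θ; nothing about the hidden-layer mapping itself is updated.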
Intuitively, this absence of feature learning under NTK dynamics arises because, in the infinite-width limit, the gradients of the network with respect to its parameters become tightly concentrated around their mean values. This concentration causes the updates to the network's parameters to be so small and uniform that the hidden-layer activations remain effectively unchanged during training. In other words, the network is "frozen" in its initial configuration, and only the linear combination of these fixed features is adjusted to fit the training data. Contrast this with finite-width neural networks, where the hidden layers can adapt and develop new data-dependent representations, allowing for richer, hierarchical feature learning. In the NTK regime, however, the expressivity of the model is fundamentally limited to what can be achieved with the fixed, initial features, and the network cannot automatically discover new structures or representations as training progresses.
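To make the "frozen features" intuition tangible, the following toy experiment (the widths, learning rate, and sine-regression data are made-up illustrative choices, not part of this lesson) trains a narrow and a very wide one-hidden-layer network with plain gradient descent and measures how far the hidden activations drift from their initial values; for the wide network the relative change is expected to be much smaller.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(32, 1))   # toy 1-D regression inputs
y = np.sin(3 * X)                      # toy targets

def relative_feature_change(width, steps=200, lr=0.1):
    """Train a one-hidden-layer net and return the relative drift of its hidden features."""
    W1 = rng.normal(0.0, 1.0, size=(1, width))
    W2 = rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, 1))
    h0 = np.tanh(X @ W1)                        # hidden features at initialization
    for _ in range(steps):
        h = np.tanh(X @ W1)
        err = h @ W2 - y                        # residuals, shape (32, 1)
        # Gradients of the mean squared error w.r.t. both layers.
        gW2 = h.T @ err / len(X)
        gW1 = X.T @ ((err @ W2.T) * (1 - h**2)) / len(X)
        W1 -= lr * gW1
        W2 -= lr * gW2
    h = np.tanh(X @ W1)
    return np.linalg.norm(h - h0) / np.linalg.norm(h0)

for width in (16, 4096):
    print(f"width={width:5d}  relative change in hidden activations:"
          f" {relative_feature_change(width):.4f}")
```

The narrow network must reshape its hidden features to fit the data, while the wide network can fit the same targets almost entirely by adjusting the readout of its (essentially static) initial features, which is exactly the behavior the NTK limit describes.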
To visualize this difference, imagine side-by-side diagrams: on one side, a finite-width network where the hidden-layer representations (depicted as evolving feature maps) change and become more specialized as training proceeds; on the other, an infinite-width network in the NTK regime, where the feature maps remain static throughout and only the output weights are updated. The former illustrates genuine feature learning, while the latter highlights the fixed-kernel constraint that characterizes NTK theory.