Absence of Feature Learning in NTK Theory
To understand the absence of feature learning in Neural Tangent Kernel (NTK) theory, recall that in the infinite-width limit, a neural network's training dynamics can be described by a fixed, deterministic kernel. This kernel, known as the NTK, is determined entirely by the network's architecture and initialization statistics, not by data-driven adaptation during training. Formally, in the NTK regime, the kernel function Θ(x, x′) remains constant throughout training, so the network's predictions evolve linearly in the space defined by this kernel. This linearization means that the underlying representation, the mapping from inputs to features in hidden layers, does not change as the network learns. All learning occurs in the output layer, with the rest of the network acting as a static feature extractor determined at initialization. As a result, the possibility of adapting or discovering new features based on the data is precluded, and the network cannot perform feature learning in the sense of evolving its internal representations.
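For concreteness, here is a minimal NumPy sketch (a toy setup, not code from this lesson) of what the NTK regime formalizes: the empirical kernel Θ(x, x′) is the inner product of the network's parameter gradients evaluated at initialization, and the linearized model f(x; θ) ≈ f(x; θ₀) + ∇θf(x; θ₀)·(θ − θ₀) only ever learns within that fixed feature space. The one-hidden-layer architecture, width, and inputs below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
width = 512
W1 = rng.normal(0.0, 1.0, size=(width, 1))                   # input dimension 1
W2 = rng.normal(0.0, 1.0 / np.sqrt(width), size=(1, width))  # rough NTK-style scaling

def forward(x, W1, W2):
    """Scalar output of the toy one-hidden-layer network."""
    return W2 @ np.tanh(W1 @ x)

def jacobian(x, W1, W2):
    """Gradient of the scalar output w.r.t. all parameters, flattened."""
    h = np.tanh(W1 @ x)                      # hidden activations, shape (width, 1)
    dW2 = h.T                                # d out / d W2, shape (1, width)
    dW1 = (W2.T * (1.0 - h**2)) @ x.T        # d out / d W1, shape (width, 1)
    return np.concatenate([dW1.ravel(), dW2.ravel()])

# The empirical NTK between two inputs is the inner product of their parameter
# gradients at the initial parameters. In the NTK regime this quantity stays
# (approximately) constant for the whole of training.
x, x_prime = np.array([[0.3]]), np.array([[-1.2]])
theta_xx = jacobian(x, W1, W2) @ jacobian(x_prime, W1, W2)
print("f(x) at init:", forward(x, W1, W2).item())
print("Empirical NTK  Theta(x, x') at init:", theta_xx)
```

Because the Jacobian is frozen at its initial value, training the linearized model amounts to kernel regression with this fixed Θ; nothing about the hidden-layer mapping itself is updated.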
Intuitively, this absence of feature learning under NTK dynamics arises because, in the infinite-width limit, the gradients of the network with respect to its parameters become tightly concentrated around their mean values. This concentration causes the updates to the network's parameters to be so small and uniform that the hidden-layer activations remain effectively unchanged during training. In other words, the network is "frozen" in its initial configuration, and only the linear combination of these fixed features is adjusted to fit the training data. Contrast this with finite-width neural networks, where the hidden layers can adapt and develop new data-dependent representations, allowing for richer, hierarchical feature learning. In the NTK regime, however, the expressivity of the model is fundamentally limited to what can be achieved with the fixed, initial features, and the network cannot automatically discover new structures or representations as training progresses.
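To make the "frozen features" intuition tangible, the following toy experiment (the widths, learning rate, and sine-regression data are made-up illustrative choices, not part of this lesson) trains a narrow and a very wide one-hidden-layer network with plain gradient descent and measures how far the hidden activations drift from their initial values; for the wide network the relative change is expected to be much smaller.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(32, 1))   # toy 1-D regression inputs
y = np.sin(3 * X)                      # toy targets

def relative_feature_change(width, steps=200, lr=0.1):
    """Train a one-hidden-layer net and return the relative drift of its hidden features."""
    W1 = rng.normal(0.0, 1.0, size=(1, width))
    W2 = rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, 1))
    h0 = np.tanh(X @ W1)                        # hidden features at initialization
    for _ in range(steps):
        h = np.tanh(X @ W1)
        err = h @ W2 - y                        # residuals, shape (32, 1)
        # Gradients of the mean squared error w.r.t. both layers.
        gW2 = h.T @ err / len(X)
        gW1 = X.T @ ((err @ W2.T) * (1 - h**2)) / len(X)
        W1 -= lr * gW1
        W2 -= lr * gW2
    h = np.tanh(X @ W1)
    return np.linalg.norm(h - h0) / np.linalg.norm(h0)

for width in (16, 4096):
    print(f"width={width:5d}  relative change in hidden activations:"
          f" {relative_feature_change(width):.4f}")
```

The narrow network must reshape its hidden features to fit the data, while the wide network can fit the same targets almost entirely by adjusting the readout of its (essentially static) initial features, which is exactly the behavior the NTK limit describes.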
To visualize this difference, imagine side-by-side diagrams: on one side, a finite-width network where the hidden-layer representations (depicted as evolving feature maps) change and become more specialized as training proceeds; on the other, an infinite-width network in the NTK regime, where the feature maps remain static throughout and only the output weights are updated. The former illustrates genuine feature learning, while the latter highlights the fixed-kernel constraint that characterizes NTK theory.