Absence of Feature Learning in NTK Theory
To understand the absence of feature learning in Neural Tangent Kernel (NTK) theory, recall that in the infinite-width limit, a neural network’s training dynamics can be described by a fixed, deterministic kernel. This kernel, known as the NTK, is determined entirely by the network’s architecture and initialization statistics, not by data-driven adaptation during training. Formally, in the NTK regime, the kernel function Θ(x, x′) remains constant throughout training, and the network behaves like its first-order Taylor expansion around initialization, f(x; θ) ≈ f(x; θ₀) + ∇θ f(x; θ₀)·(θ − θ₀), so its predictions evolve as those of a linear model in the space defined by this kernel. This linearization means that the underlying representation—the mapping from inputs to features in hidden layers—does not change as the network learns. All learning amounts to fitting a linear readout on a fixed feature map, the parameter gradients at initialization, with the network acting as a static feature extractor determined at initialization. As a result, the possibility of adapting or discovering new features based on the data is precluded, and the network cannot perform feature learning in the sense of evolving its internal representations.
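This fixed-kernel picture can be made concrete with a small numerical sketch. The snippet below is purely illustrative (the toy network, its sizes, and the helper names `empirical_ntk` and `linearized_model` are assumptions, not taken from any reference implementation): it computes the empirical NTK as the inner product of parameter gradients and builds the linearized model, the first-order Taylor expansion around initialization that the NTK regime effectively trains.

```python
# Illustrative sketch (assumed setup): empirical NTK and the linearized model
# around the initial parameters theta0.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def mlp(params, x):
    """Tiny two-layer scalar-output network; params = (W1, b1, W2, b2)."""
    W1, b1, W2, b2 = params
    h = jnp.tanh(x @ W1 + b1)          # hidden-layer features
    return (h @ W2 + b2).squeeze()     # scalar output

def empirical_ntk(params, x1, x2):
    """Theta(x1, x2) = <grad_theta f(x1; theta), grad_theta f(x2; theta)>."""
    g1, _ = ravel_pytree(jax.grad(mlp)(params, x1))
    g2, _ = ravel_pytree(jax.grad(mlp)(params, x2))
    return g1 @ g2

def linearized_model(params0, params, x):
    """f_lin(x) = f(x; theta0) + grad_theta f(x; theta0) . (theta - theta0)."""
    delta = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
    f0, df = jax.jvp(lambda p: mlp(p, x), (params0,), (delta,))
    return f0 + df

# Example usage with random initial parameters (hypothetical sizes).
key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
d, width = 3, 1024
params0 = (jax.random.normal(k1, (d, width)) / jnp.sqrt(d),
           jnp.zeros(width),
           jax.random.normal(k2, (width, 1)) / jnp.sqrt(width),
           jnp.zeros(1))
x1, x2 = jnp.ones(d), jnp.arange(d, dtype=jnp.float32)
print(empirical_ntk(params0, x1, x2))
```

At finite width this empirical kernel is only an approximation; in the infinite-width limit it concentrates around the deterministic Θ(x, x′) described above and stays constant during training, which is exactly why the feature map never evolves.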
Intuitively, this absence of feature learning under NTK dynamics arises because, in the infinite-width limit, the empirical kernel concentrates around its deterministic mean and the update to any individual parameter scales like 1/√width. Each weight therefore moves so little that the hidden layer activations remain effectively unchanged during training, even though the aggregate effect of all these tiny updates still changes the output by an order-one amount. In other words, the network is “frozen” in its initial configuration, and only the linear combination of these fixed features is adjusted to fit the training data. Contrast this with finite-width neural networks, where the hidden layers can adapt and develop new data-dependent representations—allowing for richer, hierarchical feature learning. In the NTK regime, however, the expressivity of the model is fundamentally limited to what can be achieved with the fixed, initial features, and the network cannot automatically discover new structures or representations as training progresses.
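A quick way to see this “frozen features” effect numerically is to take one gradient step and measure how much the hidden activations move as the width grows. The sketch below is an assumption-laden toy experiment (the network, the 1/√width scaling on the readout layer, the widths, and the single training example are all illustrative choices):

```python
# Illustrative sketch (assumed setup): relative change of hidden activations
# after one gradient step, for increasing width under NTK-style output scaling.
import jax
import jax.numpy as jnp

def hidden(params, x):
    """Hidden-layer activations (the 'features' in question)."""
    W1, b1, _ = params
    return jnp.tanh(x @ W1 + b1)

def forward(params, x, width):
    """Scalar output with the 1/sqrt(width) NTK scaling on the readout layer."""
    W1, b1, W2 = params
    return jnp.tanh(x @ W1 + b1) @ W2 / jnp.sqrt(width)

def loss(params, x, y, width):
    return 0.5 * (forward(params, x, width) - y) ** 2

x = jnp.ones(4)   # a single 4-dimensional input
y = 1.0           # a single scalar target
for width in [64, 1024, 16384]:
    k1, k2 = jax.random.split(jax.random.PRNGKey(width))
    params = (jax.random.normal(k1, (4, width)) / jnp.sqrt(4.0),
              jnp.zeros(width),
              jax.random.normal(k2, (width,)))
    grads = jax.grad(loss)(params, x, y, width)
    new_params = jax.tree_util.tree_map(lambda p, g: p - 1.0 * g, params, grads)
    h0, h1 = hidden(params, x), hidden(new_params, x)
    rel_change = jnp.linalg.norm(h1 - h0) / jnp.linalg.norm(h0)
    print(width, float(rel_change))   # shrinks as width grows
```

With this scaling, each hidden unit’s pre-activation changes by roughly 1/√width per step, so the relative change in the feature vector shrinks toward zero as the width increases, while the output itself still moves by an order-one amount.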
To visualize this difference, imagine side-by-side diagrams: on one side, a finite-width network where the hidden layer representations (depicted as evolving feature maps) change and become more specialized as training proceeds; on the other, an infinite-width NTK regime network, where the feature maps remain static throughout, and only the output weights are updated. The former illustrates genuine feature learning, while the latter highlights the fixed-kernel constraint that characterizes NTK theory.