Author: Denis Avetisyan
New research reveals that overparameterized neural networks exhibit surprisingly different extrapolation behaviors depending on the input’s proximity to the origin.
This paper demonstrates quadratic extrapolation near the origin for two-layer ReLU networks trained under the Neural Tangent Kernel, contrasting with linear behavior further away.
While over-parameterized neural networks are known to exhibit linear extrapolation behavior, this phenomenon isn’t universally consistent across the input space. This paper, ‘A Special Case of Quadratic Extrapolation Under the Neural Tangent Kernel’, investigates extrapolation properties within the Neural Tangent Kernel (NTK) regime, revealing a surprising divergence from linearity. Specifically, we demonstrate that a two-layer ReLU network trained with the NTK exhibits quadratic extrapolation near the origin, contrasting with its linear behavior further away. Understanding this nuanced behavior, and the implications of a non-translationally invariant feature map, could unlock new insights into generalization and the limits of predictive power in deep learning.
The Paradox of Overparameterization
The escalating performance of machine learning models increasingly relies on architectures with far more parameters than strictly necessary for the task – a phenomenon known as overparameterization. While seemingly counterintuitive, this approach frequently leads to superior generalization capabilities, defying traditional statistical understanding which predicts overfitting. Investigating this behavior is therefore paramount, as it challenges core assumptions about model complexity and the learning process. Current research suggests that overparameterization can facilitate optimization landscapes with desirable properties, allowing algorithms to find solutions that effectively capture underlying data patterns. Understanding how these models behave, particularly during training, is not merely an academic exercise; it’s foundational for developing more robust, reliable, and efficient machine learning systems capable of tackling increasingly complex problems and avoiding catastrophic failures in real-world applications.
The Neural Tangent Kernel (NTK) regime describes a fascinating dynamic that emerges when training very wide neural networks. In this scenario, as the network’s width approaches infinity, the training process becomes remarkably simple – effectively equivalent to training a linear model in a high-dimensional feature space defined by the network’s initial weights. This simplification allows for rigorous mathematical analysis, yielding predictable characteristics like convergence rates and generalization bounds. Crucially, the kernel – representing the similarity between data points in this feature space – remains fixed throughout training, greatly easing the computational burden of tracking the network’s evolution. Consequently, the NTK regime provides a powerful theoretical lens for understanding why even massively overparameterized neural networks can successfully learn and generalize, offering insights into the interplay between network architecture, training data, and model performance – particularly when considering the function $f(\theta)$ and its gradient $ \nabla_{\theta} f(\theta)$.
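To make the linearization concrete, here is a minimal sketch in Python (NumPy). The two-layer ReLU width, scaling, and perturbation size are illustrative assumptions rather than the paper’s exact setup; the point is simply that the network’s output after a small parameter move is well predicted by the linear model $f(x; \theta_0) + \nabla_{\theta} f(x; \theta_0)\cdot(\theta-\theta_0)$, whose feature map is the gradient at initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small two-layer ReLU network f(x; theta) = v^T relu(W x).
# Width and 1/sqrt scaling are illustrative choices, not the paper's exact setup.
d, m = 3, 2048
W0 = rng.normal(size=(m, d)) / np.sqrt(d)
v0 = rng.normal(size=m) / np.sqrt(m)

def f(x, W, v):
    return v @ np.maximum(W @ x, 0.0)

def grad_f(x, W, v):
    """Gradient of f w.r.t. (W, v), flattened -- the NTK feature map at x."""
    pre = W @ x
    gate = (pre > 0).astype(float)      # indicator 1[Wx > 0]
    dW = np.outer(v * gate, x)          # df/dW
    dv = np.maximum(pre, 0.0)           # df/dv = relu(Wx)
    return np.concatenate([dW.ravel(), dv])

# Perturb the parameters slightly, as NTK-regime training does, and compare the
# true output with the first-order Taylor prediction in parameter space.
x = rng.normal(size=d)
delta = 1e-2 * rng.normal(size=m * d + m)
W1 = W0 + delta[: m * d].reshape(m, d)
v1 = v0 + delta[m * d:]

exact = f(x, W1, v1)
linear = f(x, W0, v0) + grad_f(x, W0, v0) @ delta
print("exact:", exact, "linearized:", linear)
```

For a sufficiently wide network and the small parameter displacements typical of this regime, the two printed values nearly coincide, which is the sense in which training reduces to a linear (kernel) problem.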
A rigorous examination of overparameterized models necessitates a carefully constructed problem setup, beginning with a labeled training set. This dataset, consisting of input features and corresponding correct outputs, forms the foundation for evaluating model performance and generalization capabilities. Crucially, the characteristics of this set – its size, distribution, and the noise it contains – directly influence the resulting analysis. Furthermore, inherent constraints, such as regularization terms or limitations on model complexity, play a vital role in shaping the learning process and preventing overfitting. Defining these constraints allows researchers to isolate specific aspects of model behavior and derive meaningful theoretical results; without a clear and consistent problem setup, interpretations of model dynamics become ambiguous and lack generalizability. Ultimately, the quality and careful specification of the training data and its associated constraints are paramount for drawing robust conclusions about the behavior of neural networks in the overparameterized regime.
Architectural Foundations: Activation and Kernel Dynamics
The model utilizes an overparameterized Multi-Layer Perceptron (MLP) architecture, meaning the number of parameters significantly exceeds the number of data points. This design choice facilitates training and generalization. The MLP consists of multiple fully-connected layers, where each neuron applies a weighted sum of its inputs followed by a non-linear activation function. Specifically, the Rectified Linear Unit (ReLU) – defined as $f(x) = \max(0, x)$ – is employed as the activation function throughout the network. Overparameterization in conjunction with ReLU activations is critical for establishing the theoretical properties that underpin the model’s behavior and training dynamics.
The Rectified Linear Unit (ReLU) activation function, defined as $f(x) = \max(0, x)$, shares a direct mathematical relationship with the indicator function. The indicator function, denoted as $\mathbb{1}_{x>0}$, returns 1 if the condition $x > 0$ is true and 0 otherwise. Consequently, the ReLU function can be expressed as the product of the input $x$ and the Heaviside step function – a form of the indicator function – effectively gating the input signal. This characteristic defines the “active region” of a neuron; when the input $x$ is positive, the neuron transmits the signal, and when it is negative or zero, the output is zero, creating a piecewise linear function and introducing sparsity into the network’s computations.
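As a quick numerical sanity check of the identity $\mathrm{ReLU}(x) = x \cdot \mathbb{1}_{x>0}$ (the sample points below are arbitrary):

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 9)
relu = np.maximum(x, 0.0)
gated = x * (x > 0)               # x times the indicator 1[x > 0]
print(np.allclose(relu, gated))   # True: ReLU(x) = x * 1[x > 0]
```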
The Neural Tangent Kernel (NTK) Gram matrix, denoted as $\Theta$, directly influences the training process of wide neural networks. This matrix, computed from the Jacobian of the network’s output with respect to its parameters, captures the relationships between different training examples in the function space. During training with gradient descent, the network’s parameters evolve such that the output converges to a function dictated by the eigenvectors and eigenvalues of $\Theta$. Specifically, the training dynamics are linearized around the initial parameters, and the effective learning rate is determined by the eigenvalues of the NTK. A larger eigenvalue corresponds to a faster rate of change for the associated eigenvector, indicating a more significant impact on the output function during training. Consequently, the spectral properties of $\Theta$ fundamentally govern the convergence speed and final performance of the model.
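A minimal sketch of how the Gram matrix and its spectrum enter the picture, assuming a toy dataset and an illustrative width, learning rate, and time horizon (none of which come from the paper): under linearized gradient flow, the component of the training residual along eigenvector $u_k$ of $\Theta$ shrinks by $e^{-\eta \lambda_k t}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset and a two-layer ReLU net at initialization (illustrative sizes).
n, d, m = 8, 3, 1024
X = rng.normal(size=(n, d))
W = rng.normal(size=(m, d)) / np.sqrt(d)
v = rng.normal(size=m) / np.sqrt(m)

def ntk_feature(x):
    """Gradient of f(x) = v^T relu(W x) w.r.t. (W, v), flattened."""
    pre = W @ x
    gate = (pre > 0).astype(float)
    return np.concatenate([np.outer(v * gate, x).ravel(), np.maximum(pre, 0.0)])

# NTK Gram matrix: Theta_ij = <grad_theta f(x_i), grad_theta f(x_j)>.
Phi = np.stack([ntk_feature(x) for x in X])
Theta = Phi @ Phi.T

# Per-eigenmode decay factors exp(-lr * lambda_k * t) under linearized gradient
# flow: directions with large eigenvalues are fit quickly, small ones slowly.
lam, U = np.linalg.eigh(Theta)
lr, t = 0.1, 10.0
print("eigenvalues:        ", np.round(lam, 3))
print("residual decay at t:", np.round(np.exp(-lr * lam * t), 4))
```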
The Geometry of Extrapolation: Unveiling Higher-Order Behavior
Analysis of the model’s behavior reveals a quadratic extrapolation characteristic, meaning its output demonstrates a predictable, second-order polynomial relationship with its input when values are near the origin ($x = 0$). Empirical results consistently support this finding; deviations from a quadratic function are minimal within a defined neighborhood of zero. This extrapolation isn’t simply an approximation but a fundamental property of the model’s architecture and activation function, suggesting a structured response to inputs that allows for accurate prediction within a limited range surrounding the origin.
Analysis demonstrates that all higher-order derivatives of the model beyond the second order are demonstrably zero. This finding is mathematically proven through the model’s construction and validated by empirical results. Consequently, the extrapolation behavior is definitively quadratic; the model’s output changes in relation to its input according to an $x^2$ relationship in the vicinity of the origin. The absence of third- and higher-order derivatives directly establishes the quadratic nature of the extrapolation, confirming it is not a more complex polynomial function.
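One way to probe this kind of claim numerically is a finite-difference estimate of the second derivative of a predictor. The sketch below uses a hand-built Huber-like surrogate, quadratic near the origin and linear beyond, purely as a stand-in for the trained network’s prediction (it is not the paper’s model), to show what such a check looks like: the estimate is roughly constant and non-zero where the behavior is quadratic, and roughly zero where it is linear.

```python
import numpy as np

def surrogate(x):
    """Toy stand-in for the trained predictor: quadratic for |x| <= 1,
    linear beyond (Huber-like). NOT the actual network, just an illustration."""
    return np.where(np.abs(x) <= 1.0, 0.5 * x**2, np.abs(x) - 0.5)

def second_derivative(f, x, h=1e-3):
    """Central finite-difference estimate of f''(x)."""
    return float((f(x + h) - 2.0 * f(x) + f(x - h)) / h**2)

# Near the origin the estimate is ~1 (quadratic regime);
# far from it the estimate is ~0 (locally linear regime).
for x in [0.1, 0.5, 3.0, 10.0]:
    print(f"x = {x:5.1f}   f''(x) ~ {second_derivative(surrogate, x):.4f}")
```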
The connection between model activation and observed higher-order effects is mathematically formalized through the distributional derivative of the indicator function. This derivative is represented by the Dirac delta function, $ \delta(x) $, which, while formally zero everywhere except at $ x=0 $, possesses an integral of one. Consequently, the Dirac delta function acts as a singular impulse, effectively linking the activation function, defined by the indicator function, to the calculated higher-order derivatives. This allows for a precise characterization of the model’s behavior beyond the initial linear response, confirming the quadratic extrapolation observed in the results.
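For reference, the standard distributional identities behind this statement can be written out directly (these are textbook facts about ReLU and the Heaviside step, not results specific to this paper):

$$
\mathrm{ReLU}(x) = x\,\mathbb{1}_{x>0}, \qquad
\frac{d}{dx}\,\mathrm{ReLU}(x) = \mathbb{1}_{x>0} + x\,\delta(x) = \mathbb{1}_{x>0}, \qquad
\frac{d}{dx}\,\mathbb{1}_{x>0} = \delta(x),
$$

since $x\,\delta(x) = 0$ as a distribution. The Dirac delta is thus exactly the object that appears once a second derivative is taken through the ReLU gate, which is how the indicator-based activation connects to the higher-order derivatives in the analysis.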
Decoding the Kernel: Implications for Generalization
The Neural Tangent Kernel (NTK) representation reveals that a model’s learning dynamics are intimately connected to its higher-order derivatives; specifically, the $\beta$ components within the NTK are directly shaped by these rates of change. These components don’t simply reflect the initial conditions or first-order behavior of the network, but instead encapsulate how the model responds to input variations. A substantial change in the higher-order derivatives, indicating a sharper or more nuanced response, manifests as a corresponding alteration in the $\beta$ values. This connection implies that understanding the higher-order derivatives is crucial for deciphering how an overparameterized neural network generalizes beyond its training data, providing insights into its ability to extrapolate and make accurate predictions on unseen inputs.
The behavior of Neural Tangent Kernel (NTK) representations is significantly altered by manipulating the training data itself. Specifically, a modified dataset, termed Phi_Infinity, achieves this by systematically shifting data points away from the origin in input space. This seemingly simple adjustment has a profound effect on the beta components of the NTK, which govern the model’s learning dynamics and generalization ability. By moving the data distribution, Phi_Infinity effectively reshapes the landscape of higher-order derivatives influencing these components, impacting how the model extrapolates beyond the training data. This approach reveals that the model’s sensitivity to input changes isn’t solely determined by the data’s inherent properties, but also by its position within the input space, offering a novel pathway to control and understand overparameterized model behavior.
Recent investigations into the Neural Tangent Kernel (NTK) representation demonstrate a surprising simplification in the learning dynamics of highly overparameterized models. It appears that higher-order derivatives, crucial for understanding extrapolation capabilities, aren’t independent forces, but are, in fact, fundamentally dependent on lower-order components. This interrelationship dramatically reduces the model’s complexity; instead of needing to analyze an infinite hierarchy of derivatives, researchers can now leverage this dependency to arrive at a closed-form solution. Consequently, the model’s behavior becomes more predictable and analytically tractable, offering a powerful new lens through which to examine the process of learning and generalization in these complex systems, potentially unlocking advancements in areas like robust AI and efficient model design. The discovery suggests that the effective dimensionality of the learning problem is significantly lower than previously assumed.
The study meticulously reveals a shift in extrapolation behavior within over-parameterized neural networks: a move from linear behavior at a distance to quadratic behavior near the origin. This nuanced understanding of the network’s response echoes a sentiment articulated by Carl Friedrich Gauss: “Few things are more important than the ability to simplify.” The research doesn’t simply add complexity to the existing body of knowledge regarding the Neural Tangent Kernel; instead, it removes assumptions about consistent extrapolation, revealing a more elegant, albeit conditional, truth. The finding illustrates that a system attempting to model all behaviors equally has, in a sense, already failed to achieve true clarity.
Where Do We Go From Here?
The observation of differing extrapolation behavior – quadratic near the origin, linear elsewhere – within the Neural Tangent Kernel framework is not, itself, a revelation. Rather, it’s a precise delineation of what was already suspected: that these networks, despite their apparent universality, possess a hidden geometry. The focus now must shift from simply demonstrating this behavior to explaining it. The current work illuminates a feature, but the underlying principle – why this particular partitioning of extrapolation regimes? – remains opaque. The temptation to invoke increasingly complex architectures must be resisted; simplicity, after all, is the ultimate elegance.
A critical limitation lies in the strict confines of the two-layer ReLU network. While a useful starting point, it’s a caricature of the networks actually deployed. Extending this analysis to deeper networks, or those employing different activation functions, will undoubtedly reveal further nuances – and likely, further complications. However, the goal should not be to catalogue every possible behavior, but to identify the core principles governing extrapolation, principles that transcend specific architectural choices. Intuition suggests that the Gram matrix of the Neural Tangent Kernel holds the key; a deeper understanding of its spectral properties is paramount.
Ultimately, the true test will be to move beyond analysis and towards control. Can this observed quadratic behavior be harnessed? Can we design networks that deliberately exploit this property to improve generalization performance? The pursuit of perfect interpolation is a fool’s errand. The real challenge lies in building systems that extrapolate, not flawlessly, but intelligently. And that, as always, requires parsimony.
Original article: https://arxiv.org/pdf/2512.15749.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/