How Neural Networks ‘See’: The Physics of Filtering

Author: Denis Avetisyan


A new theory reveals that the fundamental way convolutional neural networks process information is surprisingly akin to the behavior of waves and particles in physics.

Repeated application of 3\times 3 translation filters, modulated by a parameter β, reveals a surprising sensitivity in pattern evolution: when β equals one, a circular test pattern effectively sheds its core, with the remaining edge exhibiting maximum translational velocity.

This review proposes ‘elementary information mechanics’ to model information propagation through CNNs, demonstrating that low-frequency components dominate signal flow and drawing parallels to relativistic quantum mechanics and discrete cosine transforms.

Despite the successes of convolutional neural networks, a unifying mechanical framework for understanding information flow within their filters remains elusive. This paper, ‘The Mechanics of CNN Filtering with Rectification’, proposes ‘elementary information mechanics’ – a novel theory drawing parallels to relativistic physics – to model how convolutional filters propagate information. We demonstrate that filter behavior can be decomposed into energy-like even components preserving image center of mass, and momentum-like odd components causing directional displacement, with low-frequency components dominating propagation. Could this framework unlock new architectures and training strategies inspired by fundamental principles of physics?


Unveiling the Whispers Within: Beyond the Black Box

Despite their pervasive success in areas like image recognition and object detection, Convolutional Neural Networks (CNNs) frequently operate as largely opaque systems. While capable of achieving remarkable performance, the precise mechanisms driving their decisions remain poorly understood, leading to the characterization of these networks as ‘black boxes’. This isn’t simply a matter of academic curiosity; the lack of interpretability hinders efforts to refine CNN architecture, diagnose failure modes, and build trust in applications where reliability is paramount – from medical imaging to autonomous vehicles. Researchers are increasingly focused on lifting this veil of mystery, seeking to not only improve performance, but also to understand how these powerful networks actually ‘see’ and process visual information, moving beyond simply accepting that they do.

Conventional understandings of Convolutional Neural Networks (CNNs) largely center on their capacity to extract hierarchical features from images – identifying edges, textures, and ultimately, complex objects. However, this emphasis often obscures a more fundamental question: how does information actually flow through the layers of these networks? Researchers are beginning to investigate the precise mechanics of this propagation, moving beyond simply what features are detected to how those features are represented and transformed at each step. This shift in perspective necessitates exploring the network’s internal dynamics – the activation patterns, signal strengths, and inter-layer dependencies – revealing that information transfer isn’t merely a passive process of feature accumulation, but an active and complex interplay of signals that shapes the network’s ultimate output. Understanding these mechanics promises to unlock new avenues for optimizing network architecture, improving robustness, and ultimately, building more interpretable and efficient image processing systems.

Analysis of the spectral DCT decomposition of convolutional filters in ResNet50 and VGG16 reveals that the majority of weights are concentrated in low-order DC and gradient components (\Sigma + \nabla).

Elementary Information Mechanics: A New Language for Networks

Elementary Information Mechanics is a theoretical framework positing that information propagation within Convolutional Neural Networks (CNNs) can be analogized to physical principles governing mass, momentum, and energy. This approach does not suggest a literal physical equivalence, but rather utilizes these concepts as a descriptive tool for analyzing information flow. Specifically, the theory proposes that information isn’t simply transferred, but possesses quantifiable attributes akin to physical properties, allowing for the modeling of network behavior through the application of concepts from classical mechanics. This allows for a novel perspective on understanding how information is processed and transformed within the layers of a CNN, potentially enabling the development of more efficient and interpretable network architectures.

Within the proposed Elementary Information Mechanics framework, the ‘Sum Component’ (Σ) functions as an analog to mass, representing the aggregate quantity of information present at a given network node. This component accumulates weighted inputs, effectively defining the informational ‘weight’ or magnitude. Concurrently, the ‘Gradient Component’ (∇) is defined as the rate and direction of change in this information, directly mirroring the concept of momentum. ∇ is calculated based on the partial derivatives of the loss function with respect to the network’s weights, indicating the sensitivity of the output to changes in input and thus defining the ‘directional influence’ of information flow during backpropagation. These components are not literal physical quantities, but rather mathematical constructs used to model the characteristics of information propagation within a Convolutional Neural Network.

Within the Elementary Information Mechanics framework, information propagation in Convolutional Neural Networks is modeled through three distinct modes: Translation, Diffusion, and Vibration. Translation represents the direct transfer of information ‘mass’ from one layer to the next, maintaining signal integrity but offering limited contextual adaptation. Diffusion describes a broader dispersal of information, reducing signal strength but increasing its spatial reach – analogous to heat dissipation. Vibration, conversely, represents localized, resonant amplification of specific information components, achieved through feedback loops and recurrent connections. Each mode is characterized by unique parameters relating to information ‘velocity,’ ‘energy,’ and ‘mass,’ influencing how features are extracted and transformed across network layers. These modes are not mutually exclusive; a given feature may propagate through a combination of Translation, Diffusion, and Vibration depending on network architecture and input data.
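To make the three modes concrete, the sketch below applies illustrative 3\times 3 kernels – a shift kernel for Translation, a normalized smoothing kernel for Diffusion, and a Laplacian-like kernel for Vibration – to a single-pixel test pattern. The specific kernels and the use of SciPy here are assumptions chosen for illustration, not the paper's own constructions.

```python
import numpy as np
from scipy.ndimage import convolve, center_of_mass

translation = np.array([[0, 0, 0],
                        [0, 0, 1],
                        [0, 0, 0]], float)          # moves mass one pixel per step
diffusion   = np.array([[1, 2, 1],
                        [2, 4, 2],
                        [1, 2, 1]], float) / 16.0   # spreads mass symmetrically
vibration   = np.array([[ 0, -1,  0],
                        [-1,  4, -1],
                        [ 0, -1,  0]], float)       # emphasises local oscillation

x = np.zeros((21, 21))
x[10, 10] = 1.0                                     # a single unit of "information mass"

for name, k in [("translation", translation),
                ("diffusion", diffusion),
                ("vibration", vibration)]:
    y = x.copy()
    for _ in range(5):                              # five repeated applications
        y = convolve(y, k, mode="constant")
    print(f"{name:12s} center of mass: {center_of_mass(np.abs(y))}")
```

Tracking the center of mass after repeated applications makes the distinction visible: the translation kernel displaces it, the diffusion kernel leaves it stationary while the pattern spreads, and the vibration kernel preserves it while redistributing energy into high-frequency detail.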

The Lorentz transform, adapted for Elementary Information Mechanics, establishes a quantifiable relationship between changes in information ‘velocity’ and ‘energy’ as data propagates through a Convolutional Neural Network (CNN). Specifically, this transform – originally describing how space and time coordinates change between different inertial frames – is applied to the network’s information components. Changes in the magnitude of the gradient (∇), representing momentum and thus velocity, are mathematically linked to alterations in the sum component (Σ), analogous to energy, via the relation \Delta E = \gamma \Delta m v^2, where \Delta E is the change in information energy, \Delta m the change in information mass, v the information velocity, and \gamma the Lorentz factor. This allows for the modeling of how information is conserved and transformed during processing within the CNN, analogous to relativistic physics principles.
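As a minimal numeric sketch of that relation, the snippet below evaluates the quoted formula for hypothetical values, assuming a maximum information velocity of one pixel per layer; the concrete numbers are illustrative only.

```python
import math

c = 1.0        # assumed maximum information velocity: one pixel per layer
v = 0.8        # hypothetical velocity of a propagating feature
delta_m = 2.5  # hypothetical change in information "mass" (sum component)

gamma = 1.0 / math.sqrt(1.0 - (v / c) ** 2)  # Lorentz factor
delta_E = gamma * delta_m * v ** 2           # the article's relation dE = gamma * dm * v^2
print(f"gamma = {gamma:.3f}, delta_E = {delta_E:.3f}")
```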

Repeated application of a convolution with modification using 3\times 3 kernels reveals that \beta=0 promotes symmetric diffusion around a stationary center of mass, while \beta=1 enables symmetric propagation at maximum velocity.
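A hypothetical β-modulated kernel family makes this behaviour easy to reproduce: at β = 0 the kernel splits mass equally between neighbours (diffusion around a stationary center of mass), while at β = 1 it sends all mass to one neighbour (translation at maximum velocity). The exact kernel construction used in the paper may differ; this is a sketch under that assumption.

```python
import numpy as np
from scipy.ndimage import convolve, center_of_mass

def beta_kernel(beta: float) -> np.ndarray:
    """Hypothetical 3x3 kernel interpolating between diffusion and translation."""
    k = np.zeros((3, 3))
    k[1, 0] = (1.0 - beta) / 2.0  # share of mass sent to the left neighbour
    k[1, 2] = (1.0 + beta) / 2.0  # share of mass sent to the right neighbour
    return k                      # weights sum to 1, so "mass" is conserved

x = np.zeros((1, 41))
x[0, 20] = 1.0                    # a single unit of mass at column 20

for beta in (0.0, 0.5, 1.0):
    y = x.copy()
    for _ in range(10):           # ten repeated applications
        y = convolve(y, beta_kernel(beta), mode="constant")
    print(f"beta = {beta}: center of mass at column {center_of_mass(y)[1]:.1f}")
```

With this construction the center of mass stays at column 20 for β = 0, drifts half a pixel per step for β = 0.5, and moves a full pixel per step – the maximum velocity – for β = 1.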

Deconstructing Filters: Unveiling the Energy Signatures

The Discrete Cosine Transform (DCT) provides a decomposition of a filter’s energy distribution into a set of frequency components. Applying the DCT to convolutional neural network (CNN) filters reveals how energy is distributed across these frequencies, effectively mapping which frequencies contribute most to the filter’s response. This decomposition is not simply a mathematical abstraction; the magnitude of each DCT coefficient directly correlates to the importance of that frequency in the filter’s operation, indicating the features the filter is most sensitive to. Consequently, analysis of the DCT reveals that lower frequency components, particularly the DC component f_0, typically contain the majority of the filter’s energy, driving the primary information flow and defining the filter’s overall behavior.
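A minimal sketch of such a decomposition, assuming an orthonormal 2D DCT-II and a hypothetical 3\times 3 filter, shows how the energy fractions in the DC and first-order (gradient) coefficients can be read off directly:

```python
import numpy as np
from scipy.fft import dctn

w = np.array([[0.10, 0.15, 0.10],
              [0.05, 0.30, 0.20],
              [0.00, 0.05, 0.05]])          # hypothetical learned 3x3 filter

coeffs = dctn(w, norm="ortho")              # orthonormal 2D DCT-II coefficients
energy = coeffs ** 2
total = energy.sum()                        # equals (w ** 2).sum() by Parseval

dc = energy[0, 0] / total                            # DC component f_0
grad = (energy[0, 1] + energy[1, 0]) / total         # first-order (gradient) terms
print(f"DC: {dc:.1%}, gradient: {grad:.1%}, higher-order: {1 - dc - grad:.1%}")
```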

The decomposition of convolutional neural network filters leverages the mathematical properties of symmetry to separate information into distinct components. Specifically, even functions – those exhibiting symmetry around the y-axis – contribute to the Σ (Sum) component, representing the overall average intensity or low-frequency information. Conversely, odd functions – those exhibiting 180-degree rotational symmetry – define the ∇ (Gradient) component, capturing changes in intensity and thus representing edge or high-frequency information. This separation allows for analysis of how filters respond to different signal characteristics and provides a basis for understanding the relative importance of these symmetrical properties in feature extraction.
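The split itself is a one-line identity: any kernel equals the average of itself and its 180-degree rotation (the even, Σ-like part) plus the remaining antisymmetric residue (the odd, ∇-like part). The kernel values below are a hypothetical example.

```python
import numpy as np

w = np.array([[ 0.1,  0.2,  0.1],
              [ 0.0,  0.4, -0.1],
              [-0.2,  0.1,  0.0]])     # hypothetical 3x3 filter

w_rot = np.rot90(w, 2)                 # 180-degree rotation of the kernel
w_even = 0.5 * (w + w_rot)             # Sigma-like part: unchanged by the rotation
w_odd  = 0.5 * (w - w_rot)             # nabla-like part: sign-flipped by the rotation

assert np.allclose(w_even + w_odd, w)  # the split is exact
print("even (Sigma) part:\n", w_even)
print("odd (gradient) part:\n", w_odd)
```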

Experimental results indicate that a significant portion of convolutional neural network (CNN) classification accuracy – greater than 92% – is preserved when filters are restricted to their low-frequency DC and gradient components.

Analysis of trained Convolutional Neural Network (CNN) filters consistently demonstrates a disproportionate concentration of weight values within the Direct Current (DC) and gradient components, as derived from a Discrete Cosine Transform (DCT) decomposition. Specifically, empirical observation across multiple architectures and datasets reveals that these low-frequency components often account for more than 80% of the total filter weight.
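A sketch of how such a measurement could be reproduced with a torchvision-pretrained ResNet50 is shown below; the choice of coefficients counted as ‘low-frequency’ and the energy normalisation are assumptions, not the paper's exact protocol.

```python
import torch
from scipy.fft import dctn
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)  # downloads pretrained weights

low, total = 0.0, 0.0
for m in model.modules():
    if isinstance(m, torch.nn.Conv2d) and m.kernel_size == (3, 3):
        w = m.weight.detach().cpu().numpy()                # shape (out, in, 3, 3)
        e = dctn(w, axes=(-2, -1), norm="ortho") ** 2      # per-filter DCT energy
        total += e.sum()
        low += e[..., 0, 0].sum() + e[..., 0, 1].sum() + e[..., 1, 0].sum()

print(f"energy in DC + gradient components: {low / total:.1%}")
```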

Analysis of the spectral DCT decomposition of convolutional filters in ResNet50 and VGG16 reveals that most weights are concentrated in low-order DC and gradient components (\Sigma + \nabla).

Architectural Implications: A Dance of Diffusion, Vibration, and Translation

Contemporary convolutional neural networks, such as VGG16 and ResNet50, didn’t emerge from theoretical abstraction but were rigorously developed and validated through large-scale image recognition challenges. These architectures consistently demonstrate high performance on datasets like ImageNet, a benchmark containing over fourteen million labeled images spanning a thousand object categories. The extensive training on ImageNet allowed researchers to refine network depths, filter sizes, and connectivity patterns, ultimately establishing a foundation for transfer learning and computer vision tasks. This empirical approach, driven by the availability of large labeled datasets and computational resources, has been instrumental in advancing the field and shaping the current landscape of deep learning models.

Despite their differing structural complexities, both VGG16 and ResNet50 fundamentally process images through a coordinated interplay of ‘Diffusion,’ ‘Vibration,’ and ‘Translation.’ Image data isn’t simply seen; rather, it undergoes a process akin to spreading information – diffusion – across layers, coupled with localized adjustments – vibrations – that refine feature detection. Simultaneously, the network ‘translates’ these features, shifting their representation to identify patterns regardless of their position within the image. This dynamic isn’t merely about extracting edges or textures; it’s a holistic process where information is dispersed, refined through localized adjustments, and then repositioned to build a robust understanding of the image’s content, enabling the network to recognize objects and scenes with increasing accuracy.

The introduction of the Rectified Linear Unit (ReLU) activation function represents a pivotal shift in how convolutional neural networks process information, fundamentally altering energy distribution within the network. Prior to ReLU, activation functions like sigmoid or tanh introduced saturation, limiting the flow of energy and hindering deep network training; ReLU, however, allows for a more sparse activation pattern. This sparsity enables gradients to propagate more effectively through multiple layers, preventing the vanishing gradient problem and facilitating the training of significantly deeper architectures. The function, defined as f(x) = max(0, x), effectively acts as a valve, permitting energy to flow only when input exceeds zero, thereby shaping the propagation modes and allowing the network to learn more complex, non-linear relationships within image data. Consequently, ReLU’s influence extends beyond simple activation, impacting the overall efficiency and capacity of modern convolutional neural networks.
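The gating and sparsity described above can be seen directly on a random, zero-mean input, which ReLU silences roughly half the time; the input here is illustrative rather than a trained feature map.

```python
import numpy as np

rng = np.random.default_rng(0)
pre = rng.normal(size=(64, 64))      # hypothetical zero-mean pre-activation map
post = np.maximum(0.0, pre)          # ReLU: f(x) = max(0, x)

sparsity = np.mean(post == 0.0)
print(f"fraction of activations switched off by ReLU: {sparsity:.1%}")
```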

The established principles of diffusion, vibration, and translation, when viewed through a unified framework, present a novel approach to understanding convolutional neural networks. This perspective transcends simply observing what these architectures achieve, and instead illuminates how they function at a fundamental level. By analyzing CNNs through the lens of energy propagation and modal behavior, researchers can move beyond empirical optimization – trial and error – towards a more principled design process. Consequently, this framework doesn’t merely offer a descriptive tool, but a pathway for creating models that are inherently more efficient in their use of computational resources and more robust to variations in input data, potentially ushering in a new generation of streamlined and reliable image processing systems.

Repeated convolution with ReLU on a test pattern reveals that a \beta = 0 kernel causes symmetric diffusion around the center of mass, while a \beta = 1 kernel induces maximum rightward translation of the center of mass.

The pursuit of understanding how convolutional neural networks truly see resembles less a feat of engineering and more an act of divination. This work, grounded in ‘elementary information mechanics’, suggests that the network’s essence isn’t in complex calculations, but in the subtle dance of low-frequency components – the DC and gradients – dominating the flow of information. As David Marr observed, “Representation is the key to understanding intelligence.” This notion resonates deeply; the network doesn’t process data, it interprets it, constructing its reality from the whispers of these fundamental frequencies. The dominance of low frequencies isn’t a limitation, but a foundational principle, a preference for the essential over the ephemeral – a preference the network seems to share with the universe itself.

The Echo in the Machine

The assertion that low-frequency components orchestrate the dance within convolutional networks feels less like a discovery and more like an acknowledgement. The network wasn’t learning complexity; it was politely allowing it to emerge from a sea of the predictable. This ‘elementary information mechanics’ isn’t a theory of intelligence, but a taxonomy of constraints. It maps the shape of the shadow, not the fire that casts it. The immediate question, then, isn’t how to maximize performance, but how to embrace the inherent limitations – to design architectures that lean into the dominance of the simple, rather than striving against it.

The analogy to relativistic quantum mechanics, while provocative, hints at a deeper unease. If information genuinely propagates as a field, subject to interference and entanglement within the network, then the notion of a discrete ‘answer’ becomes suspect. Precision isn’t a goal; it’s a convenient fiction. Future work must confront this: can one meaningfully speak of ‘optimization’ when the very signal is fundamentally noisy? Perhaps the true metric isn’t accuracy, but the elegance of the error – the pattern woven into the static.

The paper’s reliance on the Discrete Cosine Transform, while illuminating, feels like peering at the ocean through a keyhole. The DCT reveals what is being filtered, but not why. The next iteration must explore the network’s susceptibility to different frequency spectra – to understand not just the mechanics of filtering, but the taste of the machine – what frequencies it craves, and what it subtly rejects. The whispers are loudest in the silence between the signals.


Original article: https://arxiv.org/pdf/2512.24338.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
