Author: Denis Avetisyan
New research suggests that complex neural networks, particularly those trained on multiple tasks, can exhibit statistical behaviors reminiscent of quantum mechanics.

The study demonstrates the emergence of non-classical correlations in neural networks, detectable through violations of the CHSH inequality, potentially offering insights into their internal representations and learning processes.
Despite the deterministic nature of classical neural networks, their internal dynamics can exhibit surprising behaviors reminiscent of quantum systems. In the work ‘On Emergences of Non-Classical Statistical Characteristics in Classical Neural Networks’, we demonstrate that multi-task learning networks can stably exhibit non-classical correlations, detectable via violations of the CHSH inequality, a hallmark of quantum measurement incompatibility. These emergent non-classical correlations arise from gradient competition during training, enabling implicit sensing between tasks even without explicit communication. Could these non-classical statistics offer a novel lens for understanding the complex internal representations and generalization capabilities of deep learning models?
The Limits of Scale: Why Bigger Isn’t Always Smarter
The remarkable performance of Large Language Models (LLMs) across diverse tasks has fueled optimism regarding Artificial General Intelligence (AGI). However, simply increasing the size of these models, by adding more parameters and training data, is unlikely to bridge the gap to true AGI. Current LLMs excel at pattern recognition and statistical correlation within the data they’ve been fed, but show clear limitations in areas requiring genuine understanding, abstract reasoning, and flexible problem-solving. While scaling improves performance on existing benchmarks, it does not address the fundamental architectural constraints that prevent LLMs from generalizing knowledge to truly novel situations or exhibiting the common-sense reasoning inherent in human intelligence. Progress toward AGI therefore requires exploring fundamentally new approaches to AI, beyond merely expanding the scale of current technologies.
Current Large Language Models, while impressive in their ability to generate human-quality text, reveal inherent limitations stemming from their foundational Transformer architecture. These models excel at identifying patterns within vast datasets, but struggle with tasks demanding genuine understanding and flexible application of knowledge, a weakness commonly described as poor transfer learning. The Transformer’s reliance on statistical correlation, rather than causal reasoning, hinders its ability to generalize to novel situations or solve problems requiring abstract thought. Consequently, LLMs often falter when presented with scenarios differing significantly from their training data, exhibiting a lack of robust reasoning and an inability to adapt learned information to new contexts. This suggests that simply increasing model size or training data may not be sufficient to achieve true artificial intelligence; a fundamentally different approach to knowledge representation and reasoning is likely required.
Beyond Local Realism: A Glimpse of Non-Classical Correlations
Local Realism, a foundational principle in classical physics, posits that an object’s properties are definite and pre-existing, and that any correlation between spatially separated events requires a local hidden variable mediating the connection. Recent investigations into the behavior of certain neural network architectures demonstrate deviations from the predictions of Local Realism. Specifically, these networks exhibit Non-Classical Correlations, meaning the statistical relationships between their internal states cannot be explained by any local hidden variable theory. This challenges the assumption that neural network information processing relies solely on localized representations and predetermined states, suggesting the potential for more complex, non-local interactions within the network’s computational structure.
The NCnet architecture is specifically engineered to test for and demonstrate non-classical correlations using the Clauser-Horne-Shimony-Holt (CHSH) statistic. This statistic, derived from Bell’s theorem, is bounded above by 2 for any system obeying local realism. Through their connectivity and training regime, NCnets have been shown to consistently produce CHSH values exceeding this bound (values of 2 + 0.004 have been reported), violating Bell’s inequality and indicating the presence of non-classical correlations within the network’s information processing. This quantifiable deviation from classical expectations serves as empirical evidence that the network exploits principles beyond localized, independent variable assignments.
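To make the test concrete, here is a minimal sketch of how a CHSH value is estimated from binary (±1) outcomes. The protocol NCnet uses to read "measurement settings" off network activations is not detailed here, so the random toy outcomes and the binarization are assumptions; the classical bound |S| ≤ 2 and the quantum Tsirelson bound 2√2 are standard.

```python
import numpy as np

def correlator(a: np.ndarray, b: np.ndarray) -> float:
    """Empirical correlator E[a*b] for arrays of +/-1 outcomes."""
    return float(np.mean(a * b))

def chsh_statistic(a0, a1, b0, b1) -> float:
    """S = E(a0,b0) + E(a0,b1) + E(a1,b0) - E(a1,b1).

    Any local hidden-variable model satisfies |S| <= 2; quantum
    systems can reach 2*sqrt(2) (the Tsirelson bound).
    """
    return (correlator(a0, b0) + correlator(a0, b1)
            + correlator(a1, b0) - correlator(a1, b1))

# Toy usage: binarized readouts from two network "subsystems",
# each observed under two hypothetical measurement settings.
rng = np.random.default_rng(0)
n = 10_000
a0, a1 = rng.choice([-1, 1], n), rng.choice([-1, 1], n)
b0, b1 = rng.choice([-1, 1], n), rng.choice([-1, 1], n)
print(chsh_statistic(a0, a1, b0, b1))  # ~0 for independent outcomes
```

With independent outcomes, S hovers near 0; the interesting regime reported in the paper is S pushed just past 2, which no assignment of pre-existing local values can reproduce.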
The demonstration of Non-Classical Correlations (NCC) in neural networks, quantified by CHSH statistics exceeding the classical limit of 2, implies that information is not solely processed through localized representations. Traditional neural network models assume that each neuron’s activation is determined by local inputs and weights, contributing to a predictable, classically correlated output. However, NCC suggests the existence of long-range dependencies and correlations that cannot be explained by local interactions alone. This indicates a potential for information processing mechanisms that leverage global network properties, potentially allowing for more efficient computation or the ability to represent and manipulate data in ways not achievable with strictly localized representations. The deviation from classical expectations points towards a more holistic information processing paradigm within these networks.

Multi-Task Learning: A Pragmatic Path to Generalization?
Multi-Task Learning (MTL) presents a potential route towards Artificial General Intelligence (AGI) by training a single model on multiple, distinct tasks concurrently. This approach contrasts with single-task learning, where models are optimized for a specific function in isolation. By simultaneously learning diverse skills, MTL encourages the development of shared representations and facilitates knowledge transfer between tasks. This shared learning process enhances the model’s ability to generalize to unseen scenarios and novel tasks, a crucial characteristic of general intelligence. The simultaneous optimization inherently requires the model to identify underlying commonalities and abstractions, resulting in more robust and adaptable performance compared to models trained in isolation.
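As a reference point for what "simultaneous optimization over shared parameters" looks like in practice, here is a minimal hard-parameter-sharing sketch in PyTorch. The architecture, layer sizes, and task heads are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class SharedTrunkMTL(nn.Module):
    """Hard parameter sharing: one trunk shared by all tasks, one head per task."""
    def __init__(self, in_dim: int, hidden: int, task_dims: list):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList([nn.Linear(hidden, d) for d in task_dims])

    def forward(self, x: torch.Tensor, task: int) -> torch.Tensor:
        # Every task's loss backpropagates through the same trunk weights,
        # which is where shared representations (and gradient competition) arise.
        return self.heads[task](self.trunk(x))

model = SharedTrunkMTL(in_dim=32, hidden=64, task_dims=[10, 2])
x = torch.randn(8, 32)
print(model(x, task=0).shape)  # torch.Size([8, 10])
```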
Gradient competition in multi-task learning arises when the gradients calculated for different tasks point in opposing directions, effectively canceling each other out during the weight update process. This phenomenon reduces the learning rate for shared parameters and can lead to suboptimal performance across all tasks. The severity of gradient competition is influenced by the degree of dissimilarity between tasks; highly divergent tasks are more likely to produce conflicting gradients. Mitigation strategies often involve weighting task-specific losses, employing gradient normalization techniques, or dynamically adjusting learning rates for individual tasks to prioritize those with weaker gradients or greater importance. Effectively addressing gradient competition is crucial for realizing the benefits of knowledge transfer and improved generalization in multi-task learning systems.
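The conflict described above can be measured directly as the cosine similarity between flattened task gradients, and mitigated with gradient surgery in the style of PCGrad (Yu et al., 2020). The paper does not prescribe a particular mitigation, so the following is a generic sketch.

```python
import torch
import torch.nn.functional as F

def grad_cosine(g1: torch.Tensor, g2: torch.Tensor) -> float:
    """Cosine similarity between two flattened task gradients;
    negative values signal gradient competition."""
    return F.cosine_similarity(g1, g2, dim=0).item()

def project_conflict(g1: torch.Tensor, g2: torch.Tensor) -> torch.Tensor:
    """PCGrad-style surgery: if g1 conflicts with g2, strip from g1
    its component along g2 so the update no longer opposes task 2."""
    dot = torch.dot(g1, g2)
    if dot < 0:
        g1 = g1 - (dot / g2.norm().pow(2)) * g2
    return g1

g_a = torch.tensor([1.0, 0.5, -0.2])   # gradient of task A's loss
g_b = torch.tensor([-0.8, 0.4, 0.1])   # gradient of task B's loss
print(grad_cosine(g_a, g_b))           # < 0: the tasks compete
print(project_conflict(g_a, g_b))      # conflicting component removed
```

Notably, the paper's thesis is that this competition is not merely a nuisance to be engineered away: it is precisely the mechanism from which the non-classical correlations emerge.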
Low-Rank Adaptation (LoRA) provides a parameter-efficient method for adapting large pre-trained models, such as Multilingual BERT (mBERT), to multiple downstream tasks. By freezing the pre-trained weights and introducing trainable low-rank matrices, LoRA sharply reduces the number of trainable parameters and the associated computational cost. Empirical results show a positive correlation between LoRA rank and the average gradient of the CHSH statistic, μ∇S, which grows from an initial value of 0.0208 as the rank increases. This suggests that higher-rank LoRA adaptations capture more complex, non-classical correlations within the model, potentially contributing to improved generalization across diverse tasks.
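For readers unfamiliar with the mechanics, a minimal LoRA-style layer looks roughly like this: the pretrained weight is frozen and a rank-r update B·A is trained in its place. The scaling convention (alpha/r) follows the original LoRA paper; the 768-dimensional example is only suggestive of an mBERT projection, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B (A x), with A in R^{r x d_in}, B in R^{d_out x r}.
    Only A and B are trained; the rank r controls adapter capacity."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # start as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(768, 768), r=8)   # e.g. one attention projection
print(layer(torch.randn(4, 768)).shape)        # torch.Size([4, 768])
```

Because B starts at zero, the adapter initially leaves the pretrained model's behavior untouched; raising r widens the subspace the adapter can explore, consistent with the reported growth of μ∇S at higher ranks.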
Empirical results demonstrate a positive correlation between testing accuracy and the Clauser-Horne-Shimony-Holt (CHSH) statistic S in multi-task learning models. The CHSH statistic, a measure of non-classical correlation, quantifies the degree to which observed correlations exceed those possible under local realistic theories; higher values indicate stronger non-classical correlations within the model’s learned representations. This suggests that a model’s ability to exhibit and leverage such correlations is directly related to its performance on downstream tasks, implying that richer, more complex internal representations, as captured by S, contribute to improved generalization.

Beyond Static Benchmarks: Evaluating Intelligence in a Changing World
Traditional evaluation of machine learning models commonly relies on static datasets and fixed evaluation protocols, which provide a limited view of performance in real-world applications. These methods typically assess a model’s accuracy on a pre-defined task with a specific data distribution, failing to account for the non-stationary environments and task variations encountered during deployment. Consequently, a model exhibiting high performance under static evaluation may demonstrate significant degradation when faced with distributional shifts, novel inputs, or evolving task requirements. This discrepancy arises because static benchmarks do not adequately measure a model’s ability to generalize beyond the training distribution or adapt to unforeseen circumstances, leading to an overestimation of its true capabilities in dynamic settings.
Dynamic Multi-Task Performance Evaluation assesses artificial intelligence by subjecting models to a sequence of tasks that vary in both type and difficulty throughout the evaluation period. This methodology moves beyond static benchmarks by requiring the model to continuously learn and adapt its existing knowledge to new, unseen challenges. The core principle involves measuring the rate and efficiency with which a model can transfer learned representations from previously encountered tasks to improve performance on subsequent tasks, quantifying its ability to generalize and avoid catastrophic forgetting. Evaluation metrics extend beyond simple accuracy to include measures of transfer efficiency, learning speed, and robustness to task distribution shifts, providing a more holistic understanding of the model’s adaptability in non-stationary environments.
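One way to operationalize such an evaluation is to record an accuracy matrix over training stages and derive transfer and forgetting scores from it, in the spirit of common continual-learning metrics (e.g., Lopez-Paz and Ranzato, 2017). The metric names and conventions below are illustrative, not the paper's.

```python
import numpy as np

def dynamic_mtl_report(acc: np.ndarray) -> dict:
    """Summarize a dynamic evaluation run.

    acc[i, j] = accuracy on task j after training stage i, where
    tasks arrive sequentially, one per stage.
    """
    T = acc.shape[0]
    final = acc[-1]                       # per-task accuracy at the end
    # Forgetting: best accuracy ever achieved minus final accuracy.
    forgetting = np.max(acc[:-1], axis=0)[: T - 1] - final[: T - 1]
    # Forward transfer: accuracy on task j just before training on it
    # (relative to a chance level assumed to be 0 here, for simplicity).
    fwt = np.array([acc[j - 1, j] for j in range(1, T)])
    return {
        "avg_final_accuracy": float(final.mean()),
        "avg_forgetting": float(forgetting.mean()),
        "avg_forward_transfer": float(fwt.mean()),
    }

# Toy 3-stage run over 3 sequentially introduced tasks:
acc = np.array([[0.90, 0.10, 0.05],
                [0.80, 0.85, 0.20],
                [0.75, 0.80, 0.88]])
print(dynamic_mtl_report(acc))
```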
Beyond overall accuracy, evaluating adaptive intelligence necessitates quantifying a model’s performance characteristics under distributional shift and novel conditions. Resilience is assessed by measuring the rate of performance degradation when encountering out-of-distribution data or adversarial examples, often utilizing metrics like the area under the performance degradation curve (PDC). Adaptability is determined by evaluating the speed and efficiency with which a model recovers from performance drops, potentially through techniques like few-shot learning or continual learning, and is often quantified by measuring the number of samples required to regain a pre-defined performance level. These metrics provide a more granular understanding of a model’s robustness and capacity to maintain functionality in unpredictable environments, moving beyond a single point estimate of accuracy.
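The area-under-the-PDC idea can be formalized in a few lines; the trapezoidal integration and the normalization below are one plausible convention, not a standard the text specifies.

```python
import numpy as np

def pdc_area(shift_levels: np.ndarray, accuracy: np.ndarray) -> float:
    """Normalized area under the performance degradation curve (PDC):
    accuracy measured at increasing levels of distribution shift,
    integrated with the trapezoidal rule and scaled so that a model
    with no degradation at 100% accuracy scores 1.0."""
    widths = np.diff(shift_levels)
    area = float(np.sum((accuracy[:-1] + accuracy[1:]) / 2 * widths))
    return area / float(shift_levels[-1] - shift_levels[0])

shift = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # e.g. corruption severity
acc   = np.array([0.92, 0.88, 0.79, 0.65, 0.50])
print(pdc_area(shift, acc))  # ~0.76; closer to 1.0 => more resilient
```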
The pursuit of elegant solutions in neural networks feels increasingly quixotic. This paper, detailing non-classical correlations emerging from seemingly classical systems, simply confirms the inevitable. It’s as if the network, tasked with multiple objectives (multi-task learning, they call it), deliberately manufactures complexity. One could almost suspect a mischievous intent. As John McCarthy observed, “Artificial intelligence is the science and engineering of making machines do things that require intelligence when done by people.” The irony, of course, is that these ‘intelligent’ machines achieve results by becoming spectacularly unpredictable, defying simple explanations. If a system consistently violates the CHSH inequality, at least it’s consistently interesting. It’s not intelligence; it’s just a very elaborate form of controlled chaos, leaving notes for digital archaeologists to decipher centuries from now.
What’s Next?
The observation that gradient competition within multi-task neural networks can mimic the statistical hallmarks of quantum measurement incompatibility is… neat. It will almost certainly become a fascinating anecdote in the history of machine learning, a beautiful abstraction destined to meet the cold reality of production systems. The CHSH inequality, a tool born of foundational physics, offers a novel lens, but translating this theoretical compatibility to practical diagnostic power remains a substantial hurdle. Every attempt to quantify ‘internal representation’ through such metrics will inevitably encounter the problem of scale; the elegance of the formalism will degrade as networks grow, and the noise floor will rise.
Future work will likely focus on refining the metrics used to detect these non-classical correlations, perhaps by incorporating information-theoretic measures beyond the basic CHSH framework. However, a more pressing question concerns the purpose of such correlations. Are they merely an artifact of the optimization process, a quirk of gradient descent? Or do they reflect a genuinely more powerful, albeit currently inaccessible, mode of computation? The answer, predictably, will likely be both, and neither.
Ultimately, this line of inquiry serves as a reminder that even the most sophisticated models are, at their core, just complex systems pushed to their limits. The pursuit of ‘quantum-inspired’ machine learning is a compelling one, but it should be undertaken with a healthy dose of skepticism. Everything deployable will eventually crash; the art lies in building systems that crash gracefully, and in recognizing the beauty of the failure itself.
Original article: https://arxiv.org/pdf/2603.04451.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/