Hunting for New Physics with AI

Author: Denis Avetisyan

Researchers are leveraging the power of foundation models to sift through high-energy particle collision data, uncovering potential signals beyond our current understanding.

The analysis assesses the model’s fidelity, specifically how well it aligns with observed anomalies, using a goodness-of-fit test restricted to regions flagged as unusual by both the small and large OmniLearned model anomaly scores, thereby isolating the areas where predictive accuracy matters most and is most likely to be flawed.

This study demonstrates the application of OmniLearned foundation models to anomaly detection in CMS Open Data, revealing hints of an excess that warrants further investigation.

Identifying subtle deviations from expected behavior remains a central challenge in high-energy physics data analysis. This is explored in ‘Searching for Anomalies with Foundation Models’, where we investigate the potential of foundation models, specifically OmniLearned, to enhance anomaly detection in data from the CMS experiment. Our analysis of a mass sideband reveals a statistically significant excess that, while well-defined in the signal region, is not fully accounted for by data-driven background estimation techniques validated in control regions. Do these events represent a systematic uncertainty, a novel background process, or a hint of physics beyond the Standard Model?

The Limits of Expectation: Why We Search for What We Already Know

Particle physics has historically sought new discoveries by searching for events that match theoretical predictions – pre-defined “signatures” of expected particles and interactions. However, this approach inherently limits the potential for groundbreaking findings, as it struggles to identify phenomena that fall outside established models. The reliance on known signatures creates a bias, effectively ‘blind-spotting’ experiments to truly novel physics – processes that don’t neatly fit within the Standard Model. This is particularly problematic given the increasing belief that the Standard Model is incomplete; undiscovered particles or forces may manifest in ways that are completely unanticipated, producing signals unlike anything researchers are actively looking for. Consequently, a shift towards more open-ended search strategies, capable of identifying any deviation from expected backgrounds without prior assumptions, is crucial for pushing the boundaries of particle physics and unlocking the secrets of the universe.

Modern particle physics experiments, such as those at the Large Hadron Collider, generate data at an unprecedented scale and complexity. Billions of proton-proton collisions occur every second, creating a cascade of particles that must be meticulously reconstructed and analyzed. This immense data volume is further complicated by intricate backgrounds: the expected, and often numerous, events that mimic potential new physics signals. Consequently, traditional signal identification techniques, reliant on pre-defined patterns, are increasingly inadequate. Identifying genuinely novel phenomena requires innovative approaches capable of sifting through this complexity, employing advanced machine learning algorithms and unsupervised techniques to detect subtle anomalies that would otherwise remain hidden within the overwhelming noise. These methods aim to move beyond searching for what is expected and instead focus on identifying what is unexpected, opening new avenues for discovery beyond the Standard Model.

The search for new physics at the Large Hadron Collider and similar experiments faces a significant hurdle: discerning incredibly faint signals from the dominant background of proton-proton collisions. These collisions produce a cascade of particles, creating a noisy environment where potential indicators of undiscovered phenomena can be easily obscured. Existing anomaly detection methods, often designed to identify specific, predicted signatures, struggle with events that deviate even slightly from expectations. The sheer volume of data (billions of collisions per second) compounds this issue, demanding sophisticated algorithms capable of filtering out statistical fluctuations and isolating truly unusual occurrences. Consequently, subtle anomalies, those hinting at physics beyond the Standard Model, may remain hidden, requiring innovative approaches that move beyond pre-defined search strategies and embrace more general, data-driven techniques to effectively navigate this complex landscape.

The distribution of dijet invariant mass, categorized by jet anomaly scores and subleading jet mass (above or below 100 GeV), reveals a potential di-Higgs signal component when compared to background-only fits, with shaded regions indicating total background uncertainty and highlighting the impact of $\tau_{21}$ selection criteria.

OmniLearned: When the Machine Learns to be Surprised

OmniLearned is a foundation model developed for anomaly detection in particle physics, trained on a heterogeneous dataset encompassing simulations and observations from various experiments. Unlike conventional analysis pipelines which require physicists to define specific signal signatures a priori, OmniLearned learns directly from the data distribution itself. This is achieved through an unsupervised learning approach, enabling the model to identify events that deviate significantly from the established background without prior knowledge of potential new physics phenomena. The training data includes, but is not limited to, simulated events from Standard Model processes and detector response simulations, allowing the model to learn complex relationships within the data and establish a robust representation of expected events.

The OmniLearned model employs an ‘Anomaly Score’ calculated based on the reconstruction probability of an event within the training data distribution; lower reconstruction probability corresponds to a higher anomaly score, indicating a greater deviation from the expected background. This score, ranging from 0 to 1, provides a quantitative measure of event unusualness without requiring pre-defined signal characteristics. By ranking events based on their anomaly scores, researchers can perform a data-driven search for new physics, focusing on the most statistically improbable events as potential indicators of previously unknown phenomena. The distribution of anomaly scores from background simulations is then used to estimate the statistical significance of any observed anomalies in real data, allowing for objective identification of potentially interesting events.
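As a rough illustration of the ranking step described above, the following Python sketch keeps the most anomalous fraction of events by score. The function name `select_most_anomalous`, the toy score distribution, and the choice of a 0.2% selection efficiency are assumptions made for this example; they are not taken from the OmniLearned code.

```python
import random

def select_most_anomalous(scores, efficiency):
    """Return (threshold, indices) keeping the top `efficiency` fraction of events."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    n_keep = max(1, round(efficiency * len(scores)))
    # Rank event indices from most to least anomalous.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = ranked[:n_keep]
    threshold = scores[selected[-1]]  # lowest score that still passes
    return threshold, selected

# Toy scores in [0, 1]; a real analysis would use the model's per-event output.
rng = random.Random(0)
toy_scores = [rng.betavariate(2, 8) for _ in range(10_000)]
threshold, selected = select_most_anomalous(toy_scores, efficiency=0.002)
print(f"kept {len(selected)} events with score >= {threshold:.3f}")
```

Fixing the selection by efficiency rather than by a hard-coded score value makes the cut comparable between models whose score distributions differ.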

Traditional anomaly detection in particle physics relies on pre-defined signatures – specific patterns expected from known physics or theoretical models. This methodology necessitates a comprehensive understanding of potential signals a priori, limiting discovery potential to those scenarios already anticipated. OmniLearned circumvents this limitation by learning directly from data without requiring explicit signatures. By quantifying deviations from the expected background via an Anomaly Score, the model can identify events that do not conform to known processes, irrespective of whether a corresponding theoretical framework exists. This data-driven approach allows for the detection of genuinely novel phenomena, potentially revealing physics beyond the Standard Model that would be overlooked by signature-based methods.

The fraction of anomalies identified by both OmniLearned and X(bb) increases with leading jet soft drop mass, mirroring the overall distribution of anomalies detected by OmniLearned.

Validating the Unexpected: Rediscovering the Familiar

The top quark, a fundamental particle within the Standard Model of particle physics, was successfully identified using the OmniLearned anomaly detection model. This rediscovery was achieved through analysis of publicly available data from the CMS experiment, specifically the Aspen Open Jets Dataset. Statistical significance was evaluated using both a p-value, which registered below 0.01, and asymptotic significance, exceeding a value of 10. These results demonstrate the model’s capability to correctly identify established physics signals as anomalous events within the dataset.
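To relate the two quantities quoted above, a p-value can be converted into a Gaussian significance and back. The sketch below assumes the common one-sided convention, which the article does not state explicitly; the function names are illustrative.

```python
from math import erfc, sqrt
from statistics import NormalDist

def p_value_from_z(z):
    """One-sided Gaussian tail probability for a significance of z sigma."""
    return 0.5 * erfc(z / sqrt(2.0))

def z_from_p_value(p):
    """Significance Z such that the one-sided tail probability equals p."""
    # Use the lower tail and flip the sign to avoid computing 1 - p for tiny p.
    return -NormalDist().inv_cdf(p)

for z in (2.0, 3.0, 5.0, 10.0):
    p = p_value_from_z(z)
    print(f"Z = {z:5.2f}  ->  p = {p:.3e}  ->  Z = {z_from_p_value(p):.2f}")
```

Under this convention, a significance above 10 corresponds to a p-value many orders of magnitude below 0.01, so the two statements in the text are compatible.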

The successful identification of the top quark signal within the CMS Open Data represents a key validation of the OmniLearned model’s anomaly detection capabilities. By correctly characterizing a known particle as anomalous – specifically, a deviation from expected background noise – the model demonstrates its potential to uncover genuinely novel physics. This exercise establishes a performance benchmark, defining the model’s sensitivity and selectivity for future searches beyond the Standard Model, where the signals are, by definition, unknown and require robust anomaly detection techniques for discovery.

The validation of OmniLearned utilized the ‘Aspen Open Jets Dataset’ from the CMS experiment, representing 16.39 fb⁻¹ of proton-proton collision data recorded in 2016. This dataset, made publicly available by CMS, consists of events containing jets – sprays of particles produced from the collision – and served as the foundation for assessing OmniLearned’s capability to identify established particle signatures. The integrated luminosity of 16.39 fb⁻¹ signifies the total amount of data collected, a key metric for statistical significance in particle physics analyses, and enabled a rigorous evaluation of the model’s performance and robustness in a real-world experimental context.
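To make the role of integrated luminosity concrete: the expected number of events for a process is its cross-section times the integrated luminosity times the selection efficiency. The short Python sketch below applies this to the 16.39 fb⁻¹ quoted above; the 1 pb cross-section and 10% efficiency are hypothetical values chosen only for illustration.

```python
FB_PER_PB = 1000.0  # unit conversion: 1 pb = 1000 fb

def expected_events(cross_section_pb, luminosity_fb_inv, efficiency=1.0):
    """Expected event count N = cross-section * integrated luminosity * efficiency."""
    return cross_section_pb * FB_PER_PB * luminosity_fb_inv * efficiency

LUMI = 16.39  # fb^-1, the integrated luminosity quoted in the text
# Hypothetical process: 1 pb cross-section, 10% selection efficiency.
print(expected_events(1.0, LUMI, efficiency=0.10))
```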

A comparison of sideband fits to groomed jet mass distributions reveals that the smaller OmniLearned model provides a significantly better approximation of the underlying data than the larger model.

The Art of Control: Mastering the Noise

In particle physics, discerning genuine signals from the overwhelming noise of background events is a fundamental challenge. Consequently, accurate background estimation is not merely important, but absolutely paramount for reliable results. Researchers increasingly utilize ‘Data-Driven Background Estimation’ techniques to circumvent the limitations inherent in relying solely on complex simulations, which are often subject to theoretical uncertainties and computational demands. These methods directly leverage the observed data itself to model the background, effectively creating a control sample that mirrors the characteristics of unwanted events. By minimizing dependence on simulated data, physicists can enhance the precision of measurements, improve the sensitivity of searches for rare phenomena, and ultimately gain a more trustworthy understanding of the universe’s fundamental building blocks.

The precision of particle physics experiments hinges on accurately characterizing the expected background noise, and the ABCD method provides a powerful, data-driven approach to achieve this. Rather than relying heavily on complex simulations, which are prone to modeling uncertainties, this technique constructs control regions within the observed data itself to estimate the background contribution in the signal region. In its standard form, two approximately independent selection variables divide the data into four regions, and the background in the signal region is inferred from the yields in the other three. By defining sidebands based on specific event characteristics, researchers can extrapolate background rates with reduced systematic errors. This direct measurement of background, facilitated by carefully chosen variables and statistical analysis, significantly enhances the sensitivity of searches for new physics and rare phenomena, allowing for the detection of subtle anomalies that might otherwise be obscured by statistical fluctuations.
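A minimal sketch of the textbook ABCD estimate follows. It assumes two approximately independent selection variables, with region A passing both cuts (the signal region), B and C each passing one, and D failing both. The region labels, the example counts, and the simple Poisson error propagation are assumptions made for illustration, not the configuration used in the study.

```python
from math import sqrt

def abcd_prediction(n_b, n_c, n_d):
    """Predict the background in signal region A from control regions B, C, D."""
    if min(n_b, n_c, n_d) <= 0:
        raise ValueError("control-region counts must be positive")
    # If the two variables are independent, N_A / N_B = N_C / N_D.
    prediction = n_b * n_c / n_d
    # Relative Poisson uncertainties of the three counts added in quadrature.
    rel_unc = sqrt(1.0 / n_b + 1.0 / n_c + 1.0 / n_d)
    return prediction, prediction * rel_unc

pred, unc = abcd_prediction(n_b=400, n_c=250, n_d=10_000)
print(f"predicted background in A: {pred:.1f} +/- {unc:.1f}")
```

The estimate is only as good as the independence assumption, which is why such predictions are typically checked in simulation or in validation regions before being applied to the signal region.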

Particle Flow (PF) reconstruction is the technique used to identify and measure the individual particles produced in high-energy collisions. Rather than relying on any single subdetector, it combines information from the various detector elements to reconstruct a complete picture of the event, yielding a list of particle candidates: photons, electrons, muons, and charged and neutral hadrons. These candidates are then clustered into jets using algorithms such as anti-kT, as implemented in the FastJet framework. By accurately identifying each particle’s contribution, PF sharpens the separation between potential signals (evidence of new physics) and the overwhelming background, boosting the sensitivity of anomaly searches and enabling more precise measurements of known processes.
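The anti-kT algorithm mentioned above decides what to merge next using a distance measure that favors clustering around the hardest particles. The Python sketch below evaluates that measure for a few toy candidates given as (pT, rapidity, azimuth) tuples. The radius parameter R = 0.8 and the toy kinematics are assumptions; this is not a replacement for FastJet, which performs the full iterative clustering.

```python
from math import pi

def delta_r_squared(y1, phi1, y2, phi2):
    """Squared angular distance in the rapidity-azimuth plane."""
    dphi = abs(phi1 - phi2) % (2.0 * pi)
    if dphi > pi:
        dphi = 2.0 * pi - dphi  # wrap the azimuthal difference into [0, pi]
    return (y1 - y2) ** 2 + dphi ** 2

def antikt_pair_distance(a, b, radius):
    """Anti-kT distance d_ij between two (pt, rapidity, phi) candidates."""
    (pt_a, y_a, phi_a), (pt_b, y_b, phi_b) = a, b
    dr2 = delta_r_squared(y_a, phi_a, y_b, phi_b)
    return min(pt_a ** -2, pt_b ** -2) * dr2 / radius ** 2

def antikt_beam_distance(a):
    """Anti-kT beam distance d_iB for a (pt, rapidity, phi) candidate."""
    return a[0] ** -2

R = 0.8  # assumed radius parameter, chosen for illustration
hard = (100.0, 0.0, 0.0)   # one hard candidate (pt in GeV)
soft_1 = (1.0, 0.4, 0.0)   # two soft candidates near it
soft_2 = (1.0, 0.0, 0.4)

print("hard-soft d_ij:", antikt_pair_distance(hard, soft_1, R))
print("soft-soft d_ij:", antikt_pair_distance(soft_1, soft_2, R))
print("hard d_iB:     ", antikt_beam_distance(hard))
```

Because the measure uses inverse powers of pT, the hard-soft distance is much smaller than the soft-soft one, so soft particles attach to the hard core before they cluster among themselves.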

A compatibility test demonstrates that the ABCD prediction, derived from QCD simulated events and validated against a $0.2\%$ data efficiency selection using a large OmniLearned model, aligns with the observed soft drop mass distribution of QCD events (shown in blue) via a constant fit.
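One common way to carry out a compatibility check of this kind is to fit the bin-by-bin ratio of observed to predicted yields with a constant and inspect the resulting chi-square. The sketch below assumes that procedure and uses made-up ratios and uncertainties; the article does not specify the exact inputs or the fitting tool.

```python
def constant_fit(values, errors):
    """Weighted least-squares fit of a constant; returns (c, c_err, chi2, ndf)."""
    weights = [1.0 / e ** 2 for e in errors]
    total_weight = sum(weights)
    # The best-fit constant is the inverse-variance weighted mean.
    c = sum(w * v for w, v in zip(weights, values)) / total_weight
    c_err = (1.0 / total_weight) ** 0.5
    chi2 = sum(((v - c) / e) ** 2 for v, e in zip(values, errors))
    ndf = len(values) - 1  # one fitted parameter
    return c, c_err, chi2, ndf

# Hypothetical observed/predicted ratios per mass bin, with uncertainties.
ratios = [1.05, 0.92, 1.10, 0.98, 1.01]
errors = [0.10, 0.08, 0.12, 0.09, 0.10]
c, c_err, chi2, ndf = constant_fit(ratios, errors)
print(f"c = {c:.3f} +/- {c_err:.3f}, chi2/ndf = {chi2:.2f}/{ndf}")
```

A fitted constant close to one with a chi-square near the number of degrees of freedom would indicate that the prediction describes both the normalization and the shape of the distribution.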

Beyond the Horizon: The Future of Discovery

The potential of OmniLearned extends far beyond the initial di-Higgs search, offering a versatile framework for exploring a wide spectrum of new physics. Researchers anticipate deploying this technology to investigate elusive phenomena such as dark matter, where traditional search methods have yielded limited results, and supersymmetry, a theoretical framework predicting partner particles for each known Standard Model particle. By adapting OmniLearned’s anomaly detection capabilities, the search can encompass diverse signatures and theoretical models beyond those currently considered, potentially revealing subtle deviations from established physics. This broader application promises to accelerate discovery by efficiently sifting through complex datasets and identifying unexpected signals indicative of previously unknown particles or interactions, ultimately expanding the frontiers of particle physics.

Accurate identification of novel physics hinges on the ability to precisely model the intricacies of particle collider experiments. A particularly challenging effect, known as ‘Pileup’, arises from multiple proton-proton collisions occurring within the same bunch crossing, creating additional, often overlapping, signatures that can mimic or obscure potential new signals. Enhancing the modeling of these complex experimental effects is therefore critical; more sophisticated algorithms and simulations are needed to disentangle genuine anomalies from background noise. Improvements in this area will directly translate to increased sensitivity in anomaly detection, allowing researchers to probe deeper into the data and potentially reveal evidence of physics beyond the Standard Model with greater confidence.

A preference for a di-Higgs signal emerged from the analysis, with a statistical significance of 3.92 standard deviations – a potential hint of new physics beyond the Standard Model. This observation underscores the importance of open access to experimental data, as demonstrated by the utilization of publicly available ‘CMS Open Data’. To further investigate and validate this intriguing result, researchers are employing sophisticated simulation tools, including ‘Geant4’ for detailed detector modeling, and event generators like ‘Madgraph5_aMC@NLO’, ‘POWHEG-BoxV2’, and ‘Pythia8’ to accurately predict particle interactions. This collaborative approach, facilitated by shared resources and advanced computational techniques, promises to accelerate the discovery process and unlock a deeper understanding of fundamental particles and forces.

The distribution of leading jet soft drop mass, calculated for anomalous jets identified by a large model, reveals a clear separation between events passing $\tau_{21}$ selection (right) and those failing it (left), with shaded regions indicating total background uncertainty.

The pursuit of anomalies, as demonstrated in this study of CMS Open Data, inevitably reveals the limitations of established frameworks. Everyone calls models rational until they fail to predict the unexpected. This research, utilizing foundation models like OmniLearned to sift through petabytes of data, isn’t solely about discovering new physics; it’s about acknowledging the inherent unpredictability of complex systems. Marie Curie famously stated, “Nothing in life is to be feared, it is only to be understood.” This sentiment echoes the core of this investigation: a willingness to confront data that deviates from the Standard Model, not with alarm, but with a methodical drive to decipher the underlying narrative, even if that narrative challenges existing assumptions about how the universe behaves. The observed hints of excess may not be simply statistical fluctuations; they are invitations to explore the boundaries of current knowledge, fueled by the same curiosity that drove Curie’s pioneering work.

What Lies Beyond?

The search for anomalies, as this work demonstrates, isn’t about finding new physics so much as it is about building increasingly elaborate systems to categorize the expected. Every hypothesis is an attempt to make uncertainty feel safe. The application of foundation models (tools designed to mimic human pattern recognition) to particle physics data feels less like a revolution in discovery and more like a formalization of the human tendency to see what it anticipates. The hints of excess observed are, predictably, not conclusive; they are statistical whispers that demand further, inevitably more complex, filtering.

The true limitation isn’t statistical power, but the inherent difficulty of defining “normal.” Data-driven background estimation, for all its sophistication, relies on the assumption that the past accurately predicts the future. Inflation, in this context, is just collective anxiety about the unexpected. The next step isn’t necessarily larger datasets or more powerful algorithms, but a deeper consideration of the biases baked into the very process of observation: the comfortable narratives that prevent the truly novel from registering as signal.

Perhaps the most fruitful avenue lies not in chasing deviations from the Standard Model, but in modeling the modelers. Understanding the cognitive heuristics that shape experimental design and data analysis might reveal more about the “anomalies” than any collision event ever could. After all, every excess is merely a question of where one draws the line between noise and meaning, and that line is always, fundamentally, subjective.

Original article: https://arxiv.org/pdf/2603.23593.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/