Author: Denis Avetisyan
Researchers are leveraging the power of foundation models to sift through high-energy particle collision data, uncovering potential signals beyond our current understanding.

This study demonstrates the application of OmniLearned foundation models to anomaly detection in CMS Open Data, revealing hints of an excess that warrants further investigation.
Identifying subtle deviations from expected behavior remains a central challenge in high-energy physics data analysis. This is explored in ‘Searching for Anomalies with Foundation Models’, where we investigate the potential of foundation models, specifically OmniLearned, to enhance anomaly detection in data from the CMS experiment. Our analysis of a mass sideband reveals a statistically significant excess that, while well-defined in the signal region, is not fully accounted for by data-driven background estimation techniques validated in control regions. Do these events represent a systematic uncertainty, a novel background process, or a hint of physics beyond the Standard Model?
The Limits of Expectation: Why We Search for What We Already Know
Particle physics has historically sought new discoveries by searching for events that match theoretical predictions – pre-defined “signatures” of expected particles and interactions. However, this approach inherently limits the potential for groundbreaking findings, as it struggles to identify phenomena that fall outside established models. The reliance on known signatures creates a bias, effectively “blind-spotting” experiments to truly novel physics – processes that don’t neatly fit within the Standard Model. This is particularly problematic given the increasing belief that the Standard Model is incomplete; undiscovered particles or forces may manifest in ways that are completely unanticipated, producing signals unlike anything researchers are actively looking for. Consequently, a shift towards more open-ended search strategies, capable of identifying any deviation from expected backgrounds without prior assumptions, is crucial for pushing the boundaries of particle physics and unlocking the secrets of the universe.
Modern particle physics experiments, such as those at the Large Hadron Collider, generate data at an unprecedented scale and complexity. Billions of proton-proton collisions occur every second, creating a cascade of particles that must be meticulously reconstructed and analyzed. This immense data volume is further complicated by intricate backgrounds: the expected, but often numerous, events that mimic potential new physics signals. Consequently, traditional signal identification techniques, reliant on pre-defined patterns, are increasingly inadequate. Identifying genuinely novel phenomena requires innovative approaches capable of sifting through this complexity, employing advanced machine learning algorithms and unsupervised techniques to detect subtle anomalies that would otherwise remain hidden within the overwhelming noise. These methods aim to move beyond searching for what is expected and instead focus on identifying what is unexpected, opening new avenues for discovery beyond the Standard Model.
The search for new physics at the Large Hadron Collider and similar experiments faces a significant hurdle: discerning incredibly faint signals from the dominant background of proton-proton collisions. These collisions produce a cascade of particles, creating a noisy environment where potential indicators of undiscovered phenomena can be easily obscured. Existing anomaly detection methods, often designed to identify specific, predicted signatures, struggle with events that deviate even slightly from expectations. The sheer volume of data, billions of collisions per second, compounds this issue, demanding sophisticated algorithms capable of filtering out statistical fluctuations and isolating truly unusual occurrences. Consequently, subtle anomalies, those hinting at physics beyond the Standard Model, may remain hidden, requiring innovative approaches that move beyond pre-defined search strategies and embrace more general, data-driven techniques to effectively navigate this complex landscape.

OmniLearned: When the Machine Learns to be Surprised
OmniLearned is a foundation model developed for anomaly detection in particle physics, trained on a heterogeneous dataset encompassing simulations and observations from various experiments. Unlike conventional analysis pipelines which require physicists to define specific signal signatures a priori, OmniLearned learns directly from the data distribution itself. This is achieved through an unsupervised learning approach, enabling the model to identify events that deviate significantly from the established background without prior knowledge of potential new physics phenomena. The training data includes, but is not limited to, simulated events from Standard Model processes and detector response simulations, allowing the model to learn complex relationships within the data and establish a robust representation of expected events.
The OmniLearned model employs an ‘Anomaly Score’ calculated based on the reconstruction probability of an event within the training data distribution; lower reconstruction probability corresponds to a higher anomaly score, indicating a greater deviation from the expected background. This score, ranging from 0 to 1, provides a quantitative measure of event unusualness without requiring pre-defined signal characteristics. By ranking events based on their anomaly scores, researchers can perform a data-driven search for new physics, focusing on the most statistically improbable events as potential indicators of previously unknown phenomena. The distribution of anomaly scores from background simulations is then used to estimate the statistical significance of any observed anomalies in real data, allowing for objective identification of potentially interesting events.
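The scoring and ranking described above can be sketched in a few lines. This is only an illustration under stated assumptions: the mapping `score = 1 - probability` is a guess consistent with "lower probability, higher score" and the 0 to 1 range, and the function names are hypothetical rather than part of OmniLearned.

```python
import numpy as np

def anomaly_score(reco_prob):
    """Map reconstruction probabilities in [0, 1] to anomaly scores in [0, 1].

    Assumption: score = 1 - probability. The article only states that a
    lower probability corresponds to a higher score.
    """
    prob = np.clip(np.asarray(reco_prob, dtype=float), 0.0, 1.0)
    return 1.0 - prob

def rank_most_anomalous(scores, k):
    """Return the indices of the k highest-scoring events, most anomalous first."""
    order = np.argsort(np.asarray(scores))[::-1]
    return order[:k]

def empirical_p_value(observed_score, background_scores):
    """Fraction of background events scoring at least as high as the observed one.

    The +1 terms keep the estimate away from exactly zero when no
    background event reaches the observed score.
    """
    bkg = np.asarray(background_scores, dtype=float)
    n_at_least = np.count_nonzero(bkg >= observed_score)
    return (n_at_least + 1) / (bkg.size + 1)

# Toy inputs, invented for illustration only.
toy_scores = anomaly_score([0.9, 0.1, 0.5])
top_two = rank_most_anomalous(toy_scores, 2)
toy_p = empirical_p_value(0.95, [0.1, 0.2, 0.3, 0.96])
```

An empirical p-value of this kind is limited by the size of the background sample, so very small p-values need either a very large background sample or a fitted model of the score tail.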
Traditional anomaly detection in particle physics relies on pre-defined signatures – specific patterns expected from known physics or theoretical models. This methodology necessitates a comprehensive understanding of potential signals a priori, limiting discovery potential to those scenarios already anticipated. OmniLearned circumvents this limitation by learning directly from data without requiring explicit signatures. By quantifying deviations from the expected background via an Anomaly Score, the model can identify events that do not conform to known processes, irrespective of whether a corresponding theoretical framework exists. This data-driven approach allows for the detection of genuinely novel phenomena, potentially revealing physics beyond the Standard Model that would be overlooked by signature-based methods.

Validating the Unexpected: Rediscovering the Familiar
The top quark, a fundamental particle within the Standard Model of particle physics, was successfully identified using the OmniLearned anomaly detection model. This rediscovery was achieved through analysis of publicly available data from the CMS experiment, specifically the Aspen Open Jets Dataset. Statistical significance was evaluated using both a p-value, which registered below 0.01, and asymptotic significance, exceeding a value of 10. These results demonstrate the model’s capability to correctly identify established physics signals as anomalous events within the dataset.
The successful identification of the top quark signal within the CMS Open Data represents a key validation of the OmniLearned model’s anomaly detection capabilities. By correctly characterizing a known particle as anomalous – specifically, a deviation from expected background noise – the model demonstrates its potential to uncover genuinely novel physics. This exercise establishes a performance benchmark, defining the model’s sensitivity and selectivity for future searches beyond the Standard Model, where the signals are, by definition, unknown and require robust anomaly detection techniques for discovery.
The validation of OmniLearned utilized the “Aspen Open Jets Dataset” from the CMS experiment, representing 16.39 fb⁻¹ of proton-proton collision data recorded in 2016. This dataset, made publicly available by CMS, consists of events containing jets – sprays of particles produced from the collision – and served as the foundation for assessing OmniLearned’s capability to identify established particle signatures. The integrated luminosity of 16.39 fb⁻¹ signifies the total amount of data collected, a key metric for statistical significance in particle physics analyses, and enabled a rigorous evaluation of the model’s performance and robustness in a real-world experimental context.
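A p-value and a significance are two ways of expressing the same tail probability. The sketch below converts between them using only the Python standard library. It assumes the one-sided Gaussian convention common in high-energy physics, which the text does not state explicitly, and the function names are illustrative.

```python
import math
from statistics import NormalDist

def p_value_from_z(z):
    """One-sided Gaussian tail probability for a significance of z standard deviations."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def z_from_p_value(p):
    """Significance for a one-sided p-value with 0 < p < 1.

    Uses the lower tail, Z = -Phi^{-1}(p), which stays accurate for very
    small p where computing 1 - p would round to exactly 1.
    """
    return -NormalDist().inv_cdf(p)

z_for_one_percent = z_from_p_value(0.01)  # roughly 2.33 standard deviations
p_for_ten_sigma = p_value_from_z(10.0)    # many orders of magnitude below 0.01
```

Under this convention, a significance above 10 implies a p-value far smaller than the 0.01 threshold quoted above, so the second figure is the much stronger statement.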
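To see why integrated luminosity sets the statistical reach, recall that the expected number of events for a process is its cross section times the integrated luminosity, times any selection efficiency. The sketch below applies that relation to the 16.39 fb⁻¹ quoted above. The 1 pb cross section is a hypothetical placeholder, not a number from the study.

```python
PB_INV_PER_FB_INV = 1000.0  # unit conversion: 1 fb^-1 = 1000 pb^-1

def expected_events(cross_section_pb, luminosity_fb_inv, efficiency=1.0):
    """Expected event count N = sigma * L * efficiency.

    cross_section_pb: process cross section in picobarns
    luminosity_fb_inv: integrated luminosity in inverse femtobarns
    efficiency: fraction of events passing the selection, between 0 and 1
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    return cross_section_pb * luminosity_fb_inv * PB_INV_PER_FB_INV * efficiency

ASPEN_LUMINOSITY_FB_INV = 16.39  # value quoted in the article

# Hypothetical 1 pb process with perfect selection efficiency.
n_expected = expected_events(1.0, ASPEN_LUMINOSITY_FB_INV)
```

With these placeholder inputs, a 1 pb process would yield about 16,390 events before any selection.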

The Art of Control: Mastering the Noise
In particle physics, discerning genuine signals from the overwhelming noise of background events is a fundamental challenge. Consequently, accurate background estimation is not merely important, but absolutely paramount for reliable results. Researchers increasingly utilize “Data-Driven Background Estimation” techniques to circumvent the limitations inherent in relying solely on complex simulations, which are often subject to theoretical uncertainties and computational demands. These methods directly leverage the observed data itself to model the background, effectively creating a control sample that mirrors the characteristics of unwanted events. By minimizing dependence on simulated data, physicists can enhance the precision of measurements, improve the sensitivity of searches for rare phenomena, and ultimately gain a more trustworthy understanding of the universe’s fundamental building blocks.
The precision of particle physics experiments hinges on accurately characterizing the expected background, and the ABCD method provides a data-driven approach to achieve this. Rather than relying heavily on complex simulations, which are prone to modeling uncertainties, this technique constructs control regions within the observed data itself to estimate the background contribution in the signal region. Two approximately independent event variables divide the data into four regions, one signal region and three sidebands, so that the background rate in the signal region can be extrapolated from the sidebands with reduced systematic errors. This direct measurement of background, facilitated by carefully chosen variables and statistical analysis, enhances the sensitivity of searches for new physics and rare phenomena, allowing for the detection of subtle anomalies that might otherwise be obscured by statistical fluctuations.
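The arithmetic behind the ABCD method is short enough to show directly. If the two variables that define the regions are independent for background, the ratio of counts across one variable is the same on both sides of the other, which gives N_A = N_B × N_C / N_D. The sketch below assumes that A is the signal region and that B and C each share one selection with it. Region labels differ between analyses, and the counts are invented for illustration.

```python
import math

def abcd_estimate(n_b, n_c, n_d):
    """Predict the background yield in signal region A from control regions B, C, D.

    Assumes the two region-defining variables are independent for
    background, so that N_A / N_B = N_C / N_D.
    Returns (estimate, statistical uncertainty), where the uncertainty
    propagates Poisson errors on the three control-region counts.
    """
    if min(n_b, n_c, n_d) <= 0:
        raise ValueError("all control-region counts must be positive")
    estimate = n_b * n_c / n_d
    relative_unc = math.sqrt(1.0 / n_b + 1.0 / n_c + 1.0 / n_d)
    return estimate, estimate * relative_unc

# Invented control-region counts.
predicted, uncertainty = abcd_estimate(n_b=200, n_c=300, n_d=600)
```

An observed count in region A well above the prediction, relative to this uncertainty, is what an excess looks like in such an analysis. Any residual correlation between the two variables biases the prediction, which is why closure is normally checked in validation regions.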
Particle Flow (PF) reconstruction is an approach to identifying and measuring the energy of individual particles produced in high-energy collisions. Rather than relying on any single subdetector, it combines information from multiple detector elements to reconstruct a complete picture of the event, yielding a list of particle candidates: photons, electrons, muons, and hadrons. These candidates are then clustered into jets with algorithms such as the anti-kT algorithm, as implemented in the FastJet framework. By accurately identifying each particle’s contribution, PF enhances the separation between potential signals (evidence of new physics) and the overwhelming background noise, ultimately boosting the sensitivity of anomaly searches and enabling more precise measurements of known processes.
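For readers curious how anti-kT clustering behaves, here is a deliberately naive pure-Python sketch. It is not FastJet and makes no attempt at its speed. It implements the standard distance measures d_ij = min(1/pT_i², 1/pT_j²) ΔR²/R² and d_iB = 1/pT_i², and merges by adding four-momenta. The radius of 0.4 and the toy particles are assumptions for illustration, not values taken from the study.

```python
import math

def from_pt_eta_phi(pt, eta, phi):
    """Build a massless four-momentum (px, py, pz, E) from pt, eta, phi."""
    return (pt * math.cos(phi), pt * math.sin(phi),
            pt * math.sinh(eta), pt * math.cosh(eta))

def pt2(p):
    """Squared transverse momentum."""
    return p[0] ** 2 + p[1] ** 2

def rapidity(p):
    """Rapidity; assumes the momentum is not exactly along the beam axis."""
    return 0.5 * math.log((p[3] + p[2]) / (p[3] - p[2]))

def delta_r2(a, b):
    """Squared angular distance in the rapidity-azimuth plane."""
    dphi = abs(math.atan2(a[1], a[0]) - math.atan2(b[1], b[0]))
    if dphi > math.pi:
        dphi = 2.0 * math.pi - dphi
    return (rapidity(a) - rapidity(b)) ** 2 + dphi ** 2

def anti_kt(particles, radius=0.4):
    """Cluster four-momenta into jets, returned in order of decreasing pt."""
    objs = [tuple(p) for p in particles]
    jets = []
    while objs:
        best = (float("inf"), None, None)  # (distance, i, j); j is None for the beam
        for i, p_i in enumerate(objs):
            d_beam = 1.0 / pt2(p_i)
            if d_beam < best[0]:
                best = (d_beam, i, None)
            for j in range(i + 1, len(objs)):
                p_j = objs[j]
                d_ij = (min(1.0 / pt2(p_i), 1.0 / pt2(p_j))
                        * delta_r2(p_i, p_j) / radius ** 2)
                if d_ij < best[0]:
                    best = (d_ij, i, j)
        _, i, j = best
        if j is None:
            jets.append(objs.pop(i))  # beam distance is smallest: finalize as a jet
        else:
            merged = tuple(a + b for a, b in zip(objs[i], objs[j]))
            objs.pop(j)               # j > i, so remove j first to keep i valid
            objs.pop(i)
            objs.append(merged)
    return sorted(jets, key=pt2, reverse=True)

# Toy event: two nearby particles and one on the opposite side of the detector.
toy_particles = [
    from_pt_eta_phi(100.0, 0.0, 0.0),
    from_pt_eta_phi(20.0, 0.1, 0.1),
    from_pt_eta_phi(50.0, 0.0, 3.0),
]
toy_jets = anti_kt(toy_particles)
```

In this toy event the two nearby particles merge into one jet and the third forms its own jet. Because the distance measure is weighted by inverse pT², hard particles absorb their soft neighbours first, which is why anti-kT jets tend to be roughly cone-shaped around the hardest particle.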

Beyond the Horizon: The Future of Discovery
The potential of OmniLearned extends far beyond the initial di-Higgs search, offering a versatile framework for exploring a wide spectrum of new physics. Researchers anticipate deploying this technology to investigate elusive phenomena such as dark matter, where traditional search methods have yielded limited results, and supersymmetry, a theoretical framework predicting partner particles for each known Standard Model particle. By adapting OmniLearned’s anomaly detection capabilities, the search can encompass diverse signatures and theoretical models beyond those currently considered, potentially revealing subtle deviations from established physics. This broader application promises to accelerate discovery by efficiently sifting through complex datasets and identifying unexpected signals indicative of previously unknown particles or interactions, ultimately expanding the frontiers of particle physics.
Accurate identification of novel physics hinges on the ability to precisely model the intricacies of particle collider experiments. A particularly challenging effect, known as “Pileup”, arises from multiple proton-proton collisions occurring within the same bunch crossing, creating additional, often overlapping, signatures that can mimic or obscure potential new signals. Enhancing the modeling of these complex experimental effects is therefore critical; more sophisticated algorithms and simulations are needed to disentangle genuine anomalies from background noise. Improvements in this area will directly translate to increased sensitivity in anomaly detection, allowing researchers to probe deeper into the data and potentially reveal evidence of physics beyond the Standard Model with greater confidence.
A compelling preference for a di-Higgs signal emerged from recent analyses, registering a statistical significance of 3.92 – a potential hint of new physics beyond the Standard Model. This observation underscores the importance of open access to experimental data, as demonstrated by the utilization of publicly available “CMS Open Data”. To further investigate and validate this intriguing result, researchers are employing sophisticated simulation tools, including “Geant4” for detailed detector modeling, and event generators like “Madgraph5_aMC@NLO”, “POWHEG-BoxV2”, and “Pythia8” to accurately predict particle interactions. This collaborative approach, facilitated by shared resources and advanced computational techniques, promises to accelerate the discovery process and unlock a deeper understanding of fundamental particles and forces.

The pursuit of anomalies, as demonstrated in this study of CMS Open Data, inevitably reveals the limitations of established frameworks. Everyone calls models rational until they fail to predict the unexpected. This research, utilizing foundation models like OmniLearned to sift through petabytes of data, isn’t solely about discovering new physics; it’s about acknowledging the inherent unpredictability of complex systems. Marie Curie famously stated, “Nothing in life is to be feared, it is only to be understood.” This sentiment echoes the core of this investigation: a willingness to confront data that deviates from the Standard Model, not with alarm, but with a methodical drive to decipher the underlying narrative, even if that narrative challenges existing assumptions about how the universe behaves. Whether or not the observed hints of excess turn out to be statistical fluctuations, they are invitations to explore the boundaries of current knowledge, fueled by the same curiosity that drove Curie’s pioneering work.
What Lies Beyond?
The search for anomalies, as this work demonstrates, isn’t about finding new physics so much as it is about building increasingly elaborate systems to categorize the expected. Every hypothesis is an attempt to make uncertainty feel safe. The application of foundation models, tools designed to mimic human pattern recognition, to particle physics data feels less like a revolution in discovery and more like a formalization of the human tendency to see what it anticipates. The hints of excess observed are, predictably, not conclusive; they are statistical whispers that demand further, inevitably more complex, filtering.
The true limitation isn’t statistical power, but the inherent difficulty of defining “normal.” Data-driven background estimation, for all its sophistication, relies on the assumption that the past accurately predicts the future. Inflation, in this context, is just collective anxiety about the unexpected. The next step isn’t necessarily larger datasets or more powerful algorithms, but a deeper consideration of the biases baked into the very process of observation: the comfortable narratives that prevent the truly novel from registering as signal.
Perhaps the most fruitful avenue lies not in chasing deviations from the Standard Model, but in modeling the modelers. Understanding the cognitive heuristics that shape experimental design and data analysis might reveal more about the “anomalies” than any collision event ever could. After all, every excess is merely a question of where one draws the line between noise and meaning, and that line is always, fundamentally, subjective.
Original article: https://arxiv.org/pdf/2603.23593.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-26 10:14