Can AI Discover the Laws of Physics?

Author: Denis Avetisyan


A new benchmark tests whether large language model agents can independently deduce fundamental physical principles through simulated experimentation.

The study introduces a benchmarking pipeline in which an agent discovers physical laws through iterative experimentation and parameter fitting within a simulated N-body environment. Rather than relying on pre-existing datasets, the agent requests on-demand data from the simulator for each experiment it proposes, a process that may span n rounds of inquiry. It then submits a final law and an accompanying textual explanation, which are scored to evaluate how well it has discerned the underlying physics of the world.

DiscoverPhysics assesses LLM agents’ ability to perform scientific discovery, evaluate experimental designs, and demonstrate conceptual understanding of force laws in a simulated environment.

While large language models excel at recalling established scientific knowledge, distinguishing genuine reasoning from recall remains a challenge. To address this, we introduce DiscoverPhysics, a benchmark designed to evaluate LLM agents’ ability to discover the underlying physics of simulated worlds through experimentation. This framework challenges agents to design experiments, analyze trajectory data from N-body simulations, and articulate both a predictive model and a natural-language explanation of novel force laws governing these worlds. Our evaluation reveals that even leading models struggle to uncover latent structure and that predictive accuracy doesn’t always correlate with conceptual understanding, raising the question of how best to cultivate true scientific reasoning in artificial intelligence.


The Imperative of Causal Inference in Scientific Discovery

While contemporary artificial intelligence systems demonstrate remarkable proficiency in identifying patterns within vast datasets, a fundamental limitation persists in their capacity to discern the underlying causal relationships that govern those patterns. This distinction is critical, as scientific inquiry doesn’t merely seek to observe correlation, but to establish why certain phenomena occur. Current AI often excels at predictive tasks – forecasting outcomes based on learned associations – yet struggles when tasked with formulating explanations or extrapolating knowledge to novel situations. This inability to move beyond statistical correlation represents a significant hurdle in automating the scientific process, as genuine discovery demands an understanding of cause and effect, and the capacity to build models that reflect these relationships, rather than simply mirroring observed data.

The pursuit of fundamental physical laws extends far beyond simply identifying correlations within data; it necessitates a robust capacity for hypothesis generation and rigorous experimental validation. A system capable of discerning true physical principles must not only observe patterns, but actively propose explanations – testable predictions about how a system should behave. This demands an iterative process of conjecture and refutation, where proposed hypotheses are subjected to carefully designed experiments – or simulations – to determine their validity. Crucially, the ability to formulate falsifiable hypotheses – statements that can be proven wrong – is paramount, as it allows for the systematic elimination of incorrect theories and the refinement of understanding. Without this capacity for independent inquiry, a system remains limited to descriptive analysis, unable to unlock the underlying causal mechanisms that govern the natural world.

The development of artificial intelligence capable of independent scientific reasoning represents a significant leap beyond current capabilities, which largely rely on identifying patterns within existing datasets. True scientific discovery demands more than passive observation; it requires agents that can actively formulate hypotheses, design experiments to test those hypotheses, and interpret the results to refine or reject initial assumptions. This means giving AI the ability not simply to detect correlations, but to understand causal relationships and generalize that understanding to novel situations, a process mirroring the iterative cycle of scientific inquiry. Such agents would move beyond data analysis, becoming active participants in the pursuit of knowledge and potentially accelerating breakthroughs in fields governed by complex physical laws.

Current artificial intelligence systems, while adept at identifying patterns within known physical systems, frequently falter when tasked with extrapolating to entirely new, previously unseen physics. This limitation underscores a critical need for more robust evaluation metrics and dedicated benchmarks; the recently introduced DiscoverPhysics benchmark illustrates this point, revealing that state-of-the-art methods achieve a success rate of only around 50% when presented with novel physical scenarios. This relatively low score indicates a significant gap between current AI capabilities and the demands of genuine scientific discovery, suggesting that existing approaches prioritize correlation over causal reasoning and lack the capacity to formulate and test independent hypotheses – essential components for unlocking the underlying principles governing unfamiliar physical laws.

Claude Opus 4.7 successfully discovered the underlying force law in r(t). After initially probing the system naively, it ran long-timescale experiments, strategically choosing a timescale that exposed the time-dependent force. Its accurate prediction of the force law on unseen test particles demonstrates the discovery.

An Iterative Framework for AI-Driven Hypothesis Refinement

The DiscoverPhysics framework utilizes an iterative experimentation loop to facilitate AI-driven physics discovery. This loop consists of a Large Language Model (LLM) agent that autonomously designs experiments, specifying parameters for an N-body simulator. The simulator then generates observational data based on these parameters, which is subsequently analyzed by the LLM agent. This analysis informs the agent’s hypothesis refinement and subsequent experiment proposals, creating a closed-loop system for exploring physical phenomena. The process repeats, allowing the agent to iteratively improve its understanding of the simulated environment through empirical observation and analysis, rather than relying on pre-programmed knowledge or assumptions.

The DiscoverPhysics framework enables Large Language Models (LLMs) to function as active experimental scientists within a simulated environment. The LLM iteratively proposes experiments – defined by parameters influencing the N-body simulation – and then analyzes the resulting data to evaluate hypotheses about the underlying physical laws. This process of proposal, simulation, and analysis constitutes a closed-loop system where the LLM refines its internal model of the physics through repeated trial and error. The framework does not rely on pre-programmed knowledge of the physics; instead, the LLM learns by observing the consequences of its experimental interventions, effectively discovering the governing principles through empirical observation and iterative refinement of its predictive capabilities.
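The propose, simulate, analyze loop described above can be sketched in a few lines. This is a minimal illustration rather than the benchmark's actual code: `simulate`, `propose_experiment`, `fit_exponent`, and `run_discovery` are hypothetical names, the hidden law is an assumed power-law force, and a scripted policy stands in for the LLM agent.

```python
import numpy as np

# Assumption for illustration: the hidden world obeys |F| = r**p with p = -3.
HIDDEN_EXPONENT = -3.0


def simulate(separations):
    """Hypothetical simulator: force magnitude at each requested separation."""
    r = np.asarray(separations, dtype=float)
    return r ** HIDDEN_EXPONENT


def propose_experiment(round_index):
    """Scripted stand-in for the LLM: probe a wider range of separations each round."""
    return np.geomspace(1.0, 10.0 * (round_index + 1), num=8)


def fit_exponent(r, f):
    """Least-squares fit of log f = p * log r + c; returns the exponent p."""
    slope, _intercept = np.polyfit(np.log(r), np.log(f), deg=1)
    return float(slope)


def run_discovery(n_rounds=3):
    """Run n_rounds of propose -> simulate -> analyze and return the final estimate."""
    all_r, all_f = [], []
    estimate = None
    for round_index in range(n_rounds):
        r = propose_experiment(round_index)   # agent designs an experiment
        f = simulate(r)                       # simulator generates data on demand
        all_r.append(r)
        all_f.append(f)
        # Agent refits its hypothesis on everything observed so far.
        estimate = fit_exponent(np.concatenate(all_r), np.concatenate(all_f))
    return estimate


if __name__ == "__main__":
    print(f"estimated exponent: {run_discovery():.3f}")
```

In a real run the scripted `propose_experiment` would be replaced by a call to the agent, and the fitted law would be submitted together with a textual explanation for scoring.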

The DiscoverPhysics benchmark incorporates a range of simulated environments to assess the LLM agent’s adaptability and generalization capabilities. These environments begin with a relatively simple Two-Particle System, allowing for initial evaluation of basic physical reasoning. Increasing complexity is introduced through scenarios featuring Hidden Particle Species, which are not directly observable but influence the system’s dynamics. The agent must infer the existence and properties of these hidden particles through indirect observation of their effects on visible particles, requiring more sophisticated analytical skills and hypothesis testing beyond simple trajectory prediction.

Agent performance within the DiscoverPhysics framework is assessed using two primary metrics: Trajectory Mean Squared Error (MSE) for predictive accuracy and an Explanation Score for conceptual understanding. The Explanation Score measures the agent’s ability to correctly identify the underlying physics governing the simulation. On this score, claude-opus-4-7 achieves a Pass@3 rate of 45% and gpt-5.5 achieves 64%, where Pass@3 is the probability that at least one of three generated explanations is correct. This dual evaluation is intended to distinguish agents that merely fit trajectories from agents that have learned the principles of the simulated physical system.
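To make the two metrics concrete, here is a small sketch of how they could be computed. The function names and the data layout (one list of pass/fail attempts per task) are assumptions for illustration; the source does not specify the benchmark's exact scoring code.

```python
import numpy as np


def trajectory_mse(predicted, observed):
    """Mean squared error over all time steps, particles, and coordinates."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.mean((predicted - observed) ** 2))


def pass_at_k(attempt_results, k=3):
    """Fraction of tasks where at least one of the first k attempts is correct.

    attempt_results: one inner list of booleans per task (assumed layout).
    """
    solved = [any(attempts[:k]) for attempts in attempt_results]
    return sum(solved) / len(solved)


if __name__ == "__main__":
    # Four hypothetical tasks, three explanation attempts each.
    results = [
        [False, True, False],
        [False, False, False],
        [True, True, True],
        [False, False, False],
    ]
    print(pass_at_k(results, k=3))
    print(trajectory_mse([[0, 0], [1, 1]], [[0, 0], [1, 3]]))
```

Reporting both numbers side by side is what allows the benchmark to show cases where a low trajectory error coexists with an incorrect explanation.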

Despite extensive experimentation, Claude Opus 4.7 failed to identify the true force law governing the oscillator world. It converged on an incorrect model because it focused on short timescales and misinterpreted its initial observations. The comparison shows the ground-truth force r(t) against the model’s proposed law for both training and held-out particles.

Beyond Canonical Physics: A Test of True Generalization

DiscoverPhysics presents agents with physical simulations governed by non-canonical laws, deviating from the typically observed Newtonian or harmonic systems. Specifically, the environment incorporates the Yukawa Potential, describing short-range attractive forces, the Screened Potential, modeling interactions diminished by intervening media, and the Fractional Laplacian, which introduces non-local behavior to diffusion or wave propagation. These potentials necessitate inference beyond memorized physical models; the agent must deduce governing principles from observational data as these laws are not standard components of typical physics training datasets. The use of these alternative physics laws serves as a key benchmark for evaluating an agent’s capacity for scientific discovery and generalization.
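For reference, the textbook form of the Yukawa potential and the radial force it implies are given below. The coupling g and range λ are generic symbols; the source does not state the exact parameterization used in the benchmark.

```latex
V(r) = -g^{2}\,\frac{e^{-r/\lambda}}{r},
\qquad
F_r(r) = -\frac{dV}{dr}
       = -g^{2}\,e^{-r/\lambda}\left(\frac{1}{r^{2}} + \frac{1}{\lambda r}\right)
```

As λ grows without bound the exponential factor tends to 1 and the 1/(λ r) term vanishes, so the force reduces to the familiar inverse-square law. An agent that only probes separations much smaller than λ may therefore mistake a Yukawa world for a Newtonian one.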

DiscoverPhysics challenges Large Language Models (LLMs) with scenarios that require reasoning beyond simple recall of pre-existing data. The environments presented, governed by non-canonical physics such as the Yukawa or Screened Potential, demand that the LLM identify underlying physical principles from limited observations rather than retrieve established solutions. This requires the model to generalize from the provided data, effectively performing inductive reasoning to determine how these unfamiliar systems behave, a capability distinct from pattern matching or knowledge retrieval. The success of the LLM is therefore measured not by what it knows, but by its ability to infer the governing rules from novel conditions.

Environments incorporating concepts of extra dimensions and the Hubble Flow present significant challenges to agents requiring extrapolation and complex reasoning. The introduction of extra dimensions alters the standard inverse-square law of gravity and necessitates adjustments to calculations of gravitational and electromagnetic forces; agents must infer these alterations from limited observational data. Similarly, the Hubble Flow, describing the expansion of the universe, introduces a velocity component proportional to distance, impacting the observed redshift and apparent brightness of distant objects. Accurate prediction within these scenarios demands that agents move beyond memorized physical laws and deduce the governing principles from sparse, potentially noisy, data, effectively testing their capacity for scientific inference and model building.
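A short sketch of the two effects mentioned above, under stated assumptions: with n non-compact extra spatial dimensions, Gauss's law gives a force falling off as 1/r^(2+n), and a pure Hubble flow gives each particle a velocity v = H r. The function names and the value of H are illustrative and not taken from the benchmark.

```python
import numpy as np


def force_exponent(n_extra_dims):
    """Power p in F ~ r**p for 3 + n_extra_dims spatial dimensions (Gauss's law)."""
    return -(2 + n_extra_dims)


def estimate_hubble_rate(positions, velocities):
    """Least-squares fit of v = H * r through the origin, pooling all components."""
    r = np.asarray(positions, dtype=float).ravel()
    v = np.asarray(velocities, dtype=float).ravel()
    return float(np.dot(r, v) / np.dot(r, r))


# Synthetic data: 50 particles in 3D following a pure Hubble flow.
rng = np.random.default_rng(0)
positions = rng.uniform(-5.0, 5.0, size=(50, 3))
true_rate = 0.7  # assumed value, for illustration only
velocities = true_rate * positions  # no peculiar velocities, no noise

if __name__ == "__main__":
    print(force_exponent(1))
    print(estimate_hubble_rate(positions, velocities))
```

With noisy observations the same estimator still applies, but the recovered rate carries an uncertainty that the agent would need to reduce by sampling particles at larger distances.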

The Simple Harmonic Oscillator (SHO) potential, described by the equation $V(x) = \frac{1}{2}kx^2$, where k represents the spring constant and x the displacement, serves as a crucial benchmark within the DiscoverPhysics environment. This well-established physical system allows for a quantifiable assessment of an agent’s capacity to correctly identify fundamental physical principles. By evaluating performance on the SHO potential, researchers can establish a baseline for comparison against more complex, non-canonical physics scenarios; deviations from expected behavior on the SHO indicate limitations in the agent’s ability to generalize learned physics or perform basic physical reasoning, highlighting areas requiring improvement in the model’s understanding of fundamental laws.
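As an illustration of what recovering the SHO law from trajectory data can look like, the sketch below integrates m x'' = -k x with velocity Verlet and then estimates k from the positions alone. The integrator, step size, and function names are assumptions chosen for the example, not details from the benchmark.

```python
import numpy as np


def simulate_sho(k, m=1.0, x0=1.0, v0=0.0, dt=1e-3, n_steps=5000):
    """Integrate m x'' = -k x with velocity Verlet; returns the position series."""
    x = np.empty(n_steps)
    pos, vel = x0, v0
    acc = -k * pos / m
    for i in range(n_steps):
        x[i] = pos
        pos = pos + vel * dt + 0.5 * acc * dt**2   # position update
        new_acc = -k * pos / m                     # force at the new position
        vel = vel + 0.5 * (acc + new_acc) * dt     # velocity update
        acc = new_acc
    return x


def estimate_spring_constant(x, dt, m=1.0):
    """Fit k in a = -(k/m) x, with a from central second differences of x."""
    acc = (x[2:] - 2.0 * x[1:-1] + x[:-2]) / dt**2
    mid = x[1:-1]
    return float(-m * np.dot(mid, acc) / np.dot(mid, mid))


if __name__ == "__main__":
    trajectory = simulate_sho(k=4.0)
    print(estimate_spring_constant(trajectory, dt=1e-3))
```

The estimate uses only observed positions, which matches the setting in which an agent sees trajectories rather than forces.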

Performance on the Yukawa and Ether worlds degrades for both claude-opus-4-7 and gpt-5.5 as observation noise, expressed as a fraction of total trajectory variance, increases: evaluation scores fall and normalized mean squared error (MSE) rises.
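One plausible reading of "noise as a fraction of total trajectory variance" and of a normalized MSE is sketched below. The Gaussian noise model, the normalization by the variance of the reference trajectory, and the function names are all assumptions; the source does not give the exact definitions.

```python
import numpy as np


def add_observation_noise(trajectory, noise_fraction, rng):
    """Add Gaussian noise with variance = noise_fraction * variance of the trajectory."""
    trajectory = np.asarray(trajectory, dtype=float)
    sigma = np.sqrt(noise_fraction * np.var(trajectory))
    return trajectory + rng.normal(0.0, sigma, size=trajectory.shape)


def normalized_mse(predicted, reference):
    """MSE divided by the variance of the reference trajectory."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.mean((predicted - reference) ** 2) / np.var(reference))


# Synthetic oscillating trajectory with 10% noise, for illustration only.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 20000)
clean = np.cos(2.0 * t)
noisy = add_observation_noise(clean, noise_fraction=0.1, rng=rng)

if __name__ == "__main__":
    print(normalized_mse(noisy, clean))
```

Under these definitions a prediction that simply reproduces the noisy observations has a normalized MSE close to the noise fraction, which gives a natural floor against which an agent's fitted law can be compared.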

The Dawn of Automated Scientific Inquiry

DiscoverPhysics shows that AI agents can be evaluated not merely on analyzing existing data but on independently formulating and testing hypotheses, in effect acting as virtual scientists. When exposed to simulated physical systems, current agents recover the governing law in some worlds and fail in others, so the process mirrors human scientific inquiry only in part. Even so, the benchmark points toward a future in which AI actively participates in the scientific process, potentially accelerating breakthroughs in fields ranging from materials science to the study of the cosmos.

Current scientific workflows often rely on human researchers to analyze existing data, identify patterns, and then propose hypotheses for further investigation. However, a new paradigm is emerging where artificial intelligence isn’t merely a tool for data analysis, but an active participant in the scientific process itself. Recent advancements allow AI agents to autonomously formulate hypotheses – essentially, educated guesses about how the universe works – and then, crucially, to design experiments to test those hypotheses. This isn’t simply about finding correlations within datasets; it’s about proactive exploration, where the AI determines what questions need answering and how to answer them, potentially uncovering previously unknown relationships and accelerating the pace of discovery across multiple scientific disciplines. This shift towards AI-driven experimentation represents a fundamental change in how science is conducted, moving beyond observation and analysis towards a more dynamic and iterative process of automated inquiry.

A newly established benchmark is proving crucial for gauging the reasoning capabilities of artificial intelligence as it ventures into scientific discovery. This framework doesn’t simply assess whether an AI can identify known patterns, but actively tests its ability to formulate and validate hypotheses – a hallmark of true scientific inquiry. Current AI models, when subjected to this rigorous evaluation, successfully navigate approximately half of the presented challenges, indicating a significant, yet incomplete, level of reasoning proficiency. This success rate offers a concrete metric for tracking advancements in the field, allowing researchers to pinpoint areas where AI reasoning still falls short and to measure the impact of new algorithmic developments as they emerge. The benchmark’s standardized approach promises to accelerate progress by providing a common ground for comparing different AI architectures and fostering innovation in this rapidly evolving landscape.

The principles demonstrated by DiscoverPhysics extend far beyond the realm of fundamental physics, holding the potential to revolutionize scientific inquiry across numerous disciplines. Researchers envision applying these AI-driven discovery methods to accelerate materials science, enabling the design of novel compounds with tailored properties, and to unlock mysteries in astrophysics, such as the behavior of dark matter or the formation of galaxies. By automating hypothesis generation and experimental design, this approach promises to drastically reduce the time and resources required for scientific breakthroughs. Furthermore, the ability of AI to identify subtle patterns and correlations in complex datasets could reveal previously unknown relationships, leading to innovative solutions in fields like drug discovery and climate modeling. The automation of these processes represents a paradigm shift, moving beyond human-led experimentation to a future where AI actively collaborates in, and even leads, the pursuit of knowledge.

The pursuit of DiscoverPhysics, as detailed in the article, rigorously tests an agent’s capacity for inductive reasoning – a trait central to scientific progress. This resonates with the observation of Thomas Hobbes: “The chain of consequences… is all that can be distinctly understood.” The benchmark doesn’t merely assess if an LLM predicts outcomes, but whether it discerns the underlying force laws governing the simulated world. This echoes Hobbes’s emphasis on a clear causal chain; the agent must demonstrate an understanding of why things happen, not just that they happen. The benchmark’s focus on conceptual understanding, alongside predictive accuracy, is therefore a validation of the need for demonstrable, logical connections in any system claiming to ‘think’ scientifically.

Beyond Empirical Performance

The emergence of DiscoverPhysics, while a necessary step, merely formalizes the question, not its answer. The benchmark’s capacity to differentiate between agents exhibiting superficial competence and genuine conceptual grasp remains to be fully tested. One must ask: does achieving predictive accuracy necessitate understanding of the underlying force laws, or simply a sophisticated pattern-matching ability? The former implies a pathway towards generalizable scientific reasoning; the latter, a cleverly disguised form of memorization.

Future work should prioritize benchmarks designed to expose, rather than obscure, the limitations of these agents. Evaluating performance on deliberately ambiguous or subtly altered physical scenarios will be crucial. The field risks being seduced by increasingly complex optimization schemes devoid of theoretical grounding – optimization without analysis is, after all, self-deception. A truly robust agent should not simply predict outcomes, but explain them with mathematical precision.

Ultimately, the challenge lies in constructing a framework that can rigorously assess the provability of an agent’s scientific reasoning. Success will not be measured by achieving high scores on curated datasets, but by the elegance and generality of the discovered laws – a testament to the inherent mathematical beauty of the universe, and a reflection of the agent’s capacity to perceive it.


Original article: https://arxiv.org/pdf/2605.26087.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-05-26 07:24