Author: Denis Avetisyan
Researchers have developed a new environment and method for automated mathematical theory formation, allowing AI to explore and define interesting concepts with minimal human guidance.

This work introduces Fermat, a novel environment paired with EvoAbstract, an LLM-driven approach for learning interestingness measures to drive the discovery of new mathematical knowledge.
The pursuit of artificial intelligence capable of genuinely discovering new knowledge remains a central challenge, particularly in complex domains like mathematics. This paper, ‘Learning Interestingness in Automated Mathematical Theory Formation’, addresses this by introducing FERMAT, a reinforcement learning environment for modeling mathematical concept discovery and theorem proving. We demonstrate that automatically learning what constitutes a “meaningful” mathematical object – its interestingness – is crucial, and achieve notable progress using an LLM-driven evolutionary algorithm capable of synthesizing effective interestingness measures. Could this approach unlock automated systems that not only prove theorems, but also independently formulate and explore novel mathematical landscapes?
The Illusion of Progress: Mapping the Mathematical Wilderness
For centuries, the development of mathematical theory has been intrinsically linked to human intuition and pattern recognition. While profoundly successful, this reliance creates an inherent bottleneck in the rate of discovery; the exploration of mathematical landscapes is limited by the capacity and biases of individual researchers. A mathematician’s “hunch”, guided by experience and aesthetic preference, often directs the focus of investigation, potentially overlooking avenues that deviate from established norms. This isn’t a flaw in the process, but a fundamental constraint; the sheer vastness of mathematical possibility – encompassing infinite combinations of axioms, operations, and structures – far exceeds the scope of human exploration. Consequently, potentially groundbreaking theorems and entirely new branches of mathematics may remain hidden, simply because they don’t align with preconceived notions or intuitively “feel” promising. The limitations inherent in human-driven discovery highlight the potential for automated systems to complement, and perhaps even accelerate, the progress of mathematical knowledge by systematically exploring the full breadth of possible mathematical structures, even those defying immediate intuitive appeal.
Successfully automating mathematical discovery requires more than simply computational power; it necessitates a carefully constructed framework for navigating the vast landscape of mathematical possibilities. This framework must not only generate novel conjectures, but also assess their potential significance – a process akin to charting unexplored territories and identifying promising leads. Researchers are developing systems that represent mathematical concepts and relationships as data structures, enabling algorithms to systematically explore variations and combinations. Crucially, these systems employ heuristics and automated theorem provers to test conjectures, verifying their validity or identifying counterexamples. The ability to efficiently search this “mathematical space” – defined by axioms, definitions, and logical rules – is paramount, demanding innovative approaches to algorithm design and computational complexity. Ultimately, the goal is to create a self-improving system capable of formulating, testing, and refining mathematical ideas with minimal human intervention, potentially uncovering relationships previously hidden from even the most skilled mathematicians.
Defining mathematical “interestingness” presents a fundamental obstacle to automated discovery, as simply generating novel statements is insufficient; the system must discern those with the potential for genuine insight. A truly effective metric cannot rely on pre-programmed notions of importance, but must instead evaluate conjectures based on their structural properties, connections to existing mathematical frameworks, and potential to generalize. Researchers are exploring approaches that quantify complexity, symmetry, and the capacity to resolve long-standing open problems, or even to suggest entirely new avenues of inquiry. For example, a conjecture might be deemed “interesting” if it bridges seemingly disparate areas of mathematics, or if its proof necessitates the development of novel mathematical tools. Ultimately, the success of automated mathematical discovery hinges on creating an algorithm that mirrors, and perhaps even surpasses, human intuition in identifying worthwhile mathematical pursuits – a challenge that demands a nuanced understanding of both mathematics and the very nature of creativity itself.
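To make the idea concrete, here is a deliberately naive, hand-written interestingness score over the kinds of structural features mentioned above. Every feature name and weight is a hypothetical stand-in for illustration; it is not a measure from the paper, and the learned measures discussed later replace exactly this sort of hand-tuning.

```python
# A naive, hand-crafted interestingness score for a conjectured entity.
# All feature names and weights below are illustrative assumptions.
def naive_interestingness(entity):
    """Score an entity by a few structural features discussed in the text."""
    score = 0.0
    score += 2.0 * len(entity.get("connected_concepts", []))    # bridges between areas
    score += 1.0 if entity.get("generalizes_known_result") else 0.0
    score -= 0.1 * entity.get("syntactic_size", 0)               # penalize sheer complexity
    return score

candidate = {"connected_concepts": ["group theory", "number theory"],
             "generalizes_known_result": True,
             "syntactic_size": 12}
print(naive_interestingness(candidate))   # -> 3.8
```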

Fermat: A Symbolic Playground, Not a Revelation
Fermat represents mathematical theory formation as a Markov Decision Process (MDP), a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker. In this context, the “state” of the MDP represents the current mathematical knowledge, “actions” correspond to applying production rules to generate new conjectures, and “rewards” are assigned based on the validity of those conjectures as determined by a theorem prover. This formulation allows the application of reinforcement learning algorithms, enabling Fermat to learn a policy for selecting actions that maximize the cumulative reward, effectively guiding the search for mathematically sound and novel theorems. The MDP structure provides a formal basis for automating the process of mathematical discovery by framing it as a sequential decision problem.
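The following is a minimal sketch of how such a theory-formation MDP might look in code, assuming a state made of definitions and proved theorems, production rules as actions, and a prover callback as the reward signal. Class and method names are illustrative, not Fermat's actual API.

```python
# A minimal sketch of a theory-formation MDP; names are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class TheoryState:
    """Current mathematical knowledge: definitions and proven theorems."""
    definitions: tuple = ()
    theorems: tuple = ()

class TheoryFormationMDP:
    def __init__(self, production_rules, prover):
        self.production_rules = production_rules  # actions: rules proposing new statements
        self.prover = prover                      # validity oracle, e.g. an SMT solver wrapper
        self.state = TheoryState()

    def step(self, rule_index, arguments):
        """Apply one production rule; reward depends on whether the conjecture is proved."""
        conjecture = self.production_rules[rule_index](self.state, arguments)
        proved = self.prover(conjecture, self.state)
        if proved:
            self.state = TheoryState(self.state.definitions,
                                     self.state.theorems + (conjecture,))
        reward = 1.0 if proved else 0.0   # valid conjectures earn reward
        done = False                      # exploration is open-ended
        return self.state, reward, done
```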
Fermat employs Production Rules, which are structured as conditional statements of the form “IF condition THEN action”, to systematically generate new mathematical content. These rules operate on the existing mathematical knowledge represented within the system, allowing for the construction of novel definitions and conjectures. The “condition” component assesses the current state of the knowledge graph, identifying opportunities for expansion or modification, while the “action” component specifies the mathematical operation to be performed – such as defining a new function, formulating a theorem, or proposing a relationship between existing entities. The application of these rules is constrained by the defined mathematical domain, ensuring generated content remains within the specified scope and adheres to established axioms and principles. The rules themselves are represented as formal logical statements, enabling automated execution and verification.
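A toy production rule in the “IF condition THEN action” shape described above might look as follows. The state representation (definitions as dictionaries with a name and arity) and the rule itself are purely illustrative.

```python
# A toy production rule; the state layout and conjecture format are assumptions.
def composition_commutes_rule(state):
    """IF at least two unary functions are defined THEN conjecture f(g(x)) == g(f(x))."""
    funcs = [d for d in state.definitions if d.get("arity") == 1]
    if len(funcs) < 2:                       # condition not met: the rule does not fire
        return None
    f, g = funcs[0]["name"], funcs[1]["name"]
    return {"kind": "conjecture",
            "statement": f"forall x. {f}({g}(x)) == {g}({f}(x))"}
```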
Fermat incorporates the Z3 Theorem Prover as a critical validation step in its automated reasoning process. Z3, a high-performance satisfiability modulo theories (SMT) solver, is used to determine the logical validity of conjectures generated by Fermat’s system. When a new mathematical statement is proposed, it is translated into a first-order logic formula and submitted to Z3. Z3 then attempts to find a model that satisfies the negation of the statement; if no such model exists, the original statement is proven to be valid. This integration allows Fermat to move beyond simply generating hypotheses and provides a robust mechanism for confirming their mathematical truth, preventing the propagation of false or unproven assertions within its knowledge base.
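The validation pattern described here – accept a conjecture only if its negation is unsatisfiable – can be shown in a few lines with Z3's Python bindings. This is a generic example of refutation-based checking, not the paper's actual encoding of Fermat's conjectures.

```python
# Validity via refutation with Z3 (pip install z3-solver).
from z3 import Ints, Solver, Not, unsat

x, y = Ints("x y")
conjecture = x + y == y + x           # candidate statement: integer addition commutes

solver = Solver()
solver.add(Not(conjecture))            # ask Z3 for a counterexample
if solver.check() == unsat:            # no counterexample exists
    print("conjecture is valid")
else:
    print("counterexample:", solver.model())
```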
Fermat employs a Knowledge Graph to structurally represent established mathematical concepts, definitions, and relationships. This graph consists of nodes representing mathematical entities – such as functions, sets, or theorems – and edges denoting the relationships between them, like “is a subtype of” or “depends on”. The Knowledge Graph serves as the foundational basis for the system’s reasoning capabilities, enabling Fermat to traverse existing mathematical knowledge, identify relevant concepts for conjecture generation, and constrain the search space for novel insights. Specifically, the graph facilitates both forward and backward reasoning, allowing the system to explore potential generalizations or identify necessary preconditions for a given theorem, thereby directing the exploration towards mathematically plausible and potentially valuable areas.
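A small sketch of such a typed knowledge graph, with a backward-reasoning helper, is shown below. The entities, relation labels, and layout are illustrative assumptions rather than Fermat's internal schema.

```python
# A toy typed knowledge graph; entities and relations are illustrative.
knowledge_graph = {
    "nodes": {
        "Nat":      {"kind": "type"},
        "succ":     {"kind": "function", "signature": "Nat -> Nat"},
        "add":      {"kind": "function", "signature": "Nat -> Nat -> Nat"},
        "add_comm": {"kind": "theorem", "statement": "forall a b. add(a,b) == add(b,a)"},
    },
    "edges": [
        ("succ", "depends_on", "Nat"),
        ("add", "depends_on", "Nat"),
        ("add_comm", "depends_on", "add"),
    ],
}

def dependencies(graph, entity):
    """Backward reasoning helper: everything the entity directly depends on."""
    return [dst for src, rel, dst in graph["edges"]
            if src == entity and rel == "depends_on"]

print(dependencies(knowledge_graph, "add_comm"))   # -> ['add']
```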

EvoAbstract: Automating the Search for Useful Metrics (and Deluding Ourselves)
EvoAbstract is an evolutionary algorithm leveraging a Large Language Model (LLM) to automatically discover effective “Interestingness Measures”. These measures serve as the primary evaluation function within the algorithm, guiding the selection and refinement of programs generated during the evolutionary process. The LLM is utilized to both propose novel metrics and assess the quality of existing ones, effectively shaping the search for optimal evaluation criteria. This approach differs from manually designed metrics by enabling automated discovery tailored to the specific problem domain, in this case, mathematical exploration within the Fermat framework.
EvoAbstract utilizes an LLM-driven evolutionary algorithm to optimize “Interestingness Measures” used for guiding mathematical exploration. This process involves generating candidate metrics via the LLM, deploying agents within the Fermat environment that utilize these metrics, and evaluating agent performance as a fitness function. The LLM then refines the metrics based on this performance data, iteratively improving their effectiveness in identifying valuable mathematical entities.
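The propose-deploy-evaluate-refine loop can be sketched as follows. The callables llm_propose_metrics, run_fermat_agent, and llm_refine are stand-ins for the paper's components; only the overall control flow is illustrated here.

```python
# A hedged sketch of an LLM-driven evolutionary loop over interestingness metrics.
def evolve_interestingness(llm_propose_metrics, run_fermat_agent, llm_refine,
                           generations=10, population=8):
    metrics = llm_propose_metrics(n=population)            # candidate metric programs
    for _ in range(generations):
        # fitness = how well an agent guided by each metric explores the environment
        scored = [(run_fermat_agent(metric), metric) for metric in metrics]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        survivors = [metric for _, metric in scored[: population // 2]]
        # the LLM mutates / recombines the best metrics into the next generation
        metrics = survivors + llm_refine(survivors, n=population - len(survivors))
    return max(metrics, key=run_fermat_agent)
```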
Abstraction learning within EvoAbstract facilitates the identification and reuse of functional program components – termed “subroutines” – that emerge during the evolutionary process. This is achieved by analyzing evolved programs to detect recurring code patterns representing useful mathematical operations. These identified subroutines are then stored and made available for use in subsequent generations of programs, effectively providing a library of pre-built functionality. By leveraging these learned abstractions, the algorithm avoids redundant computation and accelerates the discovery of effective “Interestingness Measures” as it does not need to re-evolve the same basic operations repeatedly, significantly improving learning efficiency and reducing the time required to achieve optimal performance within the Fermat environment.
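A simplified illustration of this idea follows: recurring fragments across evolved programs are lifted into a reusable library. Real abstraction learning typically operates over syntax trees; counting repeated source lines keeps the mechanism visible while remaining only a sketch.

```python
# Lift code fragments that recur across evolved programs into a small library.
from collections import Counter

def extract_subroutines(programs, min_occurrences=2):
    """Return code fragments that appear in multiple evolved programs."""
    fragments = Counter()
    for source in programs:
        for line in source.splitlines():
            line = line.strip()
            if line:
                fragments[line] += 1
    return [frag for frag, count in fragments.items() if count >= min_occurrences]

library = extract_subroutines([
    "score = num_theorems(entity) / (1 + depth(entity))",
    "score = num_theorems(entity) / (1 + depth(entity))\nscore *= novelty(entity)",
])
print(library)   # the shared scoring expression becomes a reusable building block
```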
EvoAbstract’s learned interestingness measures demonstrate a substantial improvement in the efficiency of mathematical exploration within the Fermat framework. Quantitative evaluation on the arithmetic_base benchmark reveals a ground truth entity discovery rate of 23.98%. This metric represents the percentage of known mathematical entities correctly identified through programs guided by the evolved interestingness measures. The observed rate indicates a significant advancement over baseline methods, suggesting the learned measures effectively prioritize the investigation of promising mathematical concepts within the Fermat environment, leading to a higher success rate in entity discovery.

The Illusion of Automation: A New Tool for Old Problems
The automation of mathematical theory formation, long considered a uniquely human endeavor, has been demonstrated through the synergistic combination of two artificial intelligence systems: Fermat and EvoAbstract. This framework moves beyond simple theorem proving or pattern recognition, instead actively generating potentially novel mathematical relationships. By leveraging Fermat’s capacity for symbolic manipulation and EvoAbstract’s evolutionary algorithms, the system explores a vast landscape of mathematical expressions, assessing their “interestingness” and iteratively refining them towards potentially meaningful theorems. This process doesn’t rely on pre-defined hypotheses, but instead allows the system to independently formulate and test conjectures, effectively mimicking the creative spark of mathematical discovery. The success of this approach indicates a significant step towards artificial intelligence not merely assisting, but actively contributing to the expansion of mathematical knowledge, opening exciting possibilities for collaborative exploration between humans and machines in diverse mathematical domains.
The automated framework, leveraging a combination of Fermat and EvoAbstract, has demonstrated a remarkable capacity for independent mathematical discovery within established fields. Specifically, the system successfully identified novel relationships in both “Finite Fields” and “Elementary Number Theory”, suggesting a pathway towards accelerating mathematical research. Evaluation on a dedicated finite field benchmark revealed an average of 22.41 previously unknown ground truth entities – verifiable mathematical facts – were discovered by the system. This performance underscores the potential of automated methods not merely to verify existing conjectures, but to actively contribute to the expansion of mathematical knowledge, offering a powerful tool for exploring complex mathematical landscapes and generating new hypotheses.
The utility of the developed framework extends beyond the realm of number theory, demonstrating a capacity for broad applicability across diverse mathematical landscapes. Analysis reveals that the learned measures of mathematical “interestingness” – those qualities which suggest a potentially fruitful avenue of investigation – aren’t specific to any single domain. When applied to benchmarks focused on successor functions and equality – yielding an average of 10.23 ground truth entities discovered – and to a collection of other mathematical challenges – achieving 11.34 discoveries – the system consistently identifies novel relationships. This suggests a fundamental principle has been captured: a generalized approach to discerning patterns and formulating conjectures that transcends specific mathematical structures, promising a powerful tool for automated discovery across a wide spectrum of mathematical inquiry.
The convergence of automated reasoning and machine learning signals a paradigm shift in mathematical discovery, moving beyond computation to genuine theory formation. This research establishes a framework where algorithms not only verify existing conjectures but also propose and explore novel relationships, effectively acting as collaborators in the investigative process. By successfully identifying previously unknown connections within established mathematical fields, such as finite fields and number theory, the system demonstrates the potential to augment human intuition and accelerate the pace of mathematical advancement. This isn’t about replacing mathematicians, but rather providing them with powerful tools to navigate the vast landscape of mathematical possibilities, fostering a synergistic relationship that promises to unlock deeper insights and expand the boundaries of human knowledge, and potentially revolutionizing how mathematical truths are revealed.

The pursuit of automated mathematical theory formation, as demonstrated by Fermat and EvoAbstract, inevitably courts the illusion of progress. This research attempts to codify “interestingness” – a concept notoriously subjective and context-dependent. It’s a valiant effort, yet one that echoes a familiar refrain. As Brian Kernighan observed, “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” The elegance of EvoAbstract’s LLM-driven approach to discovering mathematical concepts will undoubtedly encounter the brutal realities of production – edge cases, unforeseen interactions, and the inherent messiness of formal systems. The system will discover what can be proven, not necessarily what is meaningful, and the logs will tell the tale.
The Road Ahead
Fermat, as presented, feels less like a destination and more like the inevitable staging ground for a new class of production issues. The elegance of LLM-driven evolution will, predictably, encounter the blunt reality of mathematical rigor. Interestingness, it turns out, is a moving target, dependent not on intrinsic quality but on the current limitations of the search space. The knowledge graph will grow, yes, but also become a sprawling, undocumented codebase.
The true challenge isn’t generating novel concepts – that’s now largely a question of compute – but filtering the signal from the noise. Any interestingness measure, no matter how cleverly learned, will eventually reward trivial variations and pathological edge cases. Future work will undoubtedly focus on meta-interestingness – measures that assess the robustness of interestingness measures themselves. A necessary, and likely endless, cycle.
One anticipates a proliferation of specialized Fermats, each tailored to a narrow mathematical domain, and each accruing its own unique legacy of unsolved problems and questionable axioms. The pursuit of automated theory formation is not about finding the truth, but about creating increasingly complex systems that demonstrate, with impressive consistency, proof of life.
Original article: https://arxiv.org/pdf/2511.14778.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/