Over the next few weeks, OpenAI plans to release its new compact model, o3 mini, part of the o3 series, which boasts sophisticated reasoning abilities in areas such as math, science, and coding. According to CEO Sam Altman, the model performs exceptionally well, potentially surpassing the company's own o1 series. OpenAI also said it intends to launch the application programming interface (API) and a ChatGPT version of the new model at the same time.
Although specifics about the model are limited, there are indications that OpenAI discreetly financed and had access to the FrontierMath evaluation data, raising doubts about whether the company used that data to train its o3 model (as reported by Search Engine Journal). The model has posted impressive scores across numerous benchmarks; however, if the mounting concerns hold up, the published results could be an inflated representation of its actual performance.
It’s worth noting that OpenAI may have kept the mathematicians working on FrontierMath unaware of its involvement, since it financed the benchmark’s development covertly. Epoch AI is said to have disclosed the funding only in the final paper about the benchmark, which was later published on arXiv.org.
As recently highlighted in Reddit’s r/singularity subreddit:
The new math assessment known as FrontierMath, which is financially backed by OpenAI, has sparked concern following the revelation that OpenAI may have access to both the problems and their solutions. This is troubling because the benchmark was marketed to the public as a tool for evaluating state-of-the-art models, built with input from esteemed mathematicians. In truth, it appears that Epoch AI has been creating datasets for OpenAI, a connection that was not previously disclosed.
In essence, the credibility of the FrontierMath project is under scrutiny. While its purpose is to evaluate AI models, the issue is that OpenAI may already possess the questions and answers. That undermines the project, since other models would be at an unfair disadvantage if judged against it; it cannot serve as a fair benchmark for assessing the reasoning and output of competing models.
According to a post on Meemi’s Shortform about the FrontierMath project and OpenAI’s involvement:
The mathematicians who created problems for FrontierMath weren’t directly informed of OpenAI’s funding. Instead, they were told to keep the tasks and their solutions confidential, to avoid platforms like Overleaf or Colab, not to discuss them via email, and to sign non-disclosure agreements (NDAs). This was done to keep the questions secret and prevent leaks. Moreover, there was no communication about the OpenAI funding until December 20th. I suspect that some of the paper’s named authors may have been unaware of it.
Epoch AI’s associate director, Tamay Besiroglu, appears to support the points made earlier, but he mentioned a “holdout” set, which suggests that OpenAI did not have access to FrontierMath’s entire dataset.
On training data: we confirm that OpenAI has access to a significant portion of FrontierMath problems and their solutions, but an untouched set is kept hidden from OpenAI, which allows us to independently assess the model’s capabilities. There is a verbal agreement that these materials will not be used in model training.
OpenAI has also backed our decision to keep a hidden holdout dataset, an additional safeguard against overfitting and a way to guarantee accurate progress tracking. From the start, FrontierMath was conceived and introduced as an evaluation tool, and I believe these arrangements underscore that purpose.
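The holdout idea Besiroglu describes can be sketched as a simple partition of a benchmark: one portion may be shared with a model developer, while the holdout stays hidden so scores on it reflect capability rather than memorization. The function below is a hypothetical illustration only, not Epoch AI’s actual procedure; the names and the 20% split are assumptions.

```python
import random

def split_holdout(problems, holdout_frac=0.2, seed=0):
    """Partition a benchmark into a shared set and a hidden holdout.

    Illustrative sketch: the holdout is never given to the model
    developer, so evaluation on it cannot be inflated by training
    on (or memorizing) the test items.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = problems[:]
    rng.shuffle(shuffled)
    k = int(len(shuffled) * holdout_frac)
    holdout, shared = shuffled[:k], shuffled[k:]
    return shared, holdout

# Example: 100 hypothetical problem IDs, 20 held out
shared, holdout = split_holdout(list(range(100)))
```

The key property is that the two sets are disjoint and together cover the whole benchmark, so a score on the holdout is an independent check on a score obtained on the shared portion.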
By contrast, Elliot Glazer, the lead mathematician at Epoch AI, offers a somewhat different view. He maintains that OpenAI’s score is genuine, since the company did not train on the dataset, and that OpenAI has no incentive to fabricate its internal benchmarking results. Epoch AI is nevertheless building a separate test dataset to evaluate OpenAI’s o3 model, one to which OpenAI will have access to neither the problems nor the solutions. Even so, Glazer cautions that the claims can only be verified once Epoch AI’s independent evaluation is complete.
2025-01-22 13:40