New secret math benchmark stumps AI models and PhDs alike

Epoch AI allowed Fields Medal winners Terence Tao and Timothy Gowers to review portions of the benchmark. “These are extremely challenging,” Tao said in feedback provided to Epoch. “I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages.”

A chart showing AI models’ limited success on the FrontierMath problems, taken from Epoch AI’s research paper. Credit: Epoch AI

To make answers verifiable during testing, each FrontierMath problem must have an answer that can be checked automatically through computation, either as an exact integer or as another mathematical object. The designers made problems “guessproof” by requiring large numerical answers or complex mathematical solutions, so that a random guess has less than a 1 percent chance of being correct.
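As a rough illustration of what automatic checking and “guessproof” answers can mean in practice, here is a minimal Python sketch. It is not Epoch AI’s actual grading code; the function names `is_correct` and `random_guess_chance` are hypothetical, and the guess estimate assumes a guesser picks uniformly from a known pool of plausible answers.

```python
from fractions import Fraction


def is_correct(submitted, expected):
    """Exact check: same type and equal value, with no floating-point tolerance.

    Hypothetical helper for illustration, not Epoch AI's grader.
    """
    return type(submitted) is type(expected) and submitted == expected


def random_guess_chance(num_plausible_answers):
    """Chance that one uniform random guess is right (an assumed model)."""
    return Fraction(1, num_plausible_answers)


# A large exact integer is compared digit for digit.
expected = 2**61 - 1
print(is_correct(2**61 - 1, expected))         # exact integer match
print(is_correct(float(2**61 - 1), expected))  # a float approximation is rejected

# With a million plausible answers, a blind guess is far below 1 percent.
print(random_guess_chance(10**6) < Fraction(1, 100))
```

Under this assumed model, any answer space with more than 100 equally plausible candidates already pushes a blind guess below the 1 percent threshold; large numerical answers make the pool far bigger than that.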

Mathematician Evan Chen, writing on his blog, explained how he thinks that FrontierMath differs from traditional math competitions like the International Mathematical Olympiad (IMO). Problems in that competition typically require creative insight while avoiding complex implementation and specialized knowledge, he says. But for FrontierMath, “they keep the first requirement, but outright invert the second and third requirement,” Chen wrote.

While IMO problems avoid specialized knowledge and complex calculations, FrontierMath embraces them. “Because an AI system has vastly greater computational power, it’s actually possible to design problems with easily verifiable solutions using the same idea that IOI or Project Euler does—basically, ‘write a proof’ is replaced by ‘implement an algorithm in code,'” Chen explained.
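To make Chen’s comparison concrete, here is a toy example in the Project Euler style, where the “solution” is a single number produced by running code rather than a written proof. This is far simpler than anything in FrontierMath and is only meant to show the format; the function name `sum_of_multiples_below` is made up for this sketch.

```python
def sum_of_multiples_below(limit):
    """Sum every non-negative integer below `limit` divisible by 3 or 5."""
    return sum(n for n in range(limit) if n % 3 == 0 or n % 5 == 0)


# The answer is one exact integer, so checking a submission is a
# simple equality test against the stored reference value.
reference_answer = sum_of_multiples_below(1000)
submitted_answer = 233168
print(submitted_answer == reference_answer)
```

The grader never needs to read the reasoning behind a submission; it only compares the final value, which is what makes this style of problem easy to verify at scale.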

Epoch AI plans to evaluate AI models against the benchmark regularly while expanding its problem set. The organization says it will release additional sample problems in the coming months to help the research community test its systems.
