A new math benchmark has just been released and leading AI models are only able to solve "less than 2%" of the problems... oh no!
I sometimes forget that AI models are used for more than just simple tasks like research and content summaries. In the land of the bigwigs, AI models are used to assist with everything from scientific research to financial analysis. That's why their mathematical abilities matter so much.
This is why mathematical benchmarks are necessary. Epoch AI's newly released benchmark, FrontierMath, is putting LLMs to the test with "hundreds" of "original, expert-crafted math problems designed to evaluate advanced reasoning abilities in AI systems".
Epoch AI reports that while today's AI models perform well on other mathematical benchmarks like GSM8K or MATH, they "solve less than 2%" of FrontierMath questions. This exposes a significant gap between current AI capabilities and the collective prowess of the mathematics community.
These are difficult problems; so difficult, in fact, that expert mathematicians take hours or even days to solve them.
This benchmark is different because it requires "extended chains of precise reasoning, with each step building on what came before".
AI models are not very good at advanced math, let alone extended reasoning. This makes sense if you think about what AI models are doing at their core. LLMs, for example, are trained on tons of data in order to predict what the next word will most likely be. The process is essentially a probabilistic one, even though there are many ways to direct the model towards different words.
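To make that concrete, here's a minimal sketch of next-word selection. The vocabulary and scores are invented for illustration, but the mechanics are the standard ones: the model assigns a score to every candidate word, softmax turns scores into probabilities, and a temperature knob is one of the "many ways to direct" which word gets picked.

```python
import numpy as np

# Hypothetical scores (logits) a model might assign to candidate next words
# after a prompt like "The integral of 1/x is ...".
vocab = ["ln(x)", "x^2", "undefined", "log(x)", "banana"]
logits = np.array([4.1, 0.3, 1.2, 3.5, -2.0])

def sample_next_word(logits, temperature=1.0, rng=np.random.default_rng()):
    """Turn raw scores into a probability distribution and sample from it."""
    scaled = logits / temperature          # lower temperature -> more deterministic
    probs = np.exp(scaled - scaled.max())  # softmax, shifted for numerical stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

for t in (0.2, 1.0, 2.0):
    picks = [vocab[sample_next_word(logits, t)] for _ in range(5)]
    print(f"temperature={t}: {picks}")
```

At low temperature the model almost always picks the highest-scoring word; at high temperature it wanders, which is exactly why "probabilistic" and "reliable multi-step math" sit in tension.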
Lately, however, we have seen AI models apply this probabilistic "thinking" in a more directed manner, breaking problems into intermediate steps. We've seen AI models try to reason their way through their own reasoning, rather than jumping straight to a probabilistic conclusion.
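As a rough illustration of what "more directed" means in practice, here's a sketch of the prompting difference. The `ask` function is hypothetical, standing in for any chat-style LLM API call:

```python
# Hypothetical helper standing in for a call to any chat-style LLM API.
def ask(prompt: str) -> str:
    return "<model response>"

# Direct prompting: the model jumps straight to a (probabilistic) answer.
direct = ask("What is 17 * 24?")

# Step-by-step prompting: the model is nudged to produce intermediate
# reasoning, and each written-out step conditions the predictions that follow.
stepwise = ask(
    "What is 17 * 24? Think step by step: "
    "break the multiplication into parts, then combine them."
)
```

The trick is that the intermediate steps become part of the context, so later next-word predictions are conditioned on earlier reasoning instead of leaping straight to a final token.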
There's a new version of ChatGPT, a successor to GPT-4o, that uses this kind of reasoning. (And you'd better not question it.) It's also interesting that you can be rewarded for contributing a question the AI cannot answer to "Humanity's Last Exam".
Of course, these individual steps of reasoning might themselves be arrived at probabilistically (and could we expect any more from a non-sentient algorithm?), but they do seem to be engaging in what we flesh-and-bloodies would, after the fact, consider to be "reasoning".
But we're still a long way from seeing these AI models reach the level of reasoning that our brightest and best can achieve. Now that we have a mathematical standard that can really put them to the test, 2% isn't that great. (And take that, robots.)
Terence Tao, Fields Medalist, told Epoch AI that he believes the only way to solve FrontierMath problems in the near term, short of having a real domain expert in the area, is a combination of a semi-expert (say, a graduate student in a related field), perhaps paired with some combination of a modern AI and other computer algebra packages.
The FrontierMath benchmark is a good litmus test for future improvements. It ensures that the models are not just spitting out plausible-sounding mathematical nonsense of the kind only experts could catch.
In the end, we must remember that AI does not aim for truth, no matter how carefully we humans steer its probabilistic reasoning towards results that tend toward the truth. The philosopher in me asks: can truth exist for an AI that has no inner life aiming at truth, even if it spews true statements out? Truth for us, yes, but what about for the AI? I don't think so, and this is why benchmarks such as these will be critical moving forward into this new industrial revolution, or whatever it's called these days.