
During a live stream announcing the generative AI model Grok 3, Elon Musk provoked OpenAI CEO Sam Altman and his research team by declaring it the smartest AI on Earth.
However, Musk is facing increasing skepticism from experts as he has failed to provide substantial evidence to back up his claim.
According to IT industry sources, Musk asserted on Friday that Grok 3 outperformed GPT-4, o3-mini-high, and Gemini 2.0 on math, science, and coding benchmarks. Yet he has not released any technical report or detailed information to support these assertions.
Experts are increasingly questioning Grok 3’s real-world performance.
Critics argue that benchmarks obtained under optimized conditions don’t accurately reflect practical AI capabilities. They point out that many of the math problems and specialized knowledge tests used in these benchmarks have little relevance to everyday applications.

Ethan Mollick, a professor at the University of Pennsylvania’s Wharton School, stated that benchmark tests have now fallen to the level of restaurant reviews.
Stanford University researchers noted that after examining more than 150 benchmark cases, they found evidence that test conditions had been manipulated and could not replicate the reported figures under different conditions. They also suggested that companies might inflate scores by selectively controlling those conditions.
OpenAI has accused xAI of deliberately omitting the “cons@64” score of OpenAI’s o3-mini-high model to exaggerate Grok 3’s performance on the American Invitational Mathematics Examination 2025 (AIME 2025) benchmark. In the cons@64 (consensus@64) method, the AI attempts each problem 64 times and the most frequent response is taken as its final answer.
OpenAI claims that when the cons@64 score is included, Grok 3 Reasoning Beta performs worse than OpenAI’s o3-mini-high and its existing o1 models.
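To make the cons@64 idea concrete, here is a minimal Python sketch of majority voting over repeated attempts. It is an illustration only, not the evaluation code used by xAI or OpenAI: the names `consensus_answer` and `fake_model` are hypothetical, and the scripted answers stand in for a real model's sampled outputs.

```python
from collections import Counter
from itertools import cycle


def consensus_answer(sample_answer, problem, n=64):
    """Ask the model for n answers and return the most frequent one.

    sample_answer is any callable that takes a problem and returns one
    answer (assumed here to be a hashable value such as a string).
    Ties go to whichever answer was seen first.
    """
    answers = [sample_answer(problem) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


# Hypothetical stand-in for a model: replays a fixed script of answers,
# so a single attempt can be wrong even though the majority is stable.
scripted = cycle(["204", "204", "210", "204", "198"])


def fake_model(problem):
    return next(scripted)


print(consensus_answer(fake_model, "AIME-style problem"))  # prints 204
```

Because the vote smooths over individual wrong attempts, a cons@64 score is usually higher than a single-attempt score from the same model, which is why comparing one model's cons@64 result against another's single-attempt result can be misleading.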
John Schulman, a senior researcher at OpenAI, commented that Grok 3’s results on the Massive Multitask Language Understanding (MMLU) benchmark have not been disclosed, which raises questions about its generalization ability.
The European Commission Joint Research Centre criticized major U.S. tech companies for overhyping results designed to attract investors, describing current AI performance evaluations as little more than marketing tools.
Adding to the controversy, Musk’s comment has embroiled Grok 3 in accusations of hypocrisy regarding censorship.
While Musk promoted Grok 3 by mocking the censorship features of China’s DeepSeek R1 and ChatGPT, it was revealed that Grok 3 also censored information related to Musk himself and President Donald Trump.
Grok 3’s system prompt included an instruction to disregard sources claiming that Elon Musk and Donald Trump spread misinformation, which directly contradicts its supposed principle of being an “uncensored AI.”
As the controversy grew, Igor Babuschkin, xAI’s engineering lead, attempted to deflect blame by claiming that an unnamed employee had mistakenly changed the system prompt.