Friday, May 1, 2026

North Korea, Angered by South Korea-Cuba Diplomacy, Removes Cuba From Participation List for the ‘Sun Festival’

North Korea has removed Cuba from the list of participating countries in the Spring Friendship Art Festival.

Samsung’s Unannounced Galaxy Buds3 FE Briefly Listed on Official Website

Samsung's Galaxy Buds3 FE leak hints at features like ANC, a high-capacity battery, and a potential price of $129.

Fast-Track AI Commercialization for Chronic Disease Management

The government plans to integrate AI in healthcare for chronic disease management, enhancing patient care and supporting market entry of AI solutions.

Is o3 Really That Smart? Experts Question OpenAI’s AI Scores

OpenAI’s new inference model, o3 / Digital Today

TechCrunch reported that OpenAI’s o3 AI model is causing a stir in the tech world due to discrepancies between its benchmark scores and real-world performance.

When OpenAI unveiled o3 last December, they boldly claimed it could tackle 25% of FrontierMath problems.

OpenAI’s Chief Research Officer, Mark Chen, said at the time that all previously released models scored less than 2% on FrontierMath, and that o3 could achieve over 25% by applying aggressive test-time compute settings.

However, Epoch AI, the organization behind FrontierMath, put o3 through its paces in an independent benchmark test. The result: o3 scored a mere 10%, well below OpenAI’s touted figure.

This doesn’t necessarily mean OpenAI was being dishonest. Their December benchmark results included a lower bound that aligns with Epoch’s findings. Epoch also acknowledged potential differences in testing conditions and noted their use of an updated FrontierMath release for evaluation.

OpenAI’s technical staff member Wenda Zhou explained that the production version of o3 has been fine-tuned for real-world applications and boasts improved speed compared to the December demo. This optimization could account for the benchmark discrepancies.

Zhou noted that the model has been optimized to boost cost-effectiveness and overall utility, expressing confidence that this iteration represents a significant improvement.

The ARC Prize Foundation tested o3’s pre-release version and noted that the public o3 model has been tailored for chat and product use, distinguishing it from other models.

This controversy highlights a growing concern in the AI industry regarding benchmark reliability. Recently, Elon Musk’s xAI faced accusations of manipulating benchmark data for its AI model, Grok 3. Meta also stirred up controversy by releasing results that differed from internal tests.
