OpenAI's New Benchmark Shows AI Does Knowledge Work 100X Faster and Cheaper Than Experts

Written by Mike Kaput | Sep 30, 2025 12:58:33 PM

For years, the gold standard for measuring AI progress has been challenging academic tests and abstract puzzles. But the real question has always been: Can AI do the actual work people get paid for?

OpenAI is attempting to answer that question with the launch of its new evaluation framework, GDPval, and the results are a wake-up call for every knowledge worker and business leader.

According to the blind evaluations run by industry experts, today’s best models—like GPT-5 and Claude Opus 4.1—are already producing work rated as equal to or better than human output nearly half the time. This framework, which measures performance across 44 knowledge work occupations, is the kind of real-world assessment that AI has desperately needed.

To unpack this new evaluation framework’s significance, I spoke to SmarterX and Marketing AI Institute founder and CEO Paul Roetzer on Episode 170 of The Artificial Intelligence Show.

Why GDPval Is the Real-World Test That Matters

At its core, GDPval is a real-world test of whether AI can do economically valuable knowledge work. Unlike traditional benchmarks that rely on simple text prompts or exam-style questions, GDPval is built on real-world deliverables and contexts:

The evaluation spans 1,320 specialized tasks, all based on real work products like legal briefs, engineering blueprints, customer support conversations, and nursing care plans.
Every task was meticulously crafted by subject matter experts with over a decade of experience, who then served as the blind graders. They compared the human- and AI-generated deliverables without knowing which was which, offering critiques and rankings.
The tasks are not simple text prompts; they include reference files and context, with expected deliverables spanning documents, slides, diagrams, spreadsheets, and multimedia.
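To make the "equal to or better than human output" figure concrete, here is a minimal sketch of how a win-or-tie rate could be tallied from blind grades. It is an illustration only, not OpenAI's scoring code: the `verdicts` list, the "better" / "tie" / "worse" labels, and the `win_or_tie_rate` helper are all hypothetical, and the sample values are made up rather than taken from GDPval.

```python
from collections import Counter

# Hypothetical blind-grading verdicts (NOT GDPval data): each entry is one
# grader's rating of the AI deliverable relative to the human expert's
# deliverable for a single task.
verdicts = ["better", "worse", "tie", "worse", "better", "worse", "tie", "worse"]


def win_or_tie_rate(verdicts):
    """Return the share of tasks where the AI deliverable was rated equal or better."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    counts = Counter(verdicts)
    # Counter returns 0 for labels that never appear, so missing labels are safe.
    return (counts["better"] + counts["tie"]) / len(verdicts)


rate = win_or_tie_rate(verdicts)
print(f"AI rated equal to or better than the expert on {rate:.0%} of tasks")
```

With these made-up verdicts, four of the eight tasks count as a win or tie. The useful part is the shape of the calculation: graders never know which deliverable came from the model, and ties count in the model's favor.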

This focus on the reality of work is critical.

“The thing we’ve talked about for a while is that the IQ tests [in traditional AI evaluations] were saturated,” Roetzer says. “What we really needed to understand was the implications on actual work. People do the tasks that are part of these jobs.”

And, if GDPval is any indication, AI is getting very good at the tasks that people do as part of their jobs.

100X Faster and 100X Cheaper

OpenAI's research found that frontier models can complete the GDPval tasks approximately 100 times faster and 100 times cheaper than human industry experts.

Roetzer emphasized the significance of this finding, especially considering the comparison point: these are industry experts, not just average workers. We are already at the point where it seems that giving some of these tasks to an AI model instead of a human would save both time and money.

That’s going to have some disruptive effects on the economy as we know it. The occupations selected for the study were those contributing most to total wages and compensation in the nine industries that each contribute over 5% of US GDP.

This deliberate focus parallels the strategy of AI labs and VCs looking at the "total addressable market of salaries" to determine which markets can be most disrupted by AI technology.

In other words, GDPval is not only an evaluation framework, but also a roadmap that points to exactly which knowledge work jobs AI might disrupt.

2026 as the Year AI Starts to Overtake Humans

The GDPval results are a current snapshot, but one computer scientist and AI researcher—Julian Schrittwieser, a key player in the development of Google's AlphaGo and AlphaZero—issued a clear warning about the pace of future progress.

In a widely shared post, Schrittwieser cautioned against the trap of concluding that AI is plateauing just because it makes occasional mistakes. Extrapolating the consistent trend of exponential performance improvement, he predicts that 2026 will be a pivotal year for widespread integration of AI into the economy:

By mid-2026, he says models will be able to work autonomously for full eight-hour work days.
By the end of 2026, at least one model will match the performance of human experts across many industries.
And by the end of 2027, models will frequently outperform experts on many tasks.

This sober assessment, that “extrapolating straight lines on graphs is likely to give you a better model of the future than most experts,” is why economists are starting to sound the alarm.

A new research paper from experts at Stanford is already recommending a research agenda to address the impact of "transformative AI" on economic growth, income distribution, and human wellbeing.

Why You Can’t Afford to Have Blindspots

This confluence of evidence (GDPval’s current proof of expert-level capability and the conservative timeline for AGI) means no one can afford to remain skeptical.

The conversation is shifting from "AI doesn't really do anything" to the realization that it's getting really good at all the things you do. OpenAI says its goal is to keep everyone on the "up elevator" of AI by democratizing access and supporting workers through change.

But the challenge is that the most direct proof of AI’s impact is personal adoption.

As Roetzer concluded, when you stop to look at the tasks that make up your job, you can see the change happening. The light bulb moment, where people realize how incredibly helpful and efficient the tools are when applied to their everyday work, is the moment the economy truly begins to transform before all our eyes.

But if you don’t use the tools enough to reach that point, you risk developing some serious blindspots when it comes to AI’s impact on your career.
