A new paper from the Center for AI Safety and Scale AI has introduced the Remote Labor Index (RLI), the first benchmark designed to measure how well AI agents can perform paid, remote jobs.
The RLI benchmark includes real-world projects from freelance platforms, spanning complex fields such as game development, architecture, data analysis, and video production. These aren’t simple tasks: The projects represented over 6,000 hours of human work valued at more than $140,000.
The results? Current AI agents performed poorly.
Manus, the top-performing agent, could only automate 2.5 percent of the work. Other top models, such as Grok 4 and Claude Sonnet 4.5, managed just 2.1 percent, while GPT-5 hit 1.7 percent and Gemini 2.5 Pro came in under 1 percent. The researchers noted that failures stemmed from incomplete deliverables, broken files, and low-quality work that wouldn't meet professional standards.
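To make the headline numbers concrete, here is a rough sketch of how a value-weighted automation rate could be computed over a set of freelance projects. The project values and pass/fail flags below are invented for illustration, and this is a simplification, not the RLI paper's actual scoring pipeline (which relies on human review of deliverables):

```python
# Sketch: a value-weighted automation rate over freelance projects.
# All data here is made up for illustration purposes.
projects = [
    {"value_usd": 1200, "automated": True},   # agent's deliverable accepted
    {"value_usd": 4500, "automated": False},  # broken files, rejected
    {"value_usd": 800,  "automated": False},  # incomplete deliverable
    {"value_usd": 2500, "automated": False},  # below professional standards
]

total_value = sum(p["value_usd"] for p in projects)
automated_value = sum(p["value_usd"] for p in projects if p["automated"])
automation_rate = automated_value / total_value

print(f"Automation rate: {automation_rate:.1%}")  # Automation rate: 13.3%
```

Under this kind of scoring, an agent only gets credit for work a client would actually accept and pay for, which is why incomplete or broken deliverables drag the rates so low.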
While these low numbers might seem reassuring to human workers, they don't tell the whole story. To understand what these findings really mean for the future of AI in the workforce, I discussed them with SmarterX and Marketing AI Institute founder and CEO Paul Roetzer on Episode 178 of The Artificial Intelligence Show.
Roetzer wasn’t surprised by the low automation rates, noting that the benchmark tests general agents that aren't specifically trained for these complex jobs.
The real and much faster progress, Roetzer argues, is happening with specialized agents. He points to reports that OpenAI has hired Goldman Sachs bankers to train models to do the job of an investment banker.
"My guess is OpenAI's is way further along than 2.5 percent for that specific thing," he says.
This highlights a crucial distinction in how we should think about AI's capabilities. The RLI provides a valuable baseline for general models, but the true economic impact will likely come from models intensely focused on a specific job.
Roetzer explains this using a simple framework: tasks, projects, and jobs.
Right now, AI is very good at the task level, which includes the small, discrete activities that make up a larger project.
"It’s good at the tasks," he says. "It's not good at doing the full thing."
An agent can't replace a CEO, for example, but it might help with 25 different tasks that a CEO does every month. Humans, however, are still essential for setting goals, planning, connecting data sources, integrating tools, and, most importantly, overseeing and verifying the AI output.
The key metric to watch, according to Roetzer, is how long an agent can work without a human needing to intervene, a concept he calls "actions per disengagement," similar to how Tesla measures its self-driving system by the miles it travels between disengagements.
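The metric itself is simple to compute. Here is a minimal sketch, assuming a hypothetical event log in which each entry is either an agent action or a human intervention (the log format and values are made up, not from any real agent framework):

```python
# Sketch: "actions per disengagement" — the average number of agent actions
# completed between human interventions. The event log is invented.
log = [
    "action", "action", "action", "human_intervention",
    "action", "action", "human_intervention",
    "action",
]

actions = log.count("action")
disengagements = log.count("human_intervention")

# Higher is better: the agent works longer before a human must step in.
# max(..., 1) avoids division by zero for a fully autonomous run.
actions_per_disengagement = actions / max(disengagements, 1)
print(actions_per_disengagement)  # 3.0
```

As agents improve, this number should climb: more useful work completed per human correction, mirroring how self-driving progress is tracked by miles between disengagements.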
We haven't yet reached what he calls the "economic Turing test," where the economic labor of AI is indistinguishable from that of a human.
"Is it to the point where I would hire an agent or a symphony of agents instead of a human?" he asks. "In every instance I can think of, the answer is still no."
However, agents are slowly but surely getting better, more autonomous, and more reliable within specific jobs. And even the augmentation of people with AI agents may lead to a reduction in the number of people needed, says Roetzer.
"As the agents get more autonomous, as they get more reliable, as more companies understand how to build and integrate them into workflows, you don't need as many people doing the work that you previously did.”