The Benchmark That Tried to Hire AI—and Mostly Sent It Home
Yesterday, a new kind of interview took place. Not a panel grilling a candidate, but a market testing a claim. The Center for AI Safety and Scale AI opened the door to a slate of agents and asked them to do paid work—real, messy, client-grade projects. They called the experiment the Remote Labor Index, and its question was brutally simple: if an AI submits a deliverable, would a client accept it and pay?
From leaderboards to invoices
The Remote Labor Index, or RLI, doesn’t play the usual game of contrived tasks and clean metrics. It assembles a catalog of freelance-style projects across product design, game development, architecture, data analysis, and video animation (more than 6,000 documented human hours and over $140,000 in market-valued work) and judges agents against the same acceptance bar a contractor would face. The premise is disarmingly concrete: not whether the model can solve a puzzle, but whether it can ship a deliverable that a real client would sign off on. The full setup is public, from the project set to the methodology, and the paper lays out the rules of engagement in dry detail (arXiv, remotelabor.ai).
The number that matters
Across the board, the agents struggled. Manus, the top performer, automated just 2.5% of the available work end-to-end at an acceptance-worthy level; other frontier systems (including Grok, Anthropic’s Claude Sonnet, OpenAI’s ChatGPT agent, and Google’s Gemini) clustered below that already thin slice. The authors don’t bury the verdict: current agent performance sits near the floor. Independent coverage echoed the finding in the language that matters to labor markets, noting that agents captured only a tiny fraction of the fees on offer.
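For readers who want the arithmetic behind that figure, here is a minimal sketch of one common way a value-weighted automation rate can be computed: the share of total project fees whose deliverables cleared acceptance. The project names, dollar amounts, and field names below are invented for illustration and are not drawn from the RLI catalog; the paper defines its own metric, and this only conveys the general shape of the calculation.

```python
# Illustrative sketch only: hypothetical projects, not RLI data.
# Computes the share of total market value whose deliverables were accepted.

from dataclasses import dataclass

@dataclass
class Project:
    name: str
    market_value_usd: float  # fee a client actually paid for the human-made version
    accepted: bool           # did the agent's deliverable clear the acceptance bar?

def automation_rate(projects: list[Project]) -> float:
    """Fraction of total project value delivered at an accepted level."""
    total = sum(p.market_value_usd for p in projects)
    earned = sum(p.market_value_usd for p in projects if p.accepted)
    return earned / total if total else 0.0

# Toy catalog: one small accepted job amid larger rejected ones.
catalog = [
    Project("logo refresh", 300.0, accepted=True),
    Project("game prototype", 4_500.0, accepted=False),
    Project("data dashboard", 2_200.0, accepted=False),
    Project("explainer animation", 5_000.0, accepted=False),
]

print(f"automation rate: {automation_rate(catalog):.1%}")  # prints 2.5%
```

Whether the official metric weights by value or by project count, the gating condition is the same: a deliverable either clears the client-grade acceptance bar or it earns nothing.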
Why this lands differently
RLI reframes the automation debate from “can models do X in principle?” to “will someone pay for what they produce?” That shift exposes where today’s agents actually break: not on single-shot competence, but on the unglamorous seams of real work—setup friction, toolchains, data hygiene, ambiguous instructions, revision cycles, and the last 10% that turns a draft into a deliverable. An agent that dazzles on a benchmark can still flinch at a client brief that mixes Figma, Blender, spreadsheet quirks, and a half-specified brand voice. RLI quantifies that gap in units that CFOs respect.
The employment signal amid the noise
For employers, the message is practical. If you were budgeting for autonomous software colleagues to displace broad swaths of remote white-collar roles in 2025, RLI suggests you’re early. Tool-augmented humans remain the profit center; fully agentic replacements remain boutique. For workers and policymakers, the near-term displacement picture narrows: substitution is likely to be task-specific and opportunistic rather than sweeping. The result doesn’t absolve us from preparing for rapid change—progress is measurable, and RLI was designed as a time series—but it does anchor the conversation in evidence instead of projections.
The edges of the map
RLI is not the whole economy. It focuses on remote, freelance-style knowledge work and does not cover tightly engineered enterprise processes where inputs, tools, and acceptance criteria are standardized. Those corridors may be more hospitable to agents, especially when wrapped in guardrails and preintegrated systems. And agent frameworks evolve quickly; orchestration, memory, and tool-use scaffolds are changing month to month. The authors acknowledge this and built RLI to be rerun as the systems improve. Today’s 2.5% is a snapshot, not a ceiling.
What would have to change to bend the curve
To move from 2.5% to 25%, agents need less “know-how” and more “ship-how.” That means reliability across long horizons, robust recovery from tool and API failures, better adherence to evolving specs, stronger judgment about when to ask for clarification, and native fluency with heterogeneous assets—files, repos, design systems—without human handholding. It also means aligning incentives: acceptance-based evaluation pushes research toward closing the last-mile gap rather than just inflating scores on synthetic tasks.
The narrative reset
RLI doesn’t say agents are useless; it says they don’t yet clear the bar that turns a demo into a paycheck. That’s a useful correction in a year saturated with videos of hands-free systems clicking through browsers and deploying code. The takeaway is less dramatic but more consequential: today’s impact comes from weaving AI into human workflows, not swapping humans out. If that changes, RLI is positioned to be the ledger that records it—in revenue terms, not just rhetoric.
In the meantime, the economy looks the way it often does at the start of a technological shift: productivity rises where humans learn to aim the tools, not where the tools run the shop. Yesterday’s benchmark didn’t end the debate about agentic work. It did something rarer—it gave the debate a denominator.

