The Joule Index: a benchmark for what intelligence costs

Blankline Research

Today we are releasing the Joule Index, the first AI benchmark that publishes the dollar cost, joule cost, and human merge-readiness of frontier coding agents on the same chart. Our preview release covers eight agent tiers from four vendors; five of them produce the same merged diff on three real open-source bugs at a twelve-and-a-half-fold cost spread.

Blankline Research|May 18, 2026|ai capabilities

There are forty-seven publicly maintained AI benchmarks in active use as of May 2026. None of them tells you what its numbers cost. They publish accuracy percentages, pass-at-k rates, and Elo scores. They do not publish the dollar figure behind the percentage, or the kilojoules burned to produce it, or whether a maintainer would have accepted the code the agent wrote. For a field whose product is intelligence, the absence of those columns is the single most consequential gap in the literature.

We built the Joule Index around the idea that you cannot meaningfully evaluate a coding agent without measuring all four: accuracy, dollar cost, energy cost, and merge-readiness. Our first preview release scores eight agent tiers from four vendors on three real open-source bug fixes filed and merged in May 2026. The harness is open. The traces are public. Anyone with an inference API key can repeat the runs and verify our numbers.

Our headline result is straightforward, and slightly uncomfortable for the AI procurement market. On the three bug fixes in question, five of the eight tiers produced a diff that matched the human maintainer's merged pull request exactly. Among those five, the cheapest cost eight cents per task and the most expensive cost a dollar and three cents. Buyers of the flagship paid roughly twelve and a half times as much for an identical engineering output.

Agent tier (V0.1 verified, mean of n=3)	Cost per task	Energy per task	Attention F1	Cost vs cheapest
Dropstone Fast	$0.082	224 J	1.000	1.0×
Claude Haiku 4.5	$0.318	146 J	1.000	3.9×
Dropstone Pro	$0.362	233 J	1.000	4.4×
Dropstone Heavy	$0.857	1,693 J	1.000	10.4×
Claude Opus 4.7	$1.025	511 J	1.000	12.5×

The merged diff was the same. The maintainer would have accepted any of them. The only thing the procurement team would have noticed is the bill. Our full leaderboard, with every cell auditable down to the individual API call, is at joule.blankline.org/leaderboard.

Why the cheap tier is cheap and the expensive tier is expensive

The naive interpretation of a twelve-fold cost gap is that the expensive model is twelve times better at thinking. The data does not support that interpretation. All five of the tiers above produced the same merged diff. The capability ceiling, on the kinds of bug fixes a senior open-source maintainer treats as routine, has flattened across vendors.

What has not flattened is the inference architecture behind those models. A coding agent typically issues between ten and sixty API calls per task. Each call carries the growing conversation history as input. If the inference provider supports prompt caching, the repeated context is discounted, often by a factor of ten. If the provider does not, every turn rebills the entire context at full token price and full energy.

Pulled directly from each vendor's own billing record, our V0.1 dataset shows the following cache-read rates on multi-turn agent loops:

Dropstone Fast: 66% of input tokens served from prompt cache
Dropstone Pro: 79% of input tokens served from prompt cache
Claude Haiku 4.5: 95% of input tokens served from prompt cache
Dropstone Heavy: 0% (the underlying provider does not offer prompt caching)

That single architectural detail is the reason Dropstone Heavy costs ten times more per task than Dropstone Fast. The two tiers are not ten times apart in capability on these tasks. What separates them is cache support.
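To make the caching mechanics concrete, here is a minimal sketch of how the input side of an agent loop's bill responds to the cache-read rate. It is an illustration, not the harness code. The function names are hypothetical, the 0.1 cached-price ratio is an assumption taken from the "factor of ten" discount mentioned above, and real vendors differ in per-token prices, cache-write fees, and output pricing, none of which this sketch models.

```python
def total_input_tokens(turns: int, base_tokens: int, growth_per_turn: int) -> int:
    """Input tokens billed over a loop in which every call resends the full history.

    Turn i carries base_tokens + i * growth_per_turn input tokens.
    """
    if turns < 0 or base_tokens < 0 or growth_per_turn < 0:
        raise ValueError("arguments must be non-negative")
    return sum(base_tokens + i * growth_per_turn for i in range(turns))


def relative_input_bill(cache_read_fraction: float,
                        cached_price_ratio: float = 0.1) -> float:
    """Input bill as a fraction of the uncached bill.

    cache_read_fraction: share of input tokens served from prompt cache (0..1).
    cached_price_ratio: price of a cached token relative to a fresh one
                        (assumed 0.1 here; not a vendor-confirmed figure).
    """
    if not 0.0 <= cache_read_fraction <= 1.0:
        raise ValueError("cache_read_fraction must be between 0 and 1")
    fresh_fraction = 1.0 - cache_read_fraction
    return fresh_fraction + cache_read_fraction * cached_price_ratio


if __name__ == "__main__":
    # Cache-read rates reported in the V0.1 dataset.
    for tier, rate in [("Dropstone Fast", 0.66), ("Dropstone Pro", 0.79),
                       ("Claude Haiku 4.5", 0.95), ("Dropstone Heavy", 0.0)]:
        print(f"{tier}: {relative_input_bill(rate):.3f} of the uncached input bill")
```

The first function shows why the effect compounds: with history resent on every call, total input tokens grow roughly with the square of the number of turns, so the share of those tokens that is cached matters more the longer the loop runs.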

The implication for any team selecting a model for an autonomous agent workflow is procedural rather than mysterious. Before reading a benchmark, ask the vendor what fraction of input tokens get served from prompt cache on a typical multi-turn agent loop. If the answer is a low number, the agent will be expensive in production regardless of how well the model scores on a single-shot benchmark.

This finding is structural, not anecdotal, and it surfaces only because our benchmark records both the bill and the engineering output. A leaderboard that publishes accuracy without cost would have ranked these five tiers as essentially tied. A leaderboard that publishes accuracy and cost shows that the most expensive of them costs twelve and a half times as much as the cheapest for the same merged code.

The composite score and the two flagships

We summarise the cost, energy, and engineering-output axes in a single composite called the Joule Score, defined as Attention F1 divided by one plus half the dollar cost and half the kilojoule cost. The score is bounded between zero and one, with one being the unreachable ideal of a perfect, free, energy-free agent. The formula is published in the methodology and computed identically by the harness and by any third-party auditor.
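Read literally, the definition above is F1 / (1 + 0.5 × dollars + 0.5 × kilojoules). The sketch below implements that reading; the function names are ours for illustration, and the published methodology remains the authoritative statement of the formula. Because the leaderboard reports the mean of per-task scores, feeding mean cost and mean energy into the formula will not necessarily reproduce a published score.

```python
def joule_score(attention_f1: float, dollars: float, kilojoules: float) -> float:
    """Composite score under our reading of the prose definition.

    attention_f1: file-level F1 in [0, 1].
    dollars:      API cost of the task in US dollars.
    kilojoules:   estimated inference energy of the task in kJ (not joules).
    """
    if not 0.0 <= attention_f1 <= 1.0:
        raise ValueError("attention_f1 must be between 0 and 1")
    if dollars < 0 or kilojoules < 0:
        raise ValueError("cost and energy must be non-negative")
    return attention_f1 / (1.0 + 0.5 * dollars + 0.5 * kilojoules)


def mean_joule_score(runs: list[tuple[float, float, float]]) -> float:
    """Average the per-task scores; each run is (attention_f1, dollars, kilojoules)."""
    if not runs:
        raise ValueError("at least one run is required")
    return sum(joule_score(f1, usd, kj) for f1, usd, kj in runs) / len(runs)
```

A perfect, free, energy-free run scores exactly 1.0, and any positive cost or energy pulls the score below the Attention F1, which is what keeps the composite between zero and one.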

On the five perfect-F1 tiers in this preview, the ranking is:

Tier	Joule Score (mean across n=3 tasks)
Dropstone Fast	0.88
Claude Haiku 4.5	0.83
Dropstone Pro	0.78
Claude Opus 4.7	0.70
Dropstone Heavy	0.56

One detail is worth lingering on. Claude Opus 4.7, the most expensive tier in dollars, scores higher than Dropstone Heavy on the composite, even though Heavy costs less than Opus per task. The reason is that the Joule Score weights kilojoules as heavily as dollars, and Heavy's complete absence of prompt caching makes it the worst tier on the energy axis in the entire V0.1 dataset. The most expensive tier in dollars outranks the most expensive tier in joules because the former at least caches.

How the benchmark is built

We preregistered the Joule Index, modelled it on MLPerf Power, and operate it under a mandatory disclosure rule we call Verified disclosure. Every leaderboard entry carries one of two tags. A Verified entry comes with the full observational trace: every tool call the agent made, every file it read, every billed token, and the final git diff. An Unverified entry reports a score without a trace and is ranked below all Verified entries regardless of headline number. Source code, system prompts, and internal reasoning are never required. We draw the line at what is necessary to audit the result, not at what is necessary to reproduce the agent. It is the same line MLPerf draws for chip vendors, and it is the source of MLPerf's authority in the inference-benchmark literature.
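The ordering rule is simple enough to express in a few lines. The sketch below is an illustration of the rule as stated above, not the leaderboard's actual code; the `Entry` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    score: float      # headline score, higher is better
    verified: bool    # True when the full observational trace is published


def rank(entries: list[Entry]) -> list[Entry]:
    """Order entries so every Verified entry precedes every Unverified one.

    Within each group, entries are sorted by descending score.
    """
    return sorted(entries, key=lambda e: (not e.verified, -e.score))


if __name__ == "__main__":
    board = rank([
        Entry("unverified-high", 0.95, verified=False),
        Entry("verified-low", 0.50, verified=True),
        Entry("verified-mid", 0.70, verified=True),
    ])
    for position, entry in enumerate(board, start=1):
        print(position, entry.name, entry.score)
```

Sorting on the pair (not verified, negative score) means no headline number, however high, can lift an Unverified entry above a Verified one.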

We draw tasks from real GitHub issues that were filed, and whose fixes were merged, within the last thirty days on permissively licensed repositories with at least five hundred stars. The V0.1 preview uses three:

DIYgod/RSSHub#21484, a Likeshop API route bug in a forty-four-thousand-star RSS aggregator used in production by tens of thousands of self-hosters.
DIYgod/RSSHub#21604, a wechat2rss item-parsing fix in the same project.
common-voice/common-voice#5340, an eight-file refactor of Mozilla's Common Voice dataset bundler.

The ground truth for each task is the file set the maintainer actually merged, not a labeler's opinion or a synthetic test. We prevent contamination by construction: every selected issue was filed after the training cutoff of every model we evaluated against it. The tasks rotate monthly. Whatever the agents memorise this month, we retire next month.
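The selection and contamination rules above amount to a date-and-metadata filter. A minimal sketch follows, assuming a hypothetical `Candidate` record and an illustrative set of permissive licence identifiers; the harness may encode these checks differently.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Illustrative set only; the methodology defines which licences qualify.
PERMISSIVE_LICENSES = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC"}


@dataclass(frozen=True)
class Candidate:
    repo: str
    stars: int
    license_id: str
    filed: date    # date the issue was filed
    merged: date   # date the fix was merged


def is_eligible(task: Candidate, today: date, model_cutoffs: list[date],
                window_days: int = 30, min_stars: int = 500) -> bool:
    """Apply the selection rules described in the text to one candidate task."""
    window_start = today - timedelta(days=window_days)
    recent = window_start <= task.filed <= task.merged <= today
    popular = task.stars >= min_stars
    permissive = task.license_id in PERMISSIVE_LICENSES
    # Contamination check: filed strictly after every evaluated model's cutoff.
    uncontaminated = all(task.filed > cutoff for cutoff in model_cutoffs)
    return recent and popular and permissive and uncontaminated
```

The contamination check is a strict inequality on purpose: an issue filed on the cutoff date itself is treated as potentially seen.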

We report four numbers on each task. Attention F1 captures whether the agent touched the same files the maintainer did. Dollars per merge-ready PR is the procurement number. Joules per merge-ready PR is the climate number, computed from billed token counts and published per-token energy rates with a cache-aware adjustment that charges cache-read tokens at fifteen percent of fresh-input energy. Accessibility translates the dollar cost into how many bug fixes a median-income human in the world could afford in a day at this model's price. All four numbers appear on a single Pareto chart so they can be read by a CFO, a climate scientist, a procurement officer, and someone earning ten dollars a day, without any of them having to consult a different page.
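Two of these numbers reduce to short calculations. The sketch below assumes Attention F1 is the standard F1 over the set of files touched, which is our reading of the description above, and it takes per-token energy rates as caller-supplied parameters because the published rates are not reproduced here. Only the fifteen percent cache-read factor comes from the text; the function and parameter names are hypothetical.

```python
CACHE_READ_ENERGY_RATIO = 0.15  # cache-read tokens cost 15% of fresh-input energy


def attention_f1(agent_files: set[str], merged_files: set[str]) -> float:
    """F1 between the files the agent touched and the files the maintainer merged."""
    if not agent_files and not merged_files:
        return 1.0  # convention chosen for this sketch: two empty sets agree
    overlap = len(agent_files & merged_files)
    if overlap == 0:
        return 0.0
    precision = overlap / len(agent_files)
    recall = overlap / len(merged_files)
    return 2 * precision * recall / (precision + recall)


def estimate_joules(fresh_input_tokens: int, cache_read_tokens: int,
                    output_tokens: int, joules_per_input_token: float,
                    joules_per_output_token: float) -> float:
    """Token-count energy estimate with the cache-aware adjustment."""
    effective_input = fresh_input_tokens + CACHE_READ_ENERGY_RATIO * cache_read_tokens
    return (effective_input * joules_per_input_token
            + output_tokens * joules_per_output_token)
```

An agent that edits every merged file plus one extra keeps perfect recall but loses precision, so it scores below 1.0 even though nothing the maintainer changed was missed.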

A task that failed the benchmark's own quality controls

During V0 evaluation, one of our four candidate tasks did not pass our two-reviewer agreement rule. Task joule-003, derived from AmintaCCCP/GithubStarsManager#89, had an alignment problem between the PR description and the merged diff. The description, auto-generated in Chinese, promised backend-proxy changes the merged commit did not contain. Every agent that followed the prompt instructions was penalised in scoring for delivering what the prompt described rather than what the maintainer eventually merged. Per methodology section 3.3, tasks with inter-rater disagreement are discarded rather than adjudicated. We retired the task. Its data is preserved on disk for full audit, and the retirement is logged on our public results page at joule.blankline.org/runs.

We treat this not as a confession but as a definition. A benchmark that documents its own retired tasks transparently is more trustworthy than one that pretends the leaderboard is perfect. Two prior benchmarks, ARC-AGI and SWE-bench, have built credibility partly by handling task retirement this way. Most others do not.

Stakes, at scale

A bug fix is a unit of human productive labour. The kind of fix the agents performed in this dataset would have taken a senior engineer between two and four hours at a market-rate global wage of roughly one hundred dollars an hour. The agents completed each fix in under thirty minutes for between eight cents and a dollar and three cents in API costs, and between one hundred and forty-six and sixteen hundred and ninety-three joules of inference energy.

Those numbers are small. They scale.

There are roughly twenty-seven million professional software developers in the world in 2026. If each one replaced a single hour of engineering work per day with an agent on the most expensive verified tier, the energy differential against the cheapest tier would be on the order of thirty megawatt-hours per day. That is the daily output of a small wind farm, paid every day, in perpetuity, for the privilege of using the flagship. The same work, at the same standard of merge-readiness, on a different tier, would cost one tenth that.

The price story has a similar shape. At a median global daily income of ten dollars, the cheapest verified tier allows the median human roughly one hundred and twenty-five agent-assisted bug fixes a day. The most expensive verified tier allows eleven. Civilisation scales on whatever it can afford to compute. We built the Joule Index to measure the bottom of that pyramid: the price of admission to participating in the next decade of cognitive labour.

V0.1 limitations, declared openly

This preview reports five perfect-F1 tiers across three tasks, fifteen runs in total, plus one retired task. Our methodology requires a minimum of thirty tasks per model and category for definitive claims; the numbers above are indicative, not statistically conclusive. V1, which we plan to publish later in 2026, will cover thirty tasks across the leading frontier model families. We plan to evaluate the latest releases from Anthropic, OpenAI, Google DeepMind, xAI, and Meta, together with the latest open-weight frontier models from the leading laboratories in China and the United States. The same Verified-disclosure rule applies to every entry.

Three other elements are explicitly on our V1 roadmap rather than this preview. We will replace token-count energy estimation with direct GPU power measurement on open-weight runs for the workloads where it is feasible. We will publish First-Review Merge Rate, the rate at which a real external maintainer would have accepted the agent's PR on the live issue, once our maintainer recruitment programme matures. And the contamination-defence layer that programmatically mutates symbols inside the task workspace, described in methodology section 5, will graduate from optional to standard.

For full disclosure, the V0.1 Dropstone tiers map to the following underlying inference models: Fast and Pro are served from the DeepSeek V4 family; Heavy is served by Kimi K2.6. Tier composition is selected for the cost, energy, and merge-readiness profile each tier targets, and may change in subsequent releases as new frontier weights become available. The mapping is recorded in each run's trace file.

A note on sequencing

This release goes out ahead of our own product. Dropstone CLI, the reference coding agent used to validate the V0.1 harness, is scheduled for public launch later this month and is not yet generally available. Blankline's Chief Executive Officer, Santosh Arron, scheduled this report ahead of that launch deliberately. He wanted the chart on the open record before any Blankline product could be perceived as a beneficiary of its findings. When Dropstone CLI does ship publicly, the same Verified disclosure rule that applies to every other entry on the leaderboard will apply to it as well.

The evaluation harness itself, by contrast, is open today. It is the thing any third party needs in order to re-run a verified entry. Anyone can plug in their own coding agent against the harness and submit a Verified score.

Submissions are open

Our harness is open source. Our methodology is preregistered at joule.blankline.org/methodology. Our traces are public at joule.blankline.org/runs. Any laboratory that wants its model on the leaderboard can run our harness on our tasks, publish the resulting trace, and receive a Verified entry. A laboratory that prefers to report a score without disclosure can do that too, and will rank below every Verified entry regardless of the headline number. The full submission process is at joule.blankline.org/submit.

We invite Anthropic, OpenAI, Google DeepMind, xAI, Meta, Mistral, DeepSeek, Moonshot, Alibaba, and every other frontier laboratory to participate on the same terms that apply to our own runs. The column that records who has and who has not is, in our view, the most consequential column on the leaderboard. Procurement teams will see it. Reporters will see it. Researchers will see it.

A model that refuses verification is, in the end, a model that does not want to be seen.

The Joule Index V0.1 was authored and operated by the Blankline Research Team. Reference agent for V0.1: Dropstone CLI, scheduled for public launch later in May 2026. Hosted at joule.blankline.org. Harness licensed Apache-2.0. Data and methodology licensed CC-BY 4.0. Citation: Blankline Research Team (2026). The Joule Index: an auditable benchmark for AI cost, energy and merge-readiness. https://joule.blankline.org