The missing column
Alignment isn't really on the scoreboard
TL;DR
Every frontier model release leads with the same or very similar benchmarks. None of them tell you whether the model is likely to lie to you or on your behalf. None of them tell you if the model will try to cheat, sandbag on your request, or act shady/Machiavellian in general.
Alignment evaluations seem to exist, but they’re not treated as first-class information. They're hard to compare across models and labs. There is no canonical alignment number for Opus 4.7, GPT-5.5, or Gemini 3.1 Pro that I could find.
Everyone should care about this number, not only the AI-risk crowd. It’s a short-term problem for current users too. “Will this model lie about whether the test passed? Will it pretend a function exists because admitting it doesn’t is inconvenient? Will this agent act shady on my behalf? How likely is it to commit a crime?”
Putting an easy-to-digest alignment number as a featured item in the model announcement threads/blog posts creates three important side effects: developers notice they should worry about it, academics race to build better versions of the benchmark, and labs start competing on the metric.
Even a bad first benchmark is useful. Publishing an imperfect one is how you create the incentive for someone to build a better one.
Model releases are happening so frequently now, and I lose my mind every time. Did you see its score on SWE-bench Pro?? Damn, Terminal-Bench has gotten so much better too. It's gg wp for my job now…or maybe the end of the world.
The number I don’t look at, and hadn’t stopped to think about how important it was until it hit me today: an easy-to-digest alignment eval. How likely is the model to lie to me? To say my code is great just to make me happy? And now that autonomous agents are actually coming, how likely is my agent to act shady on my behalf, commit some kind of cybercrime while I’m asleep, or hit on my spouse? Forget coding benchmarks, where’s the Ten Commandments one?
These are the same kind of questions SWE-bench is asking: how good is this thing at the job I’m paying it to do? “Doesn’t deceive me” is part of the job. “Doesn’t commit crimes on my behalf” really is too. “Doesn’t covet thy neighbor”…sure, let's add that one too.
I suspect users would change their behavior if this metric existed. I'd choose a lower-capability but more trustworthy model, especially when it's running autonomously.
Funnily enough, when I bounced this idea around with Claude, Opus 4.7 tried to convince me I was tripping and that these numbers already existed. When I pushed for them: "Honest answer: the data is patchy and mostly out of date, which actually strengthens your original complaint more than it weakens it."
Claude was right on both counts. The evals do exist (MASK for honesty, Apollo's scheming evaluations, METR's autonomy work, older ones like TruthfulQA and MACHIAVELLI); they're just not where the competitive pressure lives. Look at Anthropic's own Opus 4.7 announcement: the headline benchmark table compares Opus against GPT-5.4 and Gemini 3.1 Pro on SWE-bench, MCP-Atlas, Terminal-Bench. The alignment chart is further down, and it only compares Anthropic models against other Anthropic models. No cross-lab number. No column in the comparison table. That's the problem: not that the evals are necessarily hidden, but that they're cut off from the part of the page where labs actually seem to compete.
An obvious objection is that this stuff is too fuzzy to benchmark cleanly, which is probably true. It might be harder than coding…but once we agree that people should care about it, "it's hard" is not much of an argument. I'm sure building SWE-bench, Humanity's Last Exam, and every other useful benchmark was a hell of a lot of work too.
And fine, maybe just one single number is too much to ask. Alignment isn't one thing…sycophancy, scheming, adultery, honesty under pressure, and "will my agent commit a crime on Tuesday" are different failure modes and probably shouldn't be averaged into a single score. But come on, let's start somewhere. Average the unwanted-behavior evals, put an asterisk next to it, link the breakdown in a footnote for the people who want to argue about how bad it is. "We couldn't agree on the perfect aggregation" is not a reason to leave the column blank, it's a reason to ship the first version and let everyone fight about the second one in public.
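To be concrete about what "ship the first version" could even mean, here's a rough sketch in Python. Everything in it is made up: the behavior categories, the scores, and the choice of a plain unweighted mean are placeholder assumptions, not anyone's published methodology or results.

```python
# Hypothetical sketch of a v1 "headline alignment number": average a few
# unwanted-behavior evals, keep the per-category breakdown for the footnote.
# None of these categories or scores are real published results.
from statistics import mean


def headline_alignment_score(breakdown: dict[str, float]) -> float:
    """Aggregate per-category scores (0-100, higher = less unwanted behavior)
    into one number with a plain unweighted mean -- the asterisk-worthy part."""
    return round(mean(breakdown.values()), 1)


# Placeholder per-behavior results for one imaginary model release.
breakdown = {
    "honesty_under_pressure": 91.0,   # e.g. MASK-style honesty evals
    "sycophancy_resistance": 78.5,
    "scheming_resistance": 88.0,      # e.g. Apollo-style scheming evals
    "sandbagging_resistance": 84.0,
}

print(f"Alignment*: {headline_alignment_score(breakdown)}")  # goes in the comparison table
print(f"* breakdown: {breakdown}")                           # goes in the footnote
```

The point isn't the mean. The point is that even an aggregation this dumb gives you a column to put next to SWE-bench, and something concrete for people to argue should be weighted differently.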
Put a number there, especially an imperfect one, and three things will happen: developers will ask about it, smart people will race to build a better version of it (anything that goes on a launch slide gets a lot of aura and gets cited everywhere), and labs will start to compete on it.
The other obvious objection is Goodhart. You might argue that the best way to pretend to be good at coding is to actually be good at coding, and that might not hold for alignment. I genuinely don't know, and the asymmetry does make sense. Worse, a gamed coding benchmark gives you a model that's worse at coding than advertised, but a gamed alignment benchmark gives you a model that looks safe and isn't, which is the exact failure mode you were trying to measure in the first place. Fair. But so what? What we currently have isn't a "no gaming" arrangement. It's a private metric we can't verify, where we're asked to take each company's word for it. A public, imperfect benchmark at least gets a thousand external eyes on whether the number is real. That's not a guarantee. It's just a lot better than what we have now. The reason the headline table needs this benchmark is exactly to pressure the field to deal with these problems.
Maybe "hard" isn't the reason this is missing from the launch slide. Maybe the numbers just look bad, and the lab going first is afraid of eating the reputational cost while its competitors stay quiet. If that's the case, how do we shift the scale so that silence hurts more than a bad number?

