The one I trust

A new model dropped yesterday, and for once I didn't rush to the leaderboards.

Anthropic put out Claude 3 — three of them, Opus at the top. The benchmark numbers are good. They're always good now. Every lab ships a chart where their bar is tallest, and I've stopped believing the chart tells me anything I'll feel on a Tuesday afternoon with a real problem in front of me.

So I did what I actually do. I fed it a knotty contract clause I'd been chewing on, the kind with three nested conditions and a Czech-law wrinkle the American models usually flatten. Then I asked it to rewrite a paragraph of mine that had gone stiff.

Both came back better than I expected. Not just correct — considered. It hedged where it should have hedged and committed where it should have committed. The prose had a spine. It didn't pad, didn't flatter, didn't bury the answer under six bullet points of throat-clearing.

That's the thing I keep coming back to. I no longer pick a model because it scored two points higher on some exam designed by other people. I pick the one whose judgment I'd defend to a client. The one I'd let near my writing without standing over its shoulder.

Opus is now the tab I open first. Not because a graph told me to. Because after twenty minutes I trusted it, and trust is a strange word to use about software, and I'm using it anyway.

We've been measuring these things wrong. Capability was the early question. The next question is quieter and harder: which of them do you actually believe?

I think that question is going to matter more every month from here.