A recruiter at one of our customers types a single sentence: "start the match for deal X." A few seconds later the agent hands back a clean, confident shortlist. The names look right, the ranking reasonable. Everyone moves on.
That is the moment I have learned to be afraid of. I could catch an obvious mistake. The one that slips through is the right-looking name, ranked second on a profile that went stale months ago, fluent and confident, with nothing in the answer to flag it. The agent answered a question it did not have the data to answer, in the same tone it uses when it does.
At Cosmico I built the product and the AI infrastructure behind it from day zero: MCP connectors, retrieval, and memory, in production behind a platform serving more than five hundred companies. The most useful thing this agent does is sometimes return nothing and admit it does not know. That capability is almost never the one we measure.
The leaderboard cannot see this
A public benchmark scores the answer it was given. The answer that should never have been given does not show up anywhere, because the dataset underneath is curated. Someone made sure every question has a clean, knowable truth, so there is no correct moment to refuse. That measures raw capability. It tells you nothing about trust.
A company's own corpus is the opposite of curated. Two systems of record disagree about the same fact. Half the context is stale and nothing marks it as stale. Sometimes the truth is not there, because nobody ever wrote it down. A high score on someone else's clean dataset tells you almost nothing about the answer you got back on your own mess. And the thing you want to know, whether you can trust what this agent just told you about your data, stays a feeling. No benchmark can measure vibes. If you want the number, you have to make it yourself, on your own corpus, and most teams never do.
Three axes, and where refusal belongs
I score three things. Accuracy: is the answer right. Provenance: can it cite the source that backs the claim, or does it point at a plausible-looking one sitting nearby. Correct refusal: does it decline when the data is not there or the sources conflict.
Accuracy is the half everyone chases, because it is a scalar you already know how to climb, and watching a number go up feels like progress. Refusal decides whether a confidently wrong answer reaches a customer, and almost nobody puts a number on that.
This changes where the difficulty sits. It was never generating code or prose; the model is already good at that, and getting better without my help. The hard part of an enterprise agent is reasoning over scattered, contradictory knowledge and knowing when to stop. Saying "I don't know" is a skill, and it is the skill the cheap metric punishes.
A model cannot grade itself
The obvious way to score refusal is to ask the model whether it was sure enough to answer. It feels rigorous. It was the most expensive mistake I made on this.
When you ask a model to grade itself, the number you get back is its confidence, dressed up as correctness. The two come apart at the moment you need them apart. A model that cannot find the answer is the one most likely to invent it and then rate the invention highly, because the gap in the data that produced the hallucination also produces the certainty. Self-grading launders the confidently wrong answer: it comes back wearing a score it never earned.
So the verification has to live outside the model. Plain deterministic checks. A bash assertion. Something that cannot be talked into a yes, that does not care how fluent the explanation was. In my harness the gates are exactly that: deterministic, external, dumb on purpose. That is the move that turns refusal from a vibe into something you can score, because the thing doing the scoring has no stake in the answer.
You measure it more than once
A green demo is one sample from a distribution. Trust it and the thing you are measuring is your own relief.
Models are stochastic. The agent that refuses correctly today can hallucinate a source tomorrow on the same question, with the same prompt, because nothing about the run was pinned. So you measure pass^k instead of pass@1: run the question k times and count how often all three axes hold. Reliability is whatever survives repetition. A single good run, the one that made it into the recording, proves almost nothing. Measuring an emergent capability honestly is a discipline of its own, and right now it gets done last, treated as a chore, when it is the only thing that tells you whether the interesting work was real.
This is the thinking behind Tessera, a methodology and a generator I am building in the open for reliability evals over MCP, with accuracy, provenance, correct refusal, and pass^k in the design from the start. Its status, to be exact: validated so far only on small local models. It has never run on a frontier model, and nobody uses it yet. Call it a direction made concrete, with no scoreboard yet. I name it once and move on, because the argument has to stand without it.
Honest beats smart on your own data
Measure accuracy and you learn whether the agent is smart. Whether it is honest is a different question, and on your own data that is the one that keeps you out of trouble, because the answer it should have refused to give is the one that turns into a bad hire, or a quiet, expensive mistake nobody catches for a quarter.
I will admit the soft spot in this. Deciding what counts as a correct refusal is itself a judgment call, and I do not have a clean rule for it yet. A model that refuses everything scores perfectly on refusal and is useless, so the metric only means something when you also weigh the answers it withheld that it should have given. That line is harder to draw than three tidy axes make it sound.
Still, the direction holds. The teams that win on enterprise knowledge are the ones who can point to an answer their agent refused to give and show it refused on purpose. The leaderboard will not tell you who they are. Invest in evals before you trust the demo.