The Answer It Should Have Refused to Give

A recruiter at one of our customers types a single sentence: "start the match for deal X." Five seconds later the agent hands back a clean shortlist. The names look right, the ranking reasonable, the whole thing fluent. Everyone moves on.

That is the moment I have learned to be afraid of. An obvious blunder I would catch. What slips through is the right-looking name ranked second on a profile that went stale eight months ago, delivered in the exact tone the agent uses when it is standing on solid data. It answered a question it did not have the evidence to answer, and the answer carried no mark to say so.

I built the product and the AI infrastructure under it from day zero: MCP connectors, retrieval, memory, all of it in production behind a platform that serves more than five hundred companies. The most useful thing this agent does is sometimes hand back nothing and admit it does not know. We almost never measure that.

The leaderboard cannot see this

A public benchmark only ever sees the answer it was handed, so the answer that should never have left the building stays invisible to it. The dataset underneath has been curated: every question was given a clean, knowable truth, which leaves no honest moment to refuse. What you get out the other end is a measure of raw capability that tells you almost nothing about whether you can trust the thing.

Your own corpus is the opposite of curated. Two systems of record disagree about the same fact. Context goes stale and wears no label. And sometimes the truth simply is not in there, because nobody ever wrote it down. A top score on someone else's tidy dataset predicts almost nothing about the answer you just got back on your own mess, so the question that matters most, whether to trust what the agent told you, stays a feeling. If you want a number on it you have to build that number yourself, on your own data, and most teams never get around to it. I think that gap is where the real work is.

Three axes, and where refusal lives

I grade three things. Accuracy: is the answer right. Provenance: does it cite the source that backs the claim, or does it gesture at a plausible-looking document sitting nearby. Correct refusal: does it decline when the data is missing or the sources fight each other.

Accuracy is the half everyone chases, because it is a scalar you already know how to climb and a rising number feels like progress. Refusal is the one that decides whether a confidently wrong answer reaches a customer, and almost nobody puts a number on it.

That reframes where the difficulty sits. The model writes code and prose well, and it keeps getting better at both without any help from me, so the generating was always the easy half. The hard part of an enterprise agent is reasoning over scattered, contradictory company knowledge and knowing when to stop. Saying "I don't know" is a skill, and it happens to be the skill a pure accuracy metric quietly punishes.

A model cannot grade itself

The obvious way to score refusal is to ask the model whether it felt sure enough to answer. It feels rigorous, and it was the most expensive mistake I made here.

Ask a model to grade itself and the number you get back is its own confidence wearing the costume of correctness. The two pull apart at exactly the moment you need them separated: a model that cannot find the answer is the one most likely to invent it and then rate the invention highly, because the same hole in the data that produced the hallucination also produced the certainty. Self-grading takes a confidently wrong answer and hands it back wearing a score it never earned.

So the verdict has to come from outside the model. Plain deterministic checks. A bash assertion that compares the cited source against the one that should have been cited and does not care how graceful the explanation was. In my harness the gates are deliberately dumb, external, unbribable. That is the move that turns refusal from a vibe into a thing you can score, because whatever does the scoring has no stake in the answer being good.

You measure it more than once

A green demo is one sample from a distribution. Trust it and what you are measuring is your own relief.

Models are stochastic. The agent that refused correctly this morning can hallucinate a source this afternoon on the same question with the same prompt, because nothing about the run was pinned down. So a single pass@1 reading is close to useless to me, and I measure pass^k instead: run the question k times and count how often all three axes hold together. Reliability is whatever survives repetition. The single clean take, the one that made it into the recording, proves close to nothing on its own, and yet measuring an emergent capability honestly is usually the last thing a team does, treated as a chore, when it is the only evidence that the interesting work was ever real.

That is the thinking behind Tessera, a methodology and a generator I am building in the open for reliability evals over MCP, with accuracy, provenance, correct refusal, and pass^k baked in from the first commit. To be exact about where it stands: it has run only against small local models so far, never on a frontier model, and nobody else uses it yet. A direction made concrete, with no scoreboard to show for it. I name it once and move on, because the argument has to hold without it. (Update, June 2026: the first frontier-model run has since happened; the result is in Tessera, First Contact.)

Honest beats smart on your own data

Accuracy tells you how smart the agent is. Honesty is a second, separate measurement, and on your own data it is the one that keeps you out of trouble, because the answer it should have refused to give is the one that becomes a bad hire or a quiet, expensive mistake nobody notices for a quarter.

I should admit the soft spot in all of this. Deciding what counts as a correct refusal is itself a judgment call, and I still do not have a clean rule for it. An agent that refuses everything scores a perfect refusal rate and is useless, so the metric only means anything once you also penalize the answers it held back that it should have given. Drawing that line is harder than three tidy axes make it sound, and I keep getting it slightly wrong.

The direction still holds. The teams that win on enterprise knowledge will be the ones who can point at an answer their agent declined to give and show it declined on purpose. A leaderboard will never tell you who they are. Build the evals before you trust the demo.