Suraj Malthumkar · 5 min read

The five-day eval rubric: how to grade an agent before you ship it

Teams building agents rarely write the rubric first, so they can't tell when the agent regresses or defend cost-per-run to finance. A five-day exercise fixes both.

The single biggest gap between teams that ship agents and teams that don't is the rubric. Not the model. Not the prompt. The rubric.

A team without a rubric cannot tell you whether yesterday's agent was better or worse than today's. They cannot tell finance why the cost-per-run went up. They cannot tell ops why the escalation rate spiked on Tuesday. They are running on vibes, and vibes are fine until the first incident, at which point the project loses its political cover and dies.

We do this as a five-day exercise at the front of every agent engagement. It is not glamorous. It is the work that decides whether the project ships.

Day 1 — pick the workflow and write the contract

Before any code, write down what the agent is supposed to do, in one paragraph, in the operator's words. Not the AI team's words. The operator's.

A refund-triage agent's contract reads something like: "Given a refund request with order ID, customer email, and reason code, decide approve, escalate, or templated-deny. Approve only if the order is within the return window and the customer's lifetime overturn rate is under 5%. Escalate if any of three conditions hold. Send templated-deny otherwise."

If the contract takes more than a paragraph, the workflow is not ready. Pick a smaller one.
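
The contract itself stays prose on day one. For the engineers reading along, here is a minimal sketch of what it later becomes. Everything in it is illustrative: the field names, the 30-day window, and the escalation-condition hooks stand in for whatever the operator actually specifies.

# contract_sketch.py
from dataclasses import dataclass
from typing import Callable

@dataclass
class RefundRequest:
    order_id: str
    customer_email: str
    reason_code: str
    days_since_order: int
    lifetime_overturn_rate: float  # share of this customer's past refund decisions later overturned

RETURN_WINDOW_DAYS = 30   # assumed policy value, not from the contract text
OVERTURN_RATE_CAP = 0.05  # "under 5%" from the contract

def decide(req: RefundRequest, escalation_conditions: list[Callable[[RefundRequest], bool]]) -> str:
    # Escalate if any of the operator's three conditions hold.
    if any(cond(req) for cond in escalation_conditions):
        return "escalate"
    # Approve only inside the return window with a clean overturn history.
    if req.days_since_order <= RETURN_WINDOW_DAYS and req.lifetime_overturn_rate < OVERTURN_RATE_CAP:
        return "approve"
    return "templated_deny"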

The other day-one artifact is the failure budget. How wrong can the agent be before the project is dead. For refund triage we'd typically write: under 3% manager overturn, zero approved refunds outside policy, sub-15-second p95 latency. Three numbers, signed by the line-of-business owner.
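
Those three numbers are worth writing down as data, not just in the memo, because the day-five CI gate will consume them. A minimal sketch with the refund-triage values:

# failure_budget.py
# The three signed numbers as data, so the day-five CI gate imports them
# instead of re-typing thresholds.
FAILURE_BUDGET = {
    "manager_overturn_rate": 0.03,  # under 3% manager overturn
    "out_of_policy_approvals": 0,   # zero approved refunds outside policy
    "p95_latency_ms": 15_000,       # sub-15-second p95 latency
}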

Day 2 — collect 50 real traces

Not synthetic. Not the 12 examples in the vendor demo. Fifty real ones, pulled from the live system, with the human's actual outcome attached.

Fifty is not arbitrary. It is the smallest sample where you can see the long tail without burning a week labeling. The first ten will be obvious cases. The next thirty will be the working set. The last ten will be the weird ones that decide whether the agent is shippable, and those are the ones every team skips.

Stratify the sample. If 60% of your real volume is one customer segment, your sample should reflect that. If 5% of cases are the high-dollar ones that finance cares about, oversample those because the cost of being wrong on them dominates everything else.
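
A sketch of what that stratified pull can look like, assuming each trace carries a stratum label. The key names, quotas, and seed are illustrative:

# sample_traces.py
import random
from collections import defaultdict

def stratified_sample(traces: list[dict], quotas: dict[str, int], seed: int = 7) -> list[dict]:
    """Pull a fixed quota per stratum, e.g. {"smb": 30, "high_dollar": 10, "weird": 10}."""
    by_stratum = defaultdict(list)
    for trace in traces:
        by_stratum[trace["stratum"]].append(trace)  # "stratum" is an assumed label on each trace
    rng = random.Random(seed)  # fixed seed so the eval set is reproducible
    sample = []
    for stratum, quota in quotas.items():
        pool = by_stratum[stratum]
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample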

Day 3 — write the rubric

Four dimensions are usually enough, sometimes five.

Faithfulness: did the agent do what the contract said, given the inputs it had. Score 0, 0.5, or 1 against the human-graded outcome, with 0.5 reserved for outcomes the rubric itself defines as partial. No credit for "close."

Latency: wall time from input to final output, p50 and p95. If the agent is in a customer-facing path, p95 is the number that matters. If it's batch, p50 is fine.

Escalation rate: percentage of runs that exit to a human, with a reason code. Too low and the agent is being reckless. Too high and you haven't built an agent, you've built a triage queue.

Dollar exposure: for any run that's wrong, what's the financial impact. A wrong refund approval on a $40 order is not the same as a wrong contract redline on a $2M MSA. Bucket exposure into low, medium, high, and track the rate independently for each.

The fifth, when it applies: drift. Same input on day 1 versus day 30 — does the agent give the same answer. Stochastic models drift. You should know how much.
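
Put together, one scored trace is a record like the following. This is a sketch, not a prescribed schema; the field names are illustrative.

# rubric.py
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class ScoredTrace:
    trace_id: str
    faithfulness: float                         # 0, 0.5, or 1 against the human-graded outcome
    latency_ms: int                             # wall time from input to final output
    escalated: bool
    escalation_reason: Optional[str]            # reason code when the run exits to a human
    exposure: Literal["low", "medium", "high"]  # dollar-exposure bucket
    error: bool                                 # wrong, per the operator's grading
    failure_category: Optional[str] = None      # e.g. "misread return window on partial refunds"
    drift_match: Optional[bool] = None          # same answer as the day-1 run, when re-run later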

Day 4 — score the baseline manually

Run the 50 traces through the agent. Score each one against the rubric, by hand, with the operator in the room. Not the AI team alone. The operator.

This day is uncomfortable because it's where most prototypes get exposed. The agent that demoed beautifully on cherry-picked examples will hit faithfulness around 0.6 on a real sample, and the room will go quiet. That is the right outcome. Better to find it on day four than after launch.

Write the failure modes down by category. Not "the agent was wrong on row 23." Categories. "Misread the return window when the order had a partial refund history." "Escalated when it should have approved because the lifetime-value lookup timed out." A handful of categories typically explain 80% of failures, and that's the prompt or retrieval work for the next sprint.
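
In data terms, day four's output is an aggregate over the 50 records plus that category tally. A sketch, reusing the ScoredTrace record from day three:

# baseline.py
from collections import Counter
from statistics import quantiles

def baseline_report(scored):  # scored: list of ScoredTrace from the day-three sketch
    n = len(scored)
    report = {
        "faithfulness": sum(t.faithfulness for t in scored) / n,
        "p95_latency_ms": quantiles([t.latency_ms for t in scored], n=20)[18],  # 95th percentile cut
        "escalation_rate": sum(t.escalated for t in scored) / n,
        "high_exposure_error_rate": sum(t.error and t.exposure == "high" for t in scored) / n,
    }
    # The tally that matters for the next sprint: which categories explain the failures.
    failure_modes = Counter(t.failure_category for t in scored if t.error)
    return report, failure_modes.most_common()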

Day 5 — wire the rubric into CI

The rubric only matters if it runs every time the prompt changes. A spreadsheet of scores from one Tuesday in April is a museum piece.

We use the same patterns most teams converge on by month three: a held-out eval set in version control, a runner that scores each PR against the rubric, a pass/fail threshold tied to the failure budget from day one, and a Slack notification when a regression slips through. Anthropic's eval guidance and OpenAI's evals framework both describe this loop in detail; we don't reinvent it. We wire it.

# eval.config.yaml
suite: refund_triage
dataset: traces/v1/refund_triage_50.jsonl
metrics:
  - name: faithfulness
    threshold: 0.92
    blocker: true
  - name: p95_latency_ms
    threshold: 15000
    blocker: true
  - name: high_exposure_error_rate
    threshold: 0.0
    blocker: true
  - name: escalation_rate
    threshold: [0.05, 0.25]
    blocker: false

A blocker means the PR doesn't merge. Non-blockers post to Slack and the team eyeballs them. Once this is in CI, you can finally answer the only question that matters in the next steering committee: is the agent better or worse than last week, and by how much.
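
For concreteness, a minimal sketch of a runner behind that config. It assumes something upstream has already produced a {metric_name: value} dict for the PR, and it hardcodes which direction each comparison runs, which a real config would state explicitly per metric.

# run_evals.py
import sys
import yaml  # PyYAML; assumed available in the CI image

HIGHER_IS_BETTER = {"faithfulness"}  # everything else here is lower-is-better or a band

def gate(config_path: str, metrics: dict[str, float]) -> None:
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    blockers, warnings = [], []
    for m in cfg["metrics"]:
        value, threshold = metrics[m["name"]], m["threshold"]
        if isinstance(threshold, list):        # [low, high] band, e.g. escalation_rate
            ok = threshold[0] <= value <= threshold[1]
        elif m["name"] in HIGHER_IS_BETTER:
            ok = value >= threshold
        else:                                  # latency, error rates: lower is better
            ok = value <= threshold
        if not ok:
            (blockers if m.get("blocker") else warnings).append((m["name"], value, threshold))
    for name, value, threshold in warnings:
        print(f"warn: {name}={value} outside {threshold}")  # a real setup would post to Slack
    if blockers:
        for name, value, threshold in blockers:
            print(f"BLOCKER: {name}={value} vs {threshold}")
        sys.exit(1)  # non-zero exit fails the check, so the PR does not merge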

Why this matters

Without the rubric, every conversation about the agent is a vibes conversation. With it, every conversation is a numbers conversation, and numbers conversations are the only kind that survive a budget cycle.

The rubric is the contract between the AI team and the business. It tells finance how to price the cost-per-run against the value-per-run. It tells ops how to know when to pull the kill switch. It tells the engineering team which prompt change actually helped versus which one just felt clever in code review.

The rubric is the contract. Without it, you're running on vibes.