Sales Coaching

AI scorecards are theatre unless they measure customer understanding

6 min read · Alex Handsaker

When most people picture AI call scoring, they're picturing talk ratios, question counts, agenda adherence, next step mentions - software that listens to a call and tells you whether the rep did the things reps are supposed to do. It's a legitimate starting point. These signals are detectable, measurable, and correlate with good calls often enough to be worth tracking.

The problem is that they correlate with good calls without being the thing that makes a call good. A rep can nail every one of those metrics while the buyer leaves the call with no real understanding of the product, no clarity on why it solves their problem, and no intention of moving forward. The scorecard shows a strong performance; the conversation didn't actually work. And if you're coaching from those scores, you're optimising inputs that may have nothing to do with outcomes.

What AI scorecards are actually measuring

Most AI call scoring evaluates observable rep behaviours: talk ratio, questions asked, whether they mentioned pricing, whether they set an agenda. These are proxies - correlated with good calls, but not the thing that makes a call good.

A call is good when the buyer understands the problem, connects it to your product, and has enough information and confidence to take a next step. That's a buyer outcome. Measuring whether the rep followed a script is measuring the input, not the output.

This distinction matters because the two are not the same. A rep can follow every coaching guideline and still leave the buyer confused; a rep can deviate from the script entirely and run the best conversation of the quarter. Scoring script adherence and scoring buyer outcomes are two different measurements, and conflating them produces coaching that optimises the wrong thing.

Hand-sketched comparison of rep input metrics such as talk ratio, agenda, and questions versus buyer outcomes such as clear pain, understood fit, and real next step.
Rep inputs can be useful, but they are not the same as buyer understanding.

Where scorecards drift

The failure mode is quieter than it looks. Scorecards reward compliance because compliance is measurable: the rep hit the talk track, asked a question at the right stage, followed the sequence. The score looks consistent and objective. Inside the buyer's head, nothing is guaranteed - a rep can run the discovery checklist, get a high score, and still leave the call with the buyer misunderstanding the problem scope. The scorecard never flags it because the rubric never required evidence like “buyer stated the problem in their own words” or “buyer repeated the agreed-to success criteria.”

The practical tell is straightforward: if your scorecard can't point to explicit buyer comprehension evidence in the transcript, it's reporting, not coaching. Not a summary that sounds aligned, not the fact that the buyer didn't push back - actual lines you can point to. What the buyer said they understood. What they repeated correctly. What misconceptions persisted after the rep's explanation. If the scorecard can't cite that, it's guessing.

The common fix is more enablement content: more playbooks, more coaching frameworks, more polished follow-ups. That improves rep behaviour, not buyer interpretation, and the gap between the two is where theatre lives.

Hand-sketched scorecard dashboard showing a high score while the buyer remains unclear because buyer evidence is missing from the evaluation.
A high score is theatre if the scorecard cannot point to buyer-side evidence.

What measuring buyer understanding looks like

Measuring buyer understanding means asking different questions in your evaluation schema. Not “did the rep ask open questions?” but “did the buyer articulate a specific pain in their own words?” Not “did the rep mention pricing?” but “did the buyer indicate they understood the pricing model and responded to it?” Not “was a next step mentioned?” but “was a next step agreed, with a specific owner and a real date?”

These questions require semantic evaluation: you're looking for evidence of what the buyer said and did, not what the rep said and did. That evidence is in the transcript - but only if you're looking for it.
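To make the shift concrete, here is a rough sketch of what those two kinds of questions look like side by side as an evaluation schema. The structure and field names are illustrative assumptions, not Semarize's actual schema - the point is simply that buyer-outcome checks are phrased about the buyer and required to cite transcript evidence.

```python
# Illustrative only: the shape of rep-behaviour checks versus buyer-outcome checks.
# Field names are assumptions for this sketch, not a real API.

REP_BEHAVIOUR_CHECKS = [
    "Did the rep ask open questions?",
    "Did the rep mention pricing?",
    "Was a next step mentioned?",
]

BUYER_OUTCOME_CHECKS = [
    {
        "id": "specific_pain_articulated",
        "question": "Did the buyer articulate a specific pain in their own words?",
        "output": "boolean",
        "requires_evidence": True,  # must cite buyer-side transcript lines
    },
    {
        "id": "pricing_understood",
        "question": "Did the buyer indicate they understood the pricing model and respond to it?",
        "output": "boolean",
        "requires_evidence": True,
    },
    {
        "id": "next_step_agreed",
        "question": "Was a next step agreed, with a specific owner and a real date?",
        "output": "boolean",
        "requires_evidence": True,
    },
]
```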

When you evaluate for buyer outcomes, the signal you get back is fundamentally different from rep behaviour scores. A deal where the buyer articulated clear pain and confirmed a timeline looks very different from a deal where the rep covered all the topics but the buyer stayed non-committal. Both might score similarly on a rep behaviour scorecard; they should not score similarly on a buyer understanding scorecard.

Why this ends up as theatre

When scorecards measure the wrong thing, coaching optimises the wrong thing. Reps learn to game the behaviours the scorecard measures: ask more questions, mention next steps explicitly, structure calls to hit the agenda items. Scorecard scores go up. The actual quality of buyer interactions doesn't necessarily follow.

The machinery is running, the dashboards are full, the coaching conversations are happening - but they're all based on inputs that correlate with quality rather than on quality itself. Evaluation design is where this breaks, not AI capability. The AI will faithfully evaluate whatever you tell it to evaluate; the question is whether that's the right thing to be measuring.

Building a scorecard that measures the right thing

The fix requires changing what you define as the unit of evaluation. Start with “what does a successful conversation look like from the buyer's perspective?” rather than “what did the rep do?”

In a successful discovery call, the buyer articulates a specific, quantifiable problem. They describe the current state in detail. They react to the proposed solution in a way that shows they understand it. They either commit to a next step or explain specifically what's in the way. Each of those outcomes can be evaluated as a Brick - a discrete evaluation unit with a defined output type and evidence spans showing what the buyer actually said.
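As an illustration only - the class and field names below are assumptions for this sketch, not the Semarize API - a Brick-style unit might be modelled like this: a named check with a defined output type, a value, and the buyer-side transcript spans that justify it.

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceSpan:
    """A verbatim slice of the transcript, attributed to a speaker."""
    speaker: str       # e.g. "buyer" or "rep"
    start_char: int
    end_char: int
    text: str

@dataclass
class Brick:
    """One discrete evaluation unit: a defined output type plus supporting evidence."""
    name: str                  # e.g. "specific_pain_articulated"
    output_type: str           # e.g. "boolean", "enum", "string"
    value: object              # the evaluated result
    evidence: list[EvidenceSpan] = field(default_factory=list)

    def is_grounded(self) -> bool:
        # Without buyer-side evidence, this is a behaviour metric, not a coaching signal.
        return any(span.speaker == "buyer" for span in self.evidence)
```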

The rep behaviours you care about - asking questions, active listening, following a framework - show up in the evidence. If the buyer articulated specific pain, the rep probably asked good questions to get there; you see the outcome and can infer the input. The reverse isn't true: you can see that the rep asked questions without knowing whether the buyer understood anything.

Hand-sketched buyer-understanding scorecard showing transcript evidence flowing into Bricks for specific pain, pricing understood, and next step agreed, producing a coaching signal.
Buyer-outcome Bricks turn transcript evidence into coaching signals that can be tracked.

When a rep's buyer understanding scores are low, you know the conversations aren't landing - regardless of whether they're hitting their call structure targets. Track lift on understanding signals over time, not on rubric completion. That's the coaching metric that connects to deal outcomes, and the one that tells you whether the coaching programme is actually working.
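A minimal sketch of that tracking, assuming each scored call exposes its buyer-outcome Bricks as pass/fail values (the data shape here is hypothetical): average the share of passed buyer-outcome checks per rep per period, then compare consecutive periods to see lift.

```python
from collections import defaultdict
from statistics import mean

def understanding_by_rep_and_period(calls):
    """calls: iterable of dicts like
    {"rep": "A", "period": "2025-Q1", "bricks": {"specific_pain_articulated": True, ...}}
    Returns {(rep, period): average share of buyer-outcome bricks that passed}."""
    shares = defaultdict(list)
    for call in calls:
        bricks = call["bricks"]
        if not bricks:
            continue
        shares[(call["rep"], call["period"])].append(
            sum(1 for passed in bricks.values() if passed) / len(bricks)
        )
    # Lift is the change in these averages from one period to the next.
    return {key: mean(values) for key, values in shares.items()}
```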

Common questions

How do we measure buyer understanding without it being subjective?

You measure observable comprehension events, not understanding as a feeling. What did the buyer repeat correctly? What misconception persisted? What did they explicitly accept or restate later in the call? These are verifiable against the transcript.

What should we do if our current scorecard only tracks rep actions?

Keep rep-action items that demonstrably support buyer understanding, but don't let them stand in for comprehension evidence. Rewrite rubric items so each score requires buyer-side evidence - if there's no buyer signal, it's a behaviour metric, not a coaching signal.

Can we rely on AI summaries, or do we need deterministic signals?

AI summaries are useful for individual call review, but they're not deterministic. A summary can sound aligned while missing the exact comprehension failure. Deterministic signals tie scoring to verifiable conversation evidence - the same evaluation, run the same way, every time.

What metrics should we track to prove coaching is working?

Track lift on buyer understanding signals over time and connect it to deal outcomes where possible. The first proof is whether comprehension events increase across reps - not whether reps hit more rubric checkboxes or complete more coaching modules.

Semarize is a conversational intelligence API built around buyer-outcome evaluation. Define what a good conversation looks like, run it against every call, get structured data back.

Start building →
