
AI Scorecards Don't Disagree. Your Prompt Does.

Alex Handsaker · 7 min read

Run the same sales call through your AI scoring system on Monday, then again on Thursday. If the scores differ - and they often do - the instinct is to blame the model. Temperature settings, model version updates, something opaque happening inside the LLM. That explanation feels plausible. It also points in the wrong direction.

The more common cause is simpler: the evaluation was never locked down in the first place. A freeform prompt asking an LLM to “evaluate this call and score the rep on discovery quality” is an instruction to interpret. The model re-interprets it every time - differently depending on prompt phrasing, context window contents, model version, and temperature. Same call, same prompt, different result. That's not AI randomness. It's freeform evaluation.

What “different results” actually means

Score inconsistency shows up in two forms, and the second is more damaging than the first. Numeric drift - a 7 becomes a 5 on re-run - is visible and easy to flag. Category flips are subtler: a MEDDICC field that was populated disappears; a coaching flag that fired last week doesn't fire this week on a structurally identical call; a reason code that said “strong discovery” comes back as “incomplete qualification.”

Category flips are more dangerous because they feed decisions. When a rep's MEDDICC Champion field flips between calls not because of anything the rep did differently, but because the evaluation re-interpreted the prompt, the coaching conversation is built on noise. The rep receives contradictory feedback about the same behaviour. That doesn't just reduce coaching effectiveness - it erodes trust in the scoring system entirely.

Transcription errors can explain one-off variance. They don't explain systematic score swings on identical calls when the rubric hasn't changed. The variable is the evaluation, not the transcript.

Category flips on the same call are an evaluation stability problem, not a transcript problem.

Why the prompt is the problem

Freeform prompts ask the model to do interpretation work at evaluation time. That interpretation is sensitive to four things you don't fully control: how the instruction is phrased, what else is in the context window, the model version running at that moment, and temperature settings. Change any one of those - even invisibly, as happens during a model update you didn't request - and the interpretation shifts.

The deeper issue is that most AI scorecards are measuring the wrong thing even when they're running consistently. AI scorecards are theatre unless they measure customer understanding - and most freeform rubrics are measuring rep behaviour: did they follow the script, did they ask the right number of questions, did they use the approved framework language. A scorecard that measures whether the buyer understood anything, confirmed a timeline, or articulated specific pain is measuring outcomes. A scorecard that measures whether the rep hit their talk ratio is measuring activity.

Teams buy conversation intelligence to improve coaching, then coach off the wrong signals because the rubric was built around what the rep did, not what the buyer understood as a result. Fixing drift without fixing the measurement objective produces a consistently wrong result.

Why rewriting the prompt doesn't fix it

The standard response to scorecard inconsistency is to improve the prompt. Add more specific instructions. Tighten the rubric language. Define what each score level means. That work produces better average results. It doesn't eliminate drift.

Every prompt rewrite changes the interpretation space. You tighten one dimension and introduce new variability in another. The next model update - invisible to you until scores start behaving differently - shifts the interpretation space regardless of how carefully the prompt was written. Switching to a better model has the same problem: a more capable model interpreting a freeform prompt produces better average results, but still variable results, because the interpretation work is still happening at evaluation time.

The problem isn't prompt quality. It's that freeform prompts structurally can't produce deterministic results. You can constrain the interpretation but you can't eliminate it without changing the evaluation architecture.

A locked evaluation contract removes interpretation from the scoring run.

The evaluation contract

The fix is to replace the freeform prompt with an evaluation contract: a locked schema that specifies what is being evaluated, what the output type is, and what evidence is required for each result. The interpretation work happens once, when you define the schema - not on every evaluation run.

In practice: instead of “evaluate discovery quality,” an evaluation contract asks “Did the buyer state a specific, quantifiable pain? Return yes/no. What exact quote from the transcript constitutes evidence?” Each evaluation unit asks one question, specifies one output type, and returns one defined result. The model isn't being asked to interpret your intent - it's applying a defined contract to what the buyer said.
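
As a sketch of the idea only - the structure and field names below are illustrative, not Semarize's actual schema - an evaluation unit can be written down as data rather than prose:

```python
# Illustrative only: one way to express a locked evaluation unit as data.
from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class EvaluationUnit:
    field_id: str                                     # stable identifier, never reused
    question: str                                     # one question, phrased once, at design time
    output_type: Literal["yes_no", "quote", "date"]   # one defined output type per field
    evidence_required: bool                           # must the result cite an exact transcript quote?

PAIN_IDENTIFIED = EvaluationUnit(
    field_id="buyer_pain_stated",
    question="Did the buyer state a specific, quantifiable pain?",
    output_type="yes_no",
    evidence_required=True,
)
```

The point is that the question, the output type, and the evidence requirement are fixed when the unit is defined, so there is nothing left to re-interpret at evaluation time.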

This is the model Bricks and Kits are built on. A Brick is a single evaluation unit with a defined question and output type. A Kit is a collection of Bricks that runs as a reusable evaluation workflow. The same Kit run against the same transcript returns the same output shape every time. If you need to change how something is evaluated, you update the Brick - and you know exactly what changed and why.
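
Continuing that sketch - again an illustration of the idea, not Semarize's API - a Kit is then just a fixed list of those units, and the function that runs it returns the same output shape for every transcript. The `score_unit` argument is a stand-in for whatever model invocation sits underneath:

```python
from typing import Any, Callable

def run_kit(
    kit: list[EvaluationUnit],
    transcript: str,
    score_unit: Callable,   # assumed model call: (unit, transcript) -> (answer, evidence)
) -> dict[str, dict[str, Any]]:
    results = {}
    for unit in kit:
        answer, evidence = score_unit(unit, transcript)
        results[unit.field_id] = {
            "answer": answer,
            "evidence": evidence if unit.evidence_required else None,
        }
    return results  # keys and output shape are fixed by the Kit, not by the model
```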

Measure consistency, not average scores

Once the evaluation contract is locked, the right thing to measure changes. Average scores across a team or time period are useful for trend analysis, but they're meaningless if the schema is unstable - a high average on an inconsistent rubric is noise. The prerequisite is consistency: the same calls producing the same outputs across re-runs.

Measure this by keeping a held set of calls - ideally spanning the range of quality you expect to see - and re-running the evaluation Kit against them periodically. If outputs are stable, the schema is sound. If category flips appear, isolate the specific Brick that's drifting rather than rewriting the whole prompt. The sales coaching use case covers how to set up a consistency baseline before rolling a new evaluation schema out to the full call population.
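
A minimal version of that comparison, assuming the output shape from the `run_kit` sketch above, might look like this:

```python
def find_category_flips(run_a: dict, run_b: dict) -> dict:
    """Compare two runs of the same call and return fields whose answers changed."""
    flips = {}
    for field_id, result_a in run_a.items():
        result_b = run_b.get(field_id, {})
        if result_a.get("answer") != result_b.get("answer"):
            flips[field_id] = (result_a.get("answer"), result_b.get("answer"))
    return flips
```

Anything it returns points at a specific Brick to investigate, rather than at the evaluation as a whole.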

Treat the evaluation schema like code: version it, document changes, and track the effect of each update on output distribution. When a Brick definition changes - because the standard for “next step confirmed” was tightened - that's a version bump, not a silent edit to a prompt document. Coaching reliability depends on evaluation stability. Neither is achievable with a freeform prompt.
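
One lightweight way to do that, assuming nothing fancier than a version string and a change log kept alongside the schema (the entries below are illustrative):

```python
# Hypothetical version record: the bump, the Brick that changed, and the reason
# travel together, so coaching dashboards and CRM fields can see what moved.
SCHEMA_VERSION = "2.1.0"

CHANGELOG = [
    {
        "version": "2.1.0",
        "brick": "next_step_confirmed",
        "change": "Tightened: now requires a named owner and a date, not just verbal agreement.",
        "action": "Re-run the held call set and re-baseline before rollout.",
    },
]
```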

A held call set turns scorecard consistency into something you can test and version.

A practical checklist for RevOps and coaching teams

Before trusting scorecard outputs to feed coaching conversations or CRM fields, work through these checks:

1. Define what each evaluation unit is actually asking - one question per field, one output type per field.
2. Verify that the rubric maps to buyer understanding signals, not rep activity proxies.
3. Run the same five calls through the evaluation twice and compare outputs; if categories flip, the schema isn't locked.
4. Version the evaluation schema separately from any underlying prompt document.
5. Set a re-run cadence for your held call set - monthly is a reasonable starting point - and treat category flips as a schema issue to investigate, not a model quirk to accept.

Semarize is built on the Brick/Kit evaluation model - locked schemas, consistent outputs, stable results across model updates.


Common questions

How do I tell if my AI scorecards are drifting or if it's just transcription noise?

Run the same call through your scoring system twice without changing anything. If the score or category outputs differ, the inconsistency is in the evaluation layer, not the transcript. Transcription quality is stable for a given call - it doesn't change between re-runs. Evaluation drift is the variable. If you see category flips - not just small numeric changes - on re-runs of the same call, your evaluation schema needs to be locked, not your prompts rewritten.

What does an evaluation contract look like in practice for call scoring?

An evaluation contract replaces a freeform prompt with a locked schema. Each field asks one question with a defined output type: “Did the buyer articulate a specific pain? Return yes/no with the supporting quote.” “Was a next step agreed? Return yes/no with the specific date and owner if yes.” The contract defines what evidence is required for each result. The model applies the contract - it doesn't interpret what you might have meant by “good discovery.”

Should we measure average score or consistency across re-runs?

Consistency is the prerequisite. A high average score on an inconsistent schema is noise - it tells you the model is interpreting your prompt in a way that produces flattering outputs on average, but individual results can't be trusted for coaching decisions. Measure consistency first by running a held set of calls across re-evaluations. Once the schema is stable, average scores and trend data become meaningful inputs for coaching and RevOps.

Can we fix this by switching to a better model?

Model quality affects average accuracy, not schema stability. A better model interpreting a freeform prompt produces better average results, but still variable results across re-runs, because the interpretation work happens at evaluation time. The fix is structural: lock the evaluation schema so the model isn't doing interpretation work - it's applying a defined contract to what the buyer said. Model choice matters less than evaluation architecture.

How do we map scorecard fields to buyer understanding instead of rep behaviour?

Buyer understanding fields ask whether the buyer demonstrated comprehension or intent: did they articulate a specific, quantifiable pain; did they confirm a timeline; did they agree to a next step with a named owner and date. Rep behaviour fields ask what the rep said or did: talk ratio, question count, framework adherence. Both are extractable from transcripts, but buyer understanding fields are the ones that predict outcomes - deals where buyers demonstrated clear understanding of the problem and next steps close at higher rates than deals where the rep followed the script but the buyer remained vague.
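
As a rough illustration - the field names here are invented for the example - the distinction shows up directly in how the fields are defined:

```python
# Illustrative field definitions only; names are hypothetical.
BUYER_UNDERSTANDING_FIELDS = {
    "buyer_pain_stated":   "Did the buyer state a specific, quantifiable pain? (yes/no + quote)",
    "timeline_confirmed":  "Did the buyer confirm a timeline? (yes/no + quote)",
    "next_step_agreed":    "Was a next step agreed with a named owner and date? (yes/no + quote)",
}

REP_BEHAVIOUR_FIELDS = {
    "talk_ratio":          "Rep talk time as a share of the call (percentage)",
    "question_count":      "Number of questions the rep asked (integer)",
    "framework_adherence": "Did the rep use the approved framework language? (yes/no)",
}
```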

How often should we update the rubric without breaking coaching reliability?

Treat rubric updates as version changes: document what changed, run the updated schema against your held call set before rolling it out, and communicate schema changes to anything that depends on the outputs - coaching dashboards, CRM fields, forecast models. For teams using scoring in active coaching programmes, a rubric change mid-quarter creates a break in the time series. Batch rubric updates to quarter boundaries where possible, or run the new and old versions in parallel during a validation window before cutting over.
