RevOps

AI Call Scoring Is Theatre Without a Knowledge Layer

7 min read · Alex Handsaker

AI call scoring that runs on a strong LLM with a well-written rubric produces results that look accurate. Scores arrive, categories get assigned, coaching flags surface. The problem shows up when you test those results against what actually happened: a rep misquoted a price tier and the scoring said the commercial conversation was handled well. A buyer was clearly outside your ICP and the discovery score came back positive. A competitor claim the rep made has been inaccurate for months, and the competitive positioning field keeps returning green. The evaluation was never wrong from the model's perspective. It was wrong from yours.

The failure mode is the same across every dimension: the model evaluates against what it can infer a good sales call looks like, not against what your company has actually defined. Without access to your documents, it falls back to training data assumptions. Those assumptions may be plausible enough that you don't notice they're wrong until a deal slips, a compliance complaint surfaces, or a rep who scores well keeps missing quota.

What a knowledge layer actually does

Knowledge grounding means attaching your organisation's documents to the evaluation so that scores become checks against your defined standards rather than inferences from training data. The distinction matters because it changes what the evaluation is actually doing. Without grounding, a field like “commercial discussion handled well” is an impression: the model assessed whether the conversation sounded competent against a general sense of what competent looks like. With grounding, the same field becomes a set of verifiable claims: was the price correct, was the value case made in the terms your methodology defines, was the scope of what's included accurately described. Impressions are provisional. Verifiable claims are auditable.

The “conversation intelligence fails on knowledge, not calls” framing covers why this distinction matters for everything downstream of the score. The transcript isn't the problem. What the evaluation has access to when reading it is.

Hand-sketched AI call scorecard in which pricing, ICP, and competitor checks look accurate while hidden issues (a wrong price, a buyer outside ICP, a stale competitor claim) go unflagged because company rules are not in context.
Ungrounded scoring can look plausible while missing the rules that make the answer correct for your business.

What depth within a single area looks like

The real value of a knowledge layer isn't catching one kind of error. It's that any commercial dimension worth assessing has multiple facets, and each facet requires its own grounded evidence to evaluate properly. Take a single area as an example. The question “how did the commercial discussion go?” isn't one question. It includes whether the price stated was accurate, whether the case for that price was made in terms the buyer cared about, whether the rep correctly described what the buyer receives for that investment, and how the rep handled pressure when the buyer pushed back. Each of those is a distinct evaluable question. Each needs a different document to answer it reliably.

A rate card tells you what prices are. It doesn't tell you whether the rep articulated the return the buyer could expect, or whether they correctly framed what the price actually covers. A value framework tells you what commercial outcomes the offering is meant to deliver, and against what. It doesn't tell you what the approved boundaries are when a buyer pushes on price. A packaging guide defines the scope of each tier. A negotiation policy defines what flexibility exists and under what conditions. Each document has a specific evidential purpose. Without that specificity, you end up with one document that tries to cover everything, which means the evaluation can't check anything precisely. The Bricks that produce useful scoring are the ones grounded in documents with a clear, bounded purpose.

This pattern holds across every assessment area. ICP qualification isn't one check: it's firmographic criteria, use case match, stakeholder seniority, and exclusion conditions, each answerable only against the document that defines the standard. Methodology adherence isn't one field: it's stage-specific requirements, each grounded in what your playbook says is required at that point. When you build assessment areas as structured sets of grounded checks rather than single rubric scores, the scoring is precise because each question is specific, and the evidence is interpretable because each answer comes from a document with a defined purpose.
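
To make that concrete, here is a minimal sketch of one commercial area decomposed into a structured set of grounded checks. Every name in it (check IDs, questions, document filenames) is illustrative rather than Semarize's actual schema; the point is the shape.

```python
# One assessment area decomposed into grounded checks. All names here are
# illustrative; the pattern is one question per check, one source per question.
COMMERCIAL_DISCUSSION = [
    {"check": "price_accuracy",
     "question": "Was the stated price correct for the quoted tier?",
     "source": "rate_card.pdf"},
    {"check": "value_case",
     "question": "Was the value case made in the terms the framework defines?",
     "source": "value_framework.pdf"},
    {"check": "scope_accuracy",
     "question": "Was what the buyer receives described per the tier definition?",
     "source": "packaging_guide.pdf"},
    {"check": "discount_handling",
     "question": "Did any flexibility offered stay within approved boundaries?",
     "source": "negotiation_policy.pdf"},
]
```

Four rows, four documents, no single field trying to average them into one impression.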

Hand-sketched commercial discussion diagram split into price accuracy checked against the rate card, the value case against the value framework, scope against the packaging guide, and discounting against the negotiation policy.
One commercial area often needs several grounded checks, each tied to a different source of truth.

Why prompt engineering can't substitute for this

The instinct when scoring is wrong is to improve the prompt: more specific rubric language, more detailed instruction, more context in the system message. This helps at the margins but doesn't solve the underlying problem. You can't pack your pricing rules, value framework, packaging scope, negotiation policy, ICP criteria, stage requirements, and competitive standards into a prompt and expect the model to apply each of them precisely across hundreds of calls. The context competes with itself, and the model generalises where it should be checking.

The signal that you're still in prompt-dependent territory is score drift. If rewriting the evaluation instructions produces meaningfully different scores on the same calls, the evaluation is interpreting rather than checking. A grounded evaluation produces the same result regardless of how the question is phrased, because the answer comes from the document. If your scores shift when your prompts shift, the knowledge isn't in the system.

What this enables for RevOps

Comprehensive grounded scoring changes what you can trust when it reaches your CRM, forecast model, or coaching workflow. A field that assesses a commercial dimension as a single score is a liability: it can be wrong in ways that stay invisible until a deal slips, because a single score averaged across five distinct facets can look fine even when two of them failed. Fields that assess each facet separately, grounded in the document that defines the standard, give you evidence at the level where the problem actually occurred. That evidence is what makes a field usable downstream rather than decorative.

Treat ungrounded scores as provisional inputs. Use them for directional trend detection and early-stage coaching signals. For anything that feeds a CRM field, a pipeline risk flag, a performance review, or a compliance audit, the evaluation needs to be checking against documents with defined purposes, not inferring from training data against a generic rubric.
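
If it helps to see that policy as a rule, here is a rough routing sketch. The destination names are made up for illustration; the only load-bearing idea is the grounded flag.

```python
# Rough routing sketch: grounded fields may drive decisions; ungrounded
# fields stay directional. Destination names are invented for illustration.
def downstream_targets(grounded: bool) -> list[str]:
    if grounded:
        # Decision-grade: CRM fields, pipeline risk flags, reviews, audits.
        return ["crm_field", "pipeline_risk_flag", "compliance_audit"]
    # Provisional: trend dashboards and early coaching signals only.
    return ["trend_dashboard", "coaching_signal"]
```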

How Semarize is built for this

Semarize supports knowledge grounding at the Kit level, and the Brick architecture is what makes multi-dimensional, document-grounded assessment possible at scale. Each Brick asks one specific question and accesses only the knowledge relevant to that question. A Brick checking whether the value case was made reads your value framework. A Brick checking whether the scope was correctly described reads your packaging guide. A Brick checking how price pressure was handled reads your negotiation policy. Because each Brick has a focused, bounded knowledge scope, the checks don't interfere with each other. Adding more Bricks to cover more facets of an area doesn't dilute the accuracy of the existing ones.

This is what makes it possible to go both broad (multiple assessment areas covered in one Kit) and deep (multiple grounded Bricks per area) without the accuracy tradeoff that comes from packing everything into a single prompt. The output is consistent structured JSON across the full Kit: one named field per Brick, each with a typed value, confidence, reason, and supporting evidence from the transcript. Because Kits are versioned in the app, scoring behaviour doesn't change when the underlying model updates. When your standards change, you update the document. The scoring reflects your current reality without a prompt rewrite. See the knowledge grounding documentation for the full setup.
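
As a sketch of that output shape, a two-Brick result might look like the following. The field names and contents are invented here; your Kit defines the actual schema, but each Brick yields one named field with a typed value, confidence, reason, and evidence.

```python
# Hypothetical two-Brick result. Field names and contents are invented;
# the shape follows the description above: typed value, confidence,
# reason, and supporting evidence from the transcript.
kit_result = {
    "price_accuracy": {
        "value": False,
        "confidence": 0.91,
        "reason": "Rep quoted $480/mo; the rate card lists $540/mo for this tier.",
        "evidence": "12:04 'that comes out to about four eighty a month'",
    },
    "icp_match": {
        "value": False,
        "confidence": 0.87,
        "reason": "Buyer has 12 seats; the ICP rules require 50 or more.",
        "evidence": "03:41 'we're a team of twelve'",
    },
}
```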

Hand-sketched Semarize grounding diagram showing pricing, ICP, and competitor Bricks each connected to a focused document, such as a rate card, ICP rules, or a battlecard, before returning structured JSON.
Focused context keeps each Brick checking the document that actually defines the answer.

Semarize grounds every Brick in a document with a defined purpose, across as many facets of your evaluation as the assessment requires.

Start building →

Common questions

How do I know if my AI call scoring is theatre?

Run five calls where you know something specific went wrong: a price was misstated, a buyer didn't meet your ICP criteria, a required disclosure was skipped, a competitor claim was inaccurate. If the scores don't flag those specific issues, the evaluation is operating on inference rather than checking against your actual rules. A second test: change the prompt language and see if scores shift meaningfully on the same calls. If they do, the evaluation is interpreting your instructions rather than checking facts against documents. Both tests surface the same problem: the evaluation has no access to what your company has actually defined.
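
A sketch of that first test as a script, assuming a placeholder score_call function wrapping whatever scoring pipeline you use, boolean fields, and illustrative call IDs and field names:

```python
# Seeded-failure test: score calls where you already know what went wrong
# and check whether the relevant field flags it. `score_call` is a
# placeholder for your scoring pipeline; names are illustrative.
SEEDED_FAILURES = {
    "call_091": "price_accuracy",       # rep misstated the tier price
    "call_114": "icp_match",            # buyer outside ICP criteria
    "call_127": "disclosure_complete",  # required disclosure skipped
    "call_133": "competitor_accuracy",  # inaccurate competitor claim
}

def theatre_test(score_call):
    for call_id, field in SEEDED_FAILURES.items():
        result = score_call(call_id)
        caught = result[field]["value"] is False  # False = failure flagged
        print(f"{call_id}: {field} {'caught' if caught else 'MISSED'}")
```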

How do we decide what documents to attach and what each one is for?

Work backwards from the questions you want to answer. For each evaluable question in your scorecard, ask: what document would a human reviewer need to consult to answer this correctly? That document is the grounding source for that Brick. A question about whether the price was accurate needs a rate card. A question about whether the value case was made in the right terms needs a value framework. A question about scope needs a packaging guide. The discipline is keeping each document bounded in purpose: a document that tries to cover too many questions gives the Brick too broad a scope to check against precisely. One question, one document, one Brick is the pattern that produces auditable results.

What does score drift look like in practice?

Take ten calls that have already been scored and rewrite the evaluation prompt without changing the underlying criteria: rephrase the instructions, adjust the rubric language, change the output format request. Run the same calls again. If scores shift meaningfully on individual calls, or if boolean fields change for the same evidence, the evaluation is responding to phrasing rather than facts. Score drift is the clearest indicator that you don't have a knowledge layer. A grounded evaluation produces the same result regardless of how the question is phrased, because the answer is in the document, not in the interpretation of the instruction.
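
As a sketch, assuming the same placeholder score_call pipeline accepts alternative prompt phrasings:

```python
# Drift test: score the same calls under two phrasings of the same criteria
# and count how many field values flip. A grounded evaluation should show
# near-zero flips; `score_call` and the prompt handling are placeholders.
def drift_report(call_ids, score_call, phrasing_a, phrasing_b):
    flips, total = 0, 0
    for call_id in call_ids:
        a = score_call(call_id, prompt=phrasing_a)
        b = score_call(call_id, prompt=phrasing_b)
        for field in a:
            total += 1
            flips += a[field]["value"] != b[field]["value"]
    print(f"{flips} of {total} field values changed with phrasing alone")
```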

How do we measure whether grounded scoring predicts deal outcomes?

Take a sample of closed won and closed lost deals and run the grounded Kit against their discovery and qualification calls. Check whether specific grounded fields differ between the two groups: do lost deals show more failures on particular facets of key commercial dimensions? If the distributions differ and the effect size is meaningful, those fields have predictive validity for your motion. If they don't differ, the fields may be measuring the right things but at the wrong stage, or the grounding documents need tightening. Start with the fields closest to close: accurate commercial handling and ICP match are typically the most predictive; methodology gaps in earlier stages take longer to show up in outcomes.
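
One way to run that comparison, sketched with scipy's Fisher exact test on a single boolean field; the field name is illustrative and the inputs are assumed to be lists of Kit results shaped as in the earlier example:

```python
# Compare a grounded boolean field across closed-won and closed-lost deals
# using a Fisher exact test on the failure counts. Field name is illustrative.
from scipy.stats import fisher_exact

def field_vs_outcome(won_results, lost_results, field="price_accuracy"):
    won_fail = sum(r[field]["value"] is False for r in won_results)
    lost_fail = sum(r[field]["value"] is False for r in lost_results)
    table = [[won_fail, len(won_results) - won_fail],
             [lost_fail, len(lost_results) - lost_fail]]
    odds_ratio, p_value = fisher_exact(table)
    print(f"{field}: fail rate won {won_fail}/{len(won_results)}, "
          f"lost {lost_fail}/{len(lost_results)}, p={p_value:.3f}")
```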

Do we need a knowledge layer for every score, or only for pass/fail decisions?

Ground any score that directly determines a coaching action, CRM field update, or compliance check. Directional scores (discovery depth as a general signal, engagement level) can run ungrounded for early detection purposes because the cost of an imprecise result is lower when used for trend analysis rather than decisions. But any field that feeds a decision (accuracy of a commercial claim for CRM enrichment, ICP match for stage advancement, disclosure completion for a compliance audit) should be grounded in the document that defines the correct answer. The cost of a wrong signal is highest where decisions depend on it.
