Bricks and Kits: the mechanism for stable conversation evaluation

Alex Handsaker · 6 min read

Most businesses using AI for call review are doing it as a one-call-at-a-time exercise: a manager opens a recording, drops the transcript into an AI tool, asks how it went, gets a review back. It's useful. It's also just sampling - you only get insight on the calls someone decided to open, and the calls no one opened tell you nothing.

The shift that changes this is moving from reviewing calls individually to monitoring them at scale - running the same evaluation against every call automatically and getting signal back that tells you which ones need attention without anyone having to open them first. A rep whose discovery quality has dropped. A cluster of calls where the same objection is landing unaddressed. A compliance gap appearing in a specific segment. These patterns are invisible when you're reviewing calls one at a time; they only emerge when you can see across all of them.

That shift - from individual call review to monitoring at scale - is what a structured evaluation schema enables. And the reason most teams struggle to make it is that ad-hoc AI review doesn't produce consistent enough output to compare across calls, reps, or time: if your evaluation scores keep changing between runs and the conversations haven't changed much, the problem sits in the schema, not the AI.

What an evaluation schema actually is

An evaluation schema is the repeatable set of typed questions and scoring logic applied to a conversation - consistently, every time. A summary instruction or a prompt blob isn't a schema; a schema is something you can version, test, and execute the same way across every call, every rep, every week. Most systems don't care about intent - they care about output format, and if evaluation output is free-form text, you can't reliably compare results across time or separate “the conversation changed” from “the evaluation logic changed.”
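
To make "typed questions" concrete, here's a minimal sketch of what declared output shapes could look like. The type names are illustrative assumptions, not Semarize's actual API; the point is that each question's answer shape is declared up front as data you can version and test, rather than prose you keep editing.

```typescript
// Hypothetical sketch, not Semarize's API: a schema declares each
// question's answer shape before any call is evaluated.

type BrickOutput =
  | { kind: "boolean"; value: boolean }   // "Was a next step agreed?"
  | { kind: "score"; value: number }      // "How specific was the pain?"
  | { kind: "list"; value: string[] }     // "What competitors were mentioned?"
  | { kind: "category"; value: string };  // "What stage is this call?"

// A schema is versionable precisely because it is data, not a prompt blob.
interface EvaluationQuestion {
  id: string;                       // stable identifier you can report against
  question: string;                 // the natural-language question the model answers
  outputKind: BrickOutput["kind"];  // the declared answer shape
}
```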

Figure: A schema makes conversation evaluation comparable across calls, reps, and time.

Why prompt-based evaluation drifts

When evaluation logic lives in prompt text, every edit is a new experiment: small wording changes shift what the model emphasises, infers, and ignores, and teams end up iterating on prompts looking for better scores when what's actually changing is the rubric.

The consequence shows up in coaching. Teams buy conversation intelligence to improve rep performance, then build coaching programmes around what the model said this week - which isn't quite what it said last week. When the evaluation logic drifts, you lose the ability to tell whether coaching improved anything or whether the rubric moved. Most AI scorecards compound this by measuring rep behaviour and script compliance rather than what the buyer actually understood, and those two things look similar from a prompt perspective but measure completely different outcomes.

What a Brick is

A Brick is a single evaluation unit: one question about a conversation, with a defined output type. “Was a next step agreed with a specific owner and date?” - yes/no. “How specific was the pain the buyer described?” - score. “What competitors were mentioned?” - list. “What stage of the buying process is this call?” - category.

Every Brick also returns a confidence score and evidence spans - the exact quotes from the transcript that support the answer, so every output is traceable back to the source. Because each Brick evaluates exactly one thing and returns exactly one output type, the shape of the answer never changes between runs; Salesforce field mappings, automation triggers, and reporting queries built on top don't break when models update.
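
As a rough sketch, a single Brick's result might look like the following. The field names are assumptions for illustration, not Semarize's documented response format; the post specifies only that each result carries a typed answer, a confidence score, and evidence spans that trace back to the transcript.

```typescript
// Hypothetical result shape for one Brick. Field names are illustrative
// assumptions; the documented properties are a typed answer, a confidence
// score, and evidence spans pointing back at the transcript.

interface EvidenceSpan {
  quote: string;      // exact transcript text supporting the answer
  startChar: number;  // transcript offsets (an assumed convention)
  endChar: number;
}

interface BrickResult {
  brickId: string;                               // e.g. "next_step_agreed" (hypothetical)
  answer: boolean | number | string | string[];  // matches the Brick's declared type
  confidence: number;                            // 0..1
  evidence: EvidenceSpan[];
}

// Example: a yes/no Brick answered with traceable evidence.
const example: BrickResult = {
  brickId: "next_step_agreed",
  answer: true,
  confidence: 0.92,
  evidence: [
    {
      quote: "Let's lock in Thursday at 10 with Dana from procurement.",
      startChar: 4812,
      endChar: 4868,
    },
  ],
};
```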

Figure: A Brick is one question, one output type, and one traceable evidence standard.

What a Kit is

A Kit is a named collection of Bricks designed to evaluate a specific type of conversation for a specific purpose. A discovery quality Kit might contain eight Bricks covering pain specificity, stakeholder identification, budget signals, timeline mentions, and next step commitment. A deal risk Kit covers forecast signals, competitor mentions, and procurement blockers. A coaching Kit covers framework adherence, question quality, and buyer understanding signals.

The Kit is what you reference in the API call: send a transcript with a Kit ID, and Semarize runs all the Bricks in that Kit and returns a single structured JSON response. Kits are reusable - once defined, a discovery quality Kit runs the same way against every discovery call, for every rep, every segment, every quarter. Criteria changes are explicit version updates, not silent behavioural shifts.
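
A minimal sketch of that contract, assuming a hypothetical endpoint and payload shape - the post specifies only transcript plus Kit ID in, one structured JSON response out:

```typescript
// Hypothetical API call. The endpoint path, header names, and payload
// fields are assumptions for illustration; only the transcript-in,
// structured-JSON-out contract comes from the post.

async function evaluateCall(transcript: string, kitId: string): Promise<unknown> {
  const res = await fetch("https://api.semarize.example/v1/evaluations", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.SEMARIZE_API_KEY}`, // assumed auth scheme
    },
    body: JSON.stringify({ kit_id: kitId, transcript }),
  });
  if (!res.ok) throw new Error(`Evaluation failed: ${res.status}`);
  return res.json(); // one structured response: every Brick's typed result
}
```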

Knowledge grounding in Kits

Kits can be attached to a knowledge base - a collection of documents that provide context for evaluation: pricing sheets, qualification playbooks, competitive battle cards, ICP definitions. When a Kit has a knowledge base attached, Bricks evaluate against your documents rather than against what the model assumes is generally true.

“Was the pricing quoted correctly?” can only return a reliable answer if the AI knows what your pricing actually is. Attaching your rate card grounds the evaluation in your reality, not in a model inference about pricing in general.
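
As a sketch, attaching grounding documents might look something like this. The configuration shape is an assumption; what it illustrates is the knowledge base living alongside the Bricks it grounds:

```typescript
// Hypothetical Kit configuration with a knowledge base attached. The shape
// is an assumption - the point is that "Was the pricing quoted correctly?"
// is checked against your rate card, not the model's general knowledge.

const discoveryKit = {
  id: "discovery-quality-v3", // explicit version, not a silently edited prompt
  bricks: ["pain_specificity", "pricing_quoted_correctly", "next_step_agreed"],
  knowledgeBase: {
    documents: [
      "rate-card-2025.pdf",
      "qualification-playbook.md",
      "icp-definition.md",
    ],
  },
};
```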

Figure: Kits group Bricks, attach the relevant knowledge, and return structured data to the systems that need it.

What stable evaluation enables

The difference between schema-controlled evaluation and ad-hoc prompts shows up across every system the output touches. RevOps teams get CRM enrichment they can rely on: budget signals, timeline mentions, competitor flags, and next step evidence pushed automatically to Salesforce or HubSpot without anyone reading a transcript first. Coaching teams get data that's comparable over time - discovery depth by rep, objection patterns, next step conversion rates - all measured against the same Bricks week after week. QA and compliance teams get 100% call coverage with evaluation logic that doesn't shift when a model updates.

The output in every case is the same shape: structured data with consistent fields, confidence scores, and evidence. What changes is which Bricks are in the Kit and what the rest of the stack does with the result.
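
For illustration, a downstream mapping built on that stable shape might look like this - the CRM property names and Brick IDs are hypothetical, reusing the BrickResult shape sketched earlier. Because the fields never change between runs, a mapping like this is written once and left alone:

```typescript
// Illustrative only: how a CRM sync might consume the stable result shape.
// Brick IDs and CRM field names are hypothetical assumptions.

function toCrmFields(results: BrickResult[]) {
  const byId = new Map<string, BrickResult>(results.map((r) => [r.brickId, r]));
  return {
    budget_signal: byId.get("budget_mentioned")?.answer ?? null,
    competitor_flags: byId.get("competitors_mentioned")?.answer ?? [],
    next_step_evidence: byId.get("next_step_agreed")?.evidence[0]?.quote ?? null,
  };
}
```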

Common questions

How do Bricks and Kits differ from a prompt template?

A prompt template is free-form text that gets edited over time. A Brick is a typed question with a defined output structure - it produces the same shape of answer every time it runs, regardless of what's in the transcript. Kits bundle Bricks into reusable evaluation frameworks you version rather than rewrite.

How many Bricks should a Kit contain?

Kits work best when they're purpose-built. A discovery quality Kit with 6-10 Bricks gives sharper signal than a general “evaluate everything” Kit with 30. Start narrow, add Bricks when you identify gaps in what you're measuring.

Can I run multiple Kits against the same conversation?

Yes. A deal risk Kit, a coaching Kit, and a compliance Kit can all run against the same transcript independently. Each returns its own structured response; you decide which Kits to run based on conversation type and what you need from the analysis.

What happens if the conversation doesn't contain evidence for a Brick?

The Brick returns its best assessment with a low confidence score and empty evidence spans. A yes/no Brick might return false with confidence 0.3 and no evidence - which is itself useful information: the criterion wasn't addressed in the conversation.
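
Using the result shape sketched earlier (field names assumed), that case might look like:

```typescript
// Sketch of the low-confidence case described above, reusing the assumed
// BrickResult shape: no supporting quotes is itself a useful signal.

const notAddressed: BrickResult = {
  brickId: "budget_mentioned",
  answer: false,
  confidence: 0.3,
  evidence: [], // the criterion wasn't addressed in the conversation
};
```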

Semarize is a conversational intelligence API. Define your evaluation schema with Bricks and Kits, send conversations, get structured data back.

Start building →
