Semarize

Databricks — Lakehouse Architecture for Conversation Intelligence

Databricks combines the flexibility of data lakes with the reliability of Delta Lake. Land raw conversation data in bronze, transform into typed signals in silver, and serve analytics from gold — all with ACID transactions, schema enforcement, and Unity Catalog governance.

What you'll learn

  • When Databricks is the right platform for conversation data — and what makes it distinct from Snowflake or BigQuery
  • The three architectural decisions that determine your pipeline reliability and cost
  • Medallion schema design with bronze, silver, and gold Delta tables
  • Auto Loader ingestion, MERGE INTO dedup, and structured streaming patterns
  • Operational playbook — OPTIMIZE, Z-ORDER, Unity Catalog ACLs, cluster vs warehouse costs
  • Analysis patterns with MLflow, Databricks SQL dashboards, and Unity Catalog lineage

Fit check

When to Choose Databricks

Databricks is not just a warehouse — it is a lakehouse. That distinction matters when your conversation data needs to live alongside ML training data, product analytics, and other data assets. Choose Databricks when:

You're building a lakehouse

Conversation data sits alongside ML training data, product analytics, and other data assets in a unified architecture. No separate warehouse for BI and separate lake for data science - one platform, one governance layer, one set of tables.

Delta Lake gives you reliability on cloud storage

ACID transactions, time travel, and schema enforcement on top of S3, ADLS, or GCS. Your data lake becomes reliable - no more corrupt reads from concurrent writes, no more "oops, we overwrote production" without a recovery path.

Auto Loader handles continuous ingestion

Auto Loader watches a cloud storage path and ingests new files automatically with exactly-once guarantees and checkpoint tracking. No custom scheduling code, no file-tracking tables to maintain, no missed files when your cron job fails at 3am.

Unity Catalog provides centralized governance

Row-level security, column masking, lineage tracking, and audit logs across all your data. Define permissions once in Unity Catalog and they apply to notebooks, SQL warehouses, and jobs uniformly - not per-tool ACLs that drift out of sync.

You have ML/AI workloads

Databricks notebooks, MLflow experiment tracking, and Feature Store let you build signal models directly on the same data. Train a deal-outcome predictor on silver-layer features, log experiments in MLflow, register the best model, and serve predictions - all without moving data to a separate ML platform.

Multi-cloud deployment matters

Databricks runs on AWS, Azure, and GCP. Your lakehouse architecture and Delta tables are portable across clouds. Useful for organizations with multi-cloud mandates or teams that want to avoid lock-in to a single cloud provider's warehouse.

Recommended default ingestion pattern: Auto Loader (bronze) → MERGE INTO (silver) → views / materialized tables (gold).
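
The bronze leg of that pattern can be sketched in PySpark. This is a sketch under assumptions, not a definitive implementation: the bucket paths are illustrative, and landing the payload unparsed via the text format is one way to match the raw-STRING bronze contract described below.

```python
from pyspark.sql import functions as F

# Sketch: land each file's raw JSON as a string, matching the
# bronze.conversations_raw contract (no parsing, no transforms).
# All paths are illustrative assumptions; `spark` is the session
# Databricks provides in notebooks and jobs.
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "text")  # keep the payload unparsed
    .option("cloudFiles.schemaLocation", "s3://example-bucket/_schemas/conversations")
    .load("s3://example-bucket/landing/conversations/")
    .select(
        F.col("value").alias("raw_payload"),
        F.col("_metadata.file_path").alias("source_file"),
        F.current_timestamp().alias("ingested_at"),
        F.col("_metadata"),
    )
    .writeStream.format("delta")
    .option("checkpointLocation", "s3://example-bucket/_checkpoints/bronze_conversations")
    .trigger(availableNow=True)  # batch-like runs with streaming guarantees
    .toTable("bronze.conversations_raw")
)
```

The checkpoint location is what gives Auto Loader its exactly-once behavior: keep it stable across runs, and never share one checkpoint between two streams.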

Architecture

Three Lakehouse Decisions to Lock In Early

Most lakehouse pipeline failures trace back to getting one of these three things wrong early. Nail them upfront and your medallion architecture will scale cleanly as data volume and team size grow.

1

Schema evolution strategy

Auto Loader can add new columns automatically (mergeSchema=true). Delta Lake supports additive changes without rewriting data. Decide upfront: do you allow automatic schema evolution (convenient but can introduce unexpected columns from upstream API changes) or require explicit schema changes (safer, more work)? A middle ground is schemaEvolutionMode="rescue" - unexpected fields route to a _rescued_data column instead of failing the pipeline or silently adding columns. Monitor rescued data to catch upstream changes before they affect your silver layer.

2

Medallion layer discipline

Bronze = raw JSON/Parquet as ingested, append-only, no transforms. Silver = cleaned, typed, deduplicated conversation records and signals with proper column types and constraints. Gold = aggregated views and materialized tables for BI consumption. The discipline is in keeping each layer's contract clear: don't do analytics on bronze (it has no schema guarantees), don't put raw data in gold (it breaks dashboards when upstream changes). Silver is where your MERGE INTO deduplication, type casting, and data quality assertions live.

3

Unity Catalog governance from day one

Enable Unity Catalog before you have data, not after. Define your catalog → schema → table hierarchy, set ownership and permissions, enable audit logging. Retrofitting governance after data is flowing is significantly harder - you end up with hive_metastore tables that lack lineage tracking, ad-hoc permissions that don't audit, and no column-level access control. Start with Unity Catalog and every table, view, and function gets lineage, access control, and audit logging automatically.

Schema design

Schema Design

Bronze, silver, and gold Delta tables following medallion architecture. Each layer has a clear contract: bronze is raw and append-only, silver is cleaned and typed, gold is aggregated for consumption.

bronze.conversations_raw

Raw ingestion — append-only, no dedup

Columns

raw_payload (STRING) - Full JSON payload from the source platform. No parsing, no transforms.
source_file (STRING) - Path to the source file in cloud storage. Enables traceability from row back to origin.
ingested_at (TIMESTAMP) - When Auto Loader processed this file. Automatically populated.
_metadata (STRUCT) - Auto Loader metadata: file_path, file_name, file_size, file_modification_time. Added automatically.

silver.conversations

One row per call — cleaned, typed, deduplicated

Columns

call_id (STRING) - Primary key. Unique identifier from the source platform.
source_platform (STRING) - Where the call originated - gong, zoom, teams, etc.
call_timestamp (TIMESTAMP) - When the call started. Z-ORDER target for query performance.
duration_seconds (INT) - Call duration in seconds.
participants (ARRAY<STRUCT>) - Array of structs: name STRING, email STRING, role STRING.
deal_id (STRING) - Foreign key to CRM deal/opportunity.
account_id (STRING) - Foreign key to CRM account/company.
transcript_text (STRING) - Full transcript text.
metadata (MAP<STRING, STRING>) - Flexible key-value map for source-specific metadata.
ingested_at (TIMESTAMP) - When this row was written to silver.

silver.conversation_signals

One row per extracted signal

Columns

signal_id (STRING) - Primary key. Unique identifier for this signal row.
call_id (STRING) - Foreign key to silver.conversations.
signal_name (STRING) - Name of the extracted signal - overall_score, risk_flag, pain_point.
signal_value (STRING) - The extracted value. Cast to appropriate type in queries.
signal_type (STRING) - Type hint - score, boolean, category, extraction, count.
confidence (DOUBLE) - Confidence score from the evaluation (0.0 to 1.0).
scoring_version (STRING) - Version of the evaluation framework that produced this signal.
evaluated_at (TIMESTAMP) - When this signal was generated.
Gold layer: Views that join silver tables, aggregate signals by deal or rep, and pivot signal names into columns for BI consumption. Keep gold as views or materialized tables - no raw data, no business logic beyond aggregation.
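
A gold view over the two silver tables might look like the following sketch. The view name and the pivoted signal names are illustrative assumptions drawn from the schema above, not a fixed contract.

```sql
-- Sketch of a gold-layer view: pivot common signals into columns per call,
-- joined to deal context. View and signal names are illustrative.
CREATE OR REPLACE VIEW gold.deal_signal_summary AS
SELECT
  c.deal_id,
  c.call_id,
  c.call_timestamp,
  MAX(CASE WHEN s.signal_name = 'overall_score'
           THEN CAST(s.signal_value AS DOUBLE) END) AS overall_score,
  MAX(CASE WHEN s.signal_name = 'risk_flag'
           THEN CAST(s.signal_value AS BOOLEAN) END) AS risk_flag,
  AVG(s.confidence) AS avg_confidence
FROM silver.conversations c
JOIN silver.conversation_signals s USING (call_id)
GROUP BY c.deal_id, c.call_id, c.call_timestamp;
```
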
For Semarize-specific modeling, versioning, and grounding lineage best practices, see the canonical modeling guide.

Data loading

Loading Strategy

Auto Loader into bronze, MERGE INTO for silver dedup. This is the pattern most teams should start with and the one that scales without requiring a rewrite as data volume grows.

Recommended: Auto Loader → MERGE

Continuous, exactly-once, schema-aware

1

Configure Auto Loader on your landing zone

Point Auto Loader at your cloud storage path (S3, ADLS, GCS) where upstream systems drop conversation files. Auto Loader processes new files automatically, tracks checkpoints so it never reprocesses a file, and handles schema inference on the first run.

2

Land raw payloads in bronze

Write the full JSON payload as a STRING column to bronze.conversations_raw. No parsing, no transforms - bronze is your raw archive. Auto Loader adds _metadata automatically so you can trace any row back to its source file.

3

MERGE INTO silver for dedup and updates

Parse bronze JSON, cast to typed columns, and MERGE INTO silver.conversations on call_id. For signals, MERGE INTO silver.conversation_signals on signal_id. MERGE handles both inserts (new records) and updates (re-evaluated signals) in a single atomic operation.

4

Z-ORDER and OPTIMIZE on silver

Run OPTIMIZE on silver tables to compact small files into ~1GB targets. Z-ORDER on call_id and call_timestamp to co-locate related data for fast predicate pushdown. Auto Loader can create many small files - OPTIMIZE is not optional for read performance.
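
Steps 3 and 4 can be sketched in Databricks SQL. Here `bronze_parsed` is a hypothetical temp view over the parsed, typed bronze batch; the signals merge follows the same shape keyed on signal_id.

```sql
-- Step 3 sketch: upsert parsed bronze rows into silver, keyed on call_id.
-- `bronze_parsed` is an assumed temp view over the parsed batch.
MERGE INTO silver.conversations AS t
USING bronze_parsed AS s
  ON t.call_id = s.call_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Step 4 sketch: compact small files and co-locate rows for
-- fast predicate pushdown on the common filter columns.
OPTIMIZE silver.conversations
ZORDER BY (call_id, call_timestamp);
```
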

Structured Streaming for continuous processing

readStream + writeStream with Delta Lake

Use readStream + writeStream for continuous bronze-to-silver transformation without scheduling - new data flows through automatically as it arrives

Trigger.availableNow() gives you micro-batch semantics that act like a cron job but with streaming guarantees - useful when you want batch-like scheduling with exactly-once processing

Delta Lake as both source and sink means full ACID guarantees end-to-end - no partial writes, no phantom reads from concurrent queries
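
A bronze-to-silver streaming transform along these lines might look like the following sketch. The payload schema and checkpoint path are assumptions for illustration, not the actual upstream field list.

```python
from pyspark.sql import functions as F, types as T

# Assumed payload shape - adjust to the fields your source actually emits.
payload_schema = T.StructType([
    T.StructField("call_id", T.StringType()),
    T.StructField("source_platform", T.StringType()),
    T.StructField("call_timestamp", T.TimestampType()),
    T.StructField("duration_seconds", T.IntegerType()),
    T.StructField("transcript_text", T.StringType()),
])

(
    spark.readStream.table("bronze.conversations_raw")
    .select(F.from_json("raw_payload", payload_schema).alias("p"), "ingested_at")
    .select("p.*", "ingested_at")
    .writeStream.format("delta")
    .option("checkpointLocation", "s3://example-bucket/_checkpoints/silver_conversations")
    .trigger(availableNow=True)  # cron-like runs, streaming guarantees
    .toTable("silver.conversations")
)
```

To apply the MERGE INTO dedup rather than a blind append, wrap the merge in writeStream.foreachBatch so each micro-batch upserts atomically.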

Also worth knowing: Delta Live Tables (DLT) provides fully managed pipelines with declarative quality expectations that can warn on, drop, or fail on bad records - useful when you want Databricks to handle orchestration, retries, and monitoring. Databricks Workflows orchestrates multi-step pipelines (notebooks, JARs, Python scripts) with dependency management and alerting.

Operations

Operational Playbook

Databricks-specific operational concerns that matter for conversation data pipelines. These are the things that separate a working notebook from a production system that does not surprise you with a runaway cluster bill.

Schema evolution monitoring

When Auto Loader detects new columns, it logs schema changes in the Spark UI event log. Unexpected columns may indicate upstream API changes that need investigation. Use schemaEvolutionMode="rescue" to route unexpected fields to a _rescued_data column instead of failing the pipeline or silently adding columns. Monitor the _rescued_data column - if it starts filling up, your upstream source has changed and your silver-layer transforms may need updating.
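
A monitoring query for rescued data might look like this sketch; it assumes the bronze table carries the default _rescued_data column and the ingested_at timestamp from the schema above.

```sql
-- Sketch: surface recently rescued fields so upstream schema drift is
-- visible before it reaches silver. Column names follow the bronze schema.
SELECT source_file, _rescued_data, ingested_at
FROM bronze.conversations_raw
WHERE _rescued_data IS NOT NULL
  AND ingested_at > current_timestamp() - INTERVAL 1 DAY
LIMIT 100;
```

Wiring this into a Databricks SQL alert turns upstream schema drift into a notification instead of a silent silver-layer gap.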

File compaction (OPTIMIZE)

Auto Loader creates files matching the input size, which can be small - thousands of tiny Parquet files from frequent webhook deliveries. Run OPTIMIZE on silver tables daily to compact into ~1GB files. This dramatically improves read performance because Delta Lake reads fewer, larger files instead of many small ones. Z-ORDER on frequently-filtered columns (call_id, call_timestamp) during OPTIMIZE to co-locate related data.

Unity Catalog ACLs

Set table-level GRANT permissions: data engineers get ALL PRIVILEGES on bronze and silver schemas, analysts get SELECT on gold only. Use dynamic views for row-level security based on user attributes - e.g. reps only see their own conversations, managers see their team. Column masking policies redact transcript_text and participant emails for roles that should not see PII.
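
The permission model above can be sketched in Unity Catalog SQL. Principal names, schema names, and the secure-view name are illustrative assumptions.

```sql
-- Sketch: layer-scoped grants. Group names are illustrative.
GRANT ALL PRIVILEGES ON SCHEMA bronze TO `data-engineers`;
GRANT ALL PRIVILEGES ON SCHEMA silver TO `data-engineers`;
GRANT SELECT ON SCHEMA gold TO `analysts`;

-- Row-level security via a dynamic view: managers see everything,
-- reps see only calls they participated in.
CREATE OR REPLACE VIEW gold.conversations_secure AS
SELECT * FROM silver.conversations
WHERE is_account_group_member('managers')
   OR array_contains(participants.email, current_user());
```

Grant analysts SELECT on the secure view rather than the underlying silver table, so the row filter cannot be bypassed.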

Cluster vs SQL warehouse costs

All-purpose clusters charge per DBU-hour whether busy or idle - do not leave them running overnight. Use job clusters for production ETL: they spin up, process, and terminate automatically. Use SQL warehouses for BI queries and ad-hoc analysis: they auto-start when a query arrives and auto-stop after an idle period. The most common cost mistake is running all-purpose clusters for workloads that should be on job clusters or SQL warehouses.

Delta Lake time travel

Delta keeps 30 days of transaction log history by default (configurable via delta.logRetentionDuration); note that time travel also requires the underlying data files, which VACUUM removes after its own retention window (7 days by default). Use DESCRIBE HISTORY to see every operation on a table - who changed what, when, and how many rows were affected. Restore to a previous version with RESTORE TABLE ... TO VERSION AS OF. Useful for debugging bad data loads, rolling back schema changes, and auditing signal re-evaluations.
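
The rollback workflow is typically three statements; the version number here is illustrative and comes from the history output.

```sql
-- Sketch: inspect history, verify the pre-incident state, then roll back.
DESCRIBE HISTORY silver.conversation_signals;

-- Query the table as it was before the bad batch landed (version assumed).
SELECT COUNT(*) FROM silver.conversation_signals VERSION AS OF 41;

-- Roll back to that version.
RESTORE TABLE silver.conversation_signals TO VERSION AS OF 41;
```
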

Medallion layer validation

Add expectations at the silver boundary to catch data quality issues before they propagate to gold. Check for null call_ids, future timestamps, duplicate signal_ids, and confidence values outside the 0-1 range. In DLT pipelines, use EXPECT constraints with ON VIOLATION DROP ROW or ON VIOLATION FAIL UPDATE. In custom pipelines, add explicit checks in your bronze-to-silver transform and log violations to a quarantine table.
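
The shape of those checks can be sketched in plain Python (outside Spark) to show the validation logic; field names follow the silver schema above, and the sample rows are hypothetical.

```python
from datetime import datetime, timezone

def validate_signal(row, seen_ids):
    """Return violation names for one silver-bound signal row (dict).
    Checks mirror the silver-boundary expectations described above."""
    violations = []
    if not row.get("call_id"):
        violations.append("null_call_id")
    ts = row.get("evaluated_at")
    if ts is not None and ts > datetime.now(timezone.utc):
        violations.append("future_timestamp")
    if row.get("signal_id") in seen_ids:
        violations.append("duplicate_signal_id")
    conf = row.get("confidence")
    if conf is None or not (0.0 <= conf <= 1.0):
        violations.append("confidence_out_of_range")
    seen_ids.add(row.get("signal_id"))
    return violations

seen = set()
good = {"call_id": "c1", "signal_id": "s1", "confidence": 0.9,
        "evaluated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)}
bad = {"call_id": None, "signal_id": "s1", "confidence": 1.7,
       "evaluated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)}

print(validate_signal(good, seen))  # []
print(validate_signal(bad, seen))   # flags null id, duplicate, bad confidence
```

Rows with a non-empty violation list go to the quarantine table; clean rows proceed to the MERGE.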

BI patterns

Analysis Examples

Three patterns that leverage Databricks-native tools. Each one connects conversation signals to business outcomes using capabilities that only exist in a lakehouse platform.

Signal to deal outcome model (MLflow)

Train a scikit-learn or XGBoost model in a Databricks notebook that predicts deal outcomes from silver-layer signal features - average confidence, signal count by type, risk flag frequency. Log experiments in MLflow with automatic parameter and metric tracking. Register the best model in the Unity Catalog model registry and serve predictions via a SQL function that analysts can call directly in their gold-layer views. Model training and serving stay in the same platform - no data export, no separate ML infrastructure.
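
The feature rollup feeding such a model can be sketched in plain Python; on Databricks the same aggregation would run as a Spark query over silver tables. The feature names and the string-encoded risk_flag value are assumptions for illustration.

```python
from collections import Counter, defaultdict

def deal_features(signal_rows):
    """Roll per-signal rows up into per-deal model features: average
    confidence, signal counts by type, and risk-flag frequency.
    Row shape follows silver.conversation_signals with deal_id joined in."""
    by_deal = defaultdict(list)
    for r in signal_rows:
        by_deal[r["deal_id"]].append(r)
    features = {}
    for deal_id, rows in by_deal.items():
        counts = Counter(r["signal_type"] for r in rows)
        risk = sum(1 for r in rows
                   if r["signal_name"] == "risk_flag" and r["signal_value"] == "true")
        features[deal_id] = {
            "avg_confidence": sum(r["confidence"] for r in rows) / len(rows),
            "risk_flag_count": risk,
            **{f"n_{t}": c for t, c in counts.items()},
        }
    return features

rows = [
    {"deal_id": "d1", "signal_name": "overall_score", "signal_type": "score",
     "signal_value": "8", "confidence": 1.0},
    {"deal_id": "d1", "signal_name": "risk_flag", "signal_type": "boolean",
     "signal_value": "true", "confidence": 0.5},
]
print(deal_features(rows)["d1"]["avg_confidence"])  # 0.75
```

The resulting feature dicts map directly onto a training DataFrame for scikit-learn or XGBoost, with MLflow autologging capturing parameters and metrics.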

Rep cohort scoring trend (Databricks SQL dashboard)

Use Databricks SQL to create a dashboard on gold-layer views. Plot average signal scores by rep and month, filter by scoring_version to compare framework iterations side by side. SQL warehouses auto-start when the dashboard is accessed and auto-stop after the idle period - no standing cluster cost for BI. Share with stakeholders via workspace permissions. Built-in alerting can notify when a rep's score drops below a threshold.

Signal lineage audit (Unity Catalog)

Unity Catalog automatically tracks column-level lineage across notebooks, jobs, and SQL queries. Trace a signal value from gold dashboard view → silver conversation_signals table → bronze raw file → source system. Useful for debugging unexpected scores ("why did this deal get flagged as high risk?") and for compliance audits ("which data sources contributed to this decision?"). No manual lineage tracking required - Unity Catalog captures it from the compute graph.

Structured signals

Medallion Layers and MLflow for Signal Intelligence

The lakehouse architecture gives you capabilities that flat warehouses don't — schema evolution, time travel, and native ML integration change how you store and use conversation signals.

Bronze tables accept raw JSON payloads via Auto Loader with schema evolution enabled. When a new signal field appears in the API response, Auto Loader merges the new column into the bronze table automatically — no schema migration needed, no pipeline breakage. Set schemaEvolutionMode to "addNewColumns" and monitor the _rescued_data column for unexpected format changes.

MERGE INTO in the silver layer handles upserts with ACID guarantees. Use MERGE INTO silver.conversation_signals AS target USING bronze_batch AS source ON target.call_id = source.call_id AND target.signal_name = source.signal_name to deduplicate while preserving the full Delta transaction log. Each merge creates a new version - you can DESCRIBE HISTORY and RESTORE to any previous state if a bad batch lands.

Z-ORDER on call_id + call_timestamp in conversation_signals co-locates related data within Parquet files. This means a query like "all signals for deal X in Q4" skips entire files rather than scanning the full table. Run OPTIMIZE with Z-ORDER weekly on silver tables - the compaction and reordering typically reduce query time by 60-80% on tables over 10M rows.

MLflow integration lets you train signal-to-outcome models directly on your silver tables. Register a model that predicts deal outcome from signal features, serve it from a Model Serving endpoint queried via the ai_query SQL function, and materialize predictions into a gold-layer table that your Databricks SQL dashboard reads. The entire pipeline stays inside the lakehouse.

Modeling guide: For the full schema, versioning strategy, and grounding lineage patterns, see Modeling Semarize Outputs.
CRM sync: For pushing signals back to your CRM, see the HubSpot guide or Salesforce guide.
