Databricks — Lakehouse Architecture for Conversation Intelligence
Databricks combines the flexibility of data lakes with the reliability of Delta Lake. Land raw conversation data in bronze, transform into typed signals in silver, and serve analytics from gold — all with ACID transactions, schema enforcement, and Unity Catalog governance.
What you'll learn
- When Databricks is the right platform for conversation data — and what makes it distinct from Snowflake or BigQuery
- The three architectural decisions that determine your pipeline reliability and cost
- Medallion schema design with bronze, silver, and gold Delta tables
- Auto Loader ingestion, MERGE INTO dedup, and structured streaming patterns
- Operational playbook — OPTIMIZE, Z-ORDER, Unity Catalog ACLs, cluster vs warehouse costs
- Analysis patterns with MLflow, Databricks SQL dashboards, and Unity Catalog lineage
Fit check
When to Choose Databricks
Databricks is not just a warehouse — it is a lakehouse. That distinction matters when your conversation data needs to live alongside ML training data, product analytics, and other data assets. Choose Databricks when:
You're building a lakehouse
Conversation data sits alongside ML training data, product analytics, and other data assets in a unified architecture. No separate warehouse for BI and separate lake for data science - one platform, one governance layer, one set of tables.
Delta Lake gives you reliability on cloud storage
ACID transactions, time travel, and schema enforcement on top of S3, ADLS, or GCS. Your data lake becomes reliable - no more corrupt reads from concurrent writes, no more "oops, we overwrote production" without a recovery path.
Auto Loader handles continuous ingestion
Auto Loader watches a cloud storage path and ingests new files automatically with exactly-once guarantees and checkpoint tracking. No custom scheduling code, no file-tracking tables to maintain, no missed files when your cron job fails at 3am.
Unity Catalog provides centralized governance
Row-level security, column masking, lineage tracking, and audit logs across all your data. Define permissions once in Unity Catalog and they apply to notebooks, SQL warehouses, and jobs uniformly - not per-tool ACLs that drift out of sync.
You have ML/AI workloads
Databricks notebooks, MLflow experiment tracking, and Feature Store let you build signal models directly on the same data. Train a deal-outcome predictor on silver-layer features, log experiments in MLflow, register the best model, and serve predictions - all without moving data to a separate ML platform.
Multi-cloud deployment matters
Databricks runs on AWS, Azure, and GCP. Your lakehouse architecture and Delta tables are portable across clouds. Useful for organizations with multi-cloud mandates or teams that want to avoid lock-in to a single cloud provider's warehouse.
Architecture
Three Lakehouse Decisions to Lock In Early
Most lakehouse pipeline failures trace back to getting one of these three things wrong early. Nail them upfront and your medallion architecture will scale cleanly as data volume and team size grow.
Schema evolution strategy
Auto Loader can add new columns automatically (mergeSchema=true). Delta Lake supports additive changes without rewriting data. Decide upfront: do you allow automatic schema evolution (convenient but can introduce unexpected columns from upstream API changes) or require explicit schema changes (safer, more work)? A middle ground is schemaEvolutionMode="rescue" - unexpected fields route to a _rescued_data column instead of failing the pipeline or silently adding columns. Monitor rescued data to catch upstream changes before they affect your silver layer.
Medallion layer discipline
Bronze = raw JSON/Parquet as ingested, append-only, no transforms. Silver = cleaned, typed, deduplicated conversation records and signals with proper column types and constraints. Gold = aggregated views and materialized tables for BI consumption. The discipline is in keeping each layer's contract clear: don't do analytics on bronze (it has no schema guarantees), don't put raw data in gold (it breaks dashboards when upstream changes). Silver is where your MERGE INTO deduplication, type casting, and data quality assertions live.
Unity Catalog governance from day one
Enable Unity Catalog before you have data, not after. Define your catalog → schema → table hierarchy, set ownership and permissions, enable audit logging. Retrofitting governance after data is flowing is significantly harder - you end up with hive_metastore tables that lack lineage tracking, ad-hoc permissions that don't audit, and no column-level access control. Start with Unity Catalog and every table, view, and function gets lineage, access control, and audit logging automatically.
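As a sketch, that hierarchy can be created before any data flows; the catalog, schema, and group names here are illustrative, not prescribed:

```sql
-- Illustrative catalog → schema hierarchy; all names are assumptions.
CREATE CATALOG IF NOT EXISTS conversations;
CREATE SCHEMA IF NOT EXISTS conversations.bronze;
CREATE SCHEMA IF NOT EXISTS conversations.silver;
CREATE SCHEMA IF NOT EXISTS conversations.gold;

-- Ownership and coarse access set once at the top, inherited below.
ALTER CATALOG conversations OWNER TO `data-platform-team`;
GRANT USE CATALOG ON CATALOG conversations TO `analysts`;
```

Every table created under this catalog then gets lineage, audit logging, and access control from Unity Catalog automatically.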
Schema design
Schema Design
Bronze, silver, and gold Delta tables following medallion architecture. Each layer has a clear contract: bronze is raw and append-only, silver is cleaned and typed, gold is aggregated for consumption.
bronze.conversations_raw
Raw ingestion — append-only, no dedup
silver.conversations
One row per call — cleaned, typed, deduplicated
silver.conversation_signals
One row per extracted signal
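As a sketch, the three tables might be declared like this. Only call_id, call_timestamp, signal_id, signal_name, confidence, scoring_version, and transcript_text are named in this guide; every other column and type below is an assumption:

```sql
-- Illustrative DDL; column sets are assumptions, not a prescribed schema.
CREATE TABLE IF NOT EXISTS bronze.conversations_raw (
  raw_payload STRING,       -- full JSON payload, unparsed
  ingested_at TIMESTAMP,
  source_file STRING        -- from Auto Loader's _metadata
);

CREATE TABLE IF NOT EXISTS silver.conversations (
  call_id         STRING NOT NULL,
  call_timestamp  TIMESTAMP,
  transcript_text STRING,
  updated_at      TIMESTAMP
);

CREATE TABLE IF NOT EXISTS silver.conversation_signals (
  signal_id       STRING NOT NULL,
  call_id         STRING NOT NULL,
  signal_name     STRING,
  confidence      DOUBLE,   -- expected in [0, 1]
  scoring_version STRING
);
```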
Data loading
Loading Strategy
Auto Loader into bronze, MERGE INTO for silver dedup. This is the pattern most teams should start with and the one that scales without requiring a rewrite as data volume grows.
Recommended: Auto Loader → MERGE
Continuous, exactly-once, schema-aware
Configure Auto Loader on your landing zone
Point Auto Loader at your cloud storage path (S3, ADLS, GCS) where upstream systems drop conversation files. Auto Loader processes new files automatically, tracks checkpoints so it never reprocesses a file, and handles schema inference on the first run.
Land raw payloads in bronze
Write the full JSON payload as a STRING column to bronze.conversations_raw. No parsing, no transforms - bronze is your raw archive. Auto Loader adds _metadata automatically so you can trace any row back to its source file.
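In Databricks SQL, this landing step can be sketched as a streaming table over the landing path; the path, table, and column names are placeholders, and the same pattern is written with spark.readStream and the cloudFiles format in Python:

```sql
-- Sketch: continuous ingestion of raw JSON files into bronze.
-- 's3://your-landing-zone/conversations/' is a placeholder path.
CREATE OR REFRESH STREAMING TABLE bronze.conversations_raw AS
SELECT
  *,
  _metadata.file_path AS source_file  -- trace each row to its source file
FROM STREAM read_files(
  's3://your-landing-zone/conversations/',
  format => 'json'
);
```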
MERGE INTO silver for dedup and updates
Parse bronze JSON, cast to typed columns, and MERGE INTO silver.conversations on call_id. For signals, MERGE INTO silver.conversation_signals on signal_id. MERGE handles both inserts (new records) and updates (re-evaluated signals) in a single atomic operation.
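A minimal sketch of those merges; parsed_batch and parsed_signal_batch are assumed staging views built by parsing bronze JSON and casting to typed columns:

```sql
-- Upsert cleaned call records; new call_ids insert, existing ones update.
MERGE INTO silver.conversations AS t
USING parsed_batch AS s
  ON t.call_id = s.call_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Same pattern for signals, keyed on signal_id so re-evaluations update in place.
MERGE INTO silver.conversation_signals AS t
USING parsed_signal_batch AS s
  ON t.signal_id = s.signal_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Each MERGE is a single atomic Delta transaction, so readers never see a half-applied batch.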
Z-ORDER and OPTIMIZE on silver
Run OPTIMIZE on silver tables to compact small files into ~1GB targets. Z-ORDER on call_id and call_timestamp to co-locate related data for fast predicate pushdown. Auto Loader can create many small files - OPTIMIZE is not optional for read performance.
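In SQL this is one statement per table:

```sql
-- Compact small files toward ~1GB and co-locate rows for predicate pushdown.
OPTIMIZE silver.conversations
ZORDER BY (call_id, call_timestamp);

OPTIMIZE silver.conversation_signals
ZORDER BY (call_id, call_timestamp);
```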
Structured Streaming for continuous processing
readStream + writeStream with Delta Lake
Use readStream + writeStream for continuous bronze-to-silver transformation without scheduling - new data flows through automatically as it arrives
Trigger.availableNow() gives you micro-batch semantics that act like a cron job but with streaming guarantees - useful when you want batch-like scheduling with exactly-once processing
Delta Lake as both source and sink means full ACID guarantees end-to-end - no partial writes, no phantom reads from concurrent queries
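The same bronze-to-silver flow can be sketched declaratively in Delta Live Tables SQL as an alternative to the readStream/writeStream API; the raw_payload column and the JSON field names are assumptions:

```sql
-- DLT-style sketch: silver streams continuously from bronze as new rows land.
CREATE OR REFRESH STREAMING TABLE silver.conversations AS
SELECT
  raw_payload:call_id::STRING           AS call_id,
  raw_payload:call_timestamp::TIMESTAMP AS call_timestamp,
  raw_payload:transcript::STRING        AS transcript_text
FROM STREAM(bronze.conversations_raw);
```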
Operations
Operational Playbook
Databricks-specific operational concerns that matter for conversation data pipelines. These are the things that separate a working notebook from a production system that does not surprise you with a runaway cluster bill.
Schema evolution monitoring
When Auto Loader detects new columns, it logs schema changes in the Spark UI event log. Unexpected columns may indicate upstream API changes that need investigation. Use schemaEvolutionMode="rescue" to route unexpected fields to a _rescued_data column instead of failing the pipeline or silently adding columns. Monitor the _rescued_data column - if it starts filling up, your upstream source has changed and your silver-layer transforms may need updating.
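One way to watch for that drift is a scheduled rescued-data check; the bronze table and ingested_at column follow the assumed schema earlier in this guide:

```sql
-- Count recently rescued rows per day; a rising trend means the upstream
-- payload shape has changed and silver transforms may need updating.
SELECT
  date_trunc('DAY', ingested_at) AS day,
  count(*)                       AS rescued_rows
FROM bronze.conversations_raw
WHERE _rescued_data IS NOT NULL
  AND ingested_at >= current_date() - INTERVAL 7 DAYS
GROUP BY 1
ORDER BY 1;
```

Wire this into a Databricks SQL alert so nonzero counts page someone instead of silently accumulating.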
File compaction (OPTIMIZE)
Auto Loader creates files matching the input size, which can be small - thousands of tiny Parquet files from frequent webhook deliveries. Run OPTIMIZE on silver tables daily to compact into ~1GB files. This dramatically improves read performance because Delta Lake reads fewer, larger files instead of many small ones. Z-ORDER on frequently-filtered columns (call_id, call_timestamp) during OPTIMIZE to co-locate related data.
Unity Catalog ACLs
Set table-level GRANT permissions: data engineers get ALL PRIVILEGES on bronze and silver schemas, analysts get SELECT on gold only. Use dynamic views for row-level security based on user attributes - e.g. reps only see their own conversations, managers see their team. Column masking policies redact transcript_text and participant emails for roles that should not see PII.
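A hedged sketch of those grants and a row-level-security view; the group names and the rep_email column are assumptions:

```sql
-- Schema-level grants: engineers write bronze/silver, analysts read gold.
GRANT ALL PRIVILEGES ON SCHEMA silver TO `data-engineers`;
GRANT SELECT ON SCHEMA gold TO `analysts`;

-- Dynamic view: reps see only their own calls, managers see everything.
CREATE OR REPLACE VIEW gold.my_conversations AS
SELECT *
FROM silver.conversations
WHERE is_account_group_member('managers')
   OR rep_email = current_user();
```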
Cluster vs SQL warehouse costs
All-purpose clusters charge per DBU-hour whether busy or idle - do not leave them running overnight. Use job clusters for production ETL: they spin up, process, and terminate automatically. Use SQL warehouses for BI queries and ad-hoc analysis: they auto-start when a query arrives and auto-stop after an idle period. The most common cost mistake is running all-purpose clusters for workloads that should be on job clusters or SQL warehouses.
Delta Lake time travel
Delta keeps 30 days of table history by default (configurable via delta.logRetentionDuration); how far back you can actually query or restore also depends on delta.deletedFileRetentionDuration and your VACUUM schedule, since VACUUM removes the data files that old versions reference. Use DESCRIBE HISTORY to see every operation on a table: who changed what, when, and how many rows were affected. Restore to a previous version with RESTORE TABLE ... TO VERSION AS OF. Useful for debugging bad data loads, rolling back schema changes, and auditing signal re-evaluations.
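As a sketch (version 42 is a placeholder; pick the version from the history output):

```sql
-- Inspect the table's full operation history.
DESCRIBE HISTORY silver.conversation_signals;

-- Query the table as it looked at an earlier version.
SELECT count(*) FROM silver.conversation_signals VERSION AS OF 42;

-- Roll back after a bad batch.
RESTORE TABLE silver.conversation_signals TO VERSION AS OF 42;
```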
Medallion layer validation
Add expectations at the silver boundary to catch data quality issues before they propagate to gold. Check for null call_ids, future timestamps, duplicate signal_ids, and confidence values outside the 0-1 range. In DLT pipelines, use EXPECT constraints, which can warn and keep the row, drop the row (ON VIOLATION DROP ROW), or abort the run (ON VIOLATION FAIL UPDATE). In custom pipelines, add explicit checks in your bronze-to-silver transform and log violations to a quarantine table.
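A DLT-style sketch of those silver-boundary expectations; the constraint names, raw_payload column, and JSON field names are assumptions:

```sql
CREATE OR REFRESH STREAMING TABLE silver.conversation_signals (
  CONSTRAINT nonnull_signal_id
    EXPECT (signal_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT confidence_in_range
    EXPECT (confidence BETWEEN 0 AND 1) ON VIOLATION DROP ROW,
  CONSTRAINT no_future_timestamps
    EXPECT (call_timestamp <= current_timestamp())  -- warn only, row kept
) AS
SELECT
  raw_payload:signal_id::STRING          AS signal_id,
  raw_payload:confidence::DOUBLE         AS confidence,
  raw_payload:call_timestamp::TIMESTAMP  AS call_timestamp
FROM STREAM(bronze.conversations_raw);
```

Dropped and warned rows show up in the pipeline's data quality metrics, so violations are visible rather than silently discarded.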
BI patterns
Analysis Examples
Three patterns that leverage Databricks-native tools. Each one connects conversation signals to business outcomes using capabilities that only exist in a lakehouse platform.
Signal to deal outcome model (MLflow)
Train a scikit-learn or XGBoost model in a Databricks notebook that predicts deal outcomes from silver-layer signal features - average confidence, signal count by type, risk flag frequency. Log experiments in MLflow with automatic parameter and metric tracking. Register the best model in the Unity Catalog model registry and serve predictions via a SQL function that analysts can call directly in their gold-layer views. Model training and serving stay in the same platform - no data export, no separate ML infrastructure.
Rep cohort scoring trend (Databricks SQL dashboard)
Use Databricks SQL to create a dashboard on gold-layer views. Plot average signal scores by rep and month, filter by scoring_version to compare framework iterations side by side. SQL warehouses auto-start when the dashboard is accessed and auto-stop after the idle period - no standing cluster cost for BI. Share with stakeholders via workspace permissions. Built-in alerting can notify when a rep's score drops below a threshold.
Signal lineage audit (Unity Catalog)
Unity Catalog automatically tracks column-level lineage across notebooks, jobs, and SQL queries. Trace a signal value from gold dashboard view → silver conversation_signals table → bronze raw file → source system. Useful for debugging unexpected scores ("why did this deal get flagged as high risk?") and for compliance audits ("which data sources contributed to this decision?"). No manual lineage tracking required - Unity Catalog captures it from the compute graph.
Structured signals
Medallion Layers and MLflow for Signal Intelligence
The lakehouse architecture gives you capabilities that flat warehouses don't — schema evolution, time travel, and native ML integration change how you store and use conversation signals.
Bronze tables accept raw JSON payloads via Auto Loader with schema evolution enabled. When a new signal field appears in the API response, Auto Loader merges the new column into the bronze table automatically — no schema migration needed, no pipeline breakage. Set schemaEvolutionMode to "addNewColumns" and monitor the _rescued_data column for unexpected format changes.
MERGE INTO in the silver layer handles upserts with ACID guarantees. Use MERGE INTO silver.conversation_signals USING bronze_batch ON a matching signal_id (or on call_id plus signal_name as a composite key) to deduplicate while preserving the full Delta transaction log. Each merge creates a new table version, so you can DESCRIBE HISTORY and RESTORE to any previous state if a bad batch lands.
Z-ORDER on call_id + call_timestamp in conversation_signals co-locates related data within Parquet files. This means a query like "all signals for deal X in Q4" skips entire files rather than scanning the full table. Run OPTIMIZE with Z-ORDER weekly on silver tables — the compaction and reordering typically reduces query time by 60-80% on tables over 10M rows.
MLflow integration lets you train signal-to-outcome models directly on your silver tables. Register a model that predicts deal outcome from signal features, serve it from a Model Serving endpoint that SQL can call via ai_query, and materialize predictions into a gold-layer table that your Databricks SQL dashboard reads. The entire pipeline stays inside the lakehouse.