
Snowflake — Conversation Data in a Cloud Warehouse Built for It

Snowflake separates storage from compute, so you can land terabytes of conversation data and only spin up query power when someone actually needs it. The VARIANT type treats semi-structured JSON as a first-class column type — no ETL-time flattening required. Zero-copy cloning lets you spin up a full copy of your database for pipeline testing without doubling your storage bill. And time travel gives you point-in-time recovery when an accidental DELETE or a bad pipeline run overwrites production data.

What you'll learn

  • When Snowflake is the right warehouse for conversation data — and when it isn't
  • The three architectural decisions that determine your cost and query performance
  • Schema design using VARIANT for semi-structured fields and TIMESTAMP_NTZ for time columns
  • Stage + COPY INTO loading, Snowpipe, and when to use each
  • Operational playbook — credit monitoring, pruning, clones, and masking policies
  • Analysis patterns with Sigma, Snowsight, dbt + Looker

Fit check

When to Choose Snowflake

Snowflake is not the only warehouse, and it is not always the cheapest. But for conversation intelligence workloads, it checks every box that matters. Choose Snowflake when:

You need separation of storage and compute

Scale query power without paying for idle compute. Land all your conversation data in cheap cloud storage and only spin up warehouses when someone runs a query or a pipeline loads new rows.

Multiple teams query the same data at different frequencies

RevOps runs daily dashboards, data science runs weekly model training, engineering runs ad-hoc debugging queries. Each team gets their own warehouse with independent sizing and auto-suspend, reading from the same tables without contention.

You want safe experimentation without duplicating data

Zero-copy cloning lets you create a full copy of your database in seconds with zero additional storage cost. Test pipeline changes against production data, then drop the clone when you are done. Time travel lets you recover from bad writes by querying data as it existed at any point in the retention window.

Semi-structured output is a first-class citizen

Conversation platforms and evaluation frameworks produce JSON. Snowflake's VARIANT type stores it natively - you can query nested paths with dot notation, lateral flatten arrays, and cast to typed columns when you are ready. No ETL-time schema enforcement required.
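As an illustration, here is a minimal sketch of querying VARIANT columns from the schema described later in this guide — `metadata:source_type` is a hypothetical nested field:

```sql
-- Sketch: read nested VARIANT paths and flatten the participants array.
-- Assumes the conversations table from the Schema Design section.
SELECT
    c.call_id,
    c.metadata:source_type::STRING AS source_type,       -- path into VARIANT, cast to a typed value
    p.value:name::STRING           AS participant_name,  -- one row per array element
    p.value:role::STRING           AS participant_role
FROM conversations c,
     LATERAL FLATTEN(input => c.participants) p
WHERE c.call_timestamp >= DATEADD('day', -7, CURRENT_TIMESTAMP());
```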

You are already on Snowflake for other analytical workloads

If your finance, marketing, or product data already lives in Snowflake, adding conversation signals alongside means fewer cross-platform joins, shared governance policies, and a single BI connection layer.

Data sharing is important

Snowflake's Secure Data Sharing lets you share live signal datasets across accounts or with partners - no file copies, no ETL, no stale snapshots. Useful when a parent company or consultancy needs access to conversation intelligence from multiple business units.

Architecture

Three Decisions Before Your First COPY INTO

Most Snowflake cost overruns and performance problems come from getting one of these three things wrong early. Nail them upfront and you will avoid the most common pain points.

1. Warehouse sizing

Start with X-Small. It handles thousands of rows per COPY INTO and most analytical queries comfortably. Scale up (Small, Medium) for heavy transforms - complex joins across millions of rows, dbt models with multiple CTEs. Scale out (multi-cluster) when concurrent users or dashboards create queue times. Never scale up for more data volume - Snowflake processes more data by scanning more micropartitions in parallel, not by needing a bigger machine. Set AUTO_SUSPEND to 60 seconds for development warehouses to avoid burning credits while you are reading documentation between queries.
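As a sketch, an X-Small development warehouse with aggressive auto-suspend might be created like this (the warehouse name is illustrative):

```sql
-- Illustrative name; X-Small, suspends after 60 idle seconds, resumes on demand.
CREATE WAREHOUSE IF NOT EXISTS conv_dev_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;
```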

2. Clustering vs micropartitions

Snowflake automatically organises data into micropartitions as it is loaded - each partition stores a contiguous range of rows in a compressed columnar format. For most conversation tables under 10TB, the natural ingestion order on call_timestamp provides good-enough partition pruning for time-filtered queries. Only add an explicit clustering key when you see scan-heavy queries that filter on non-time columns (like signal_name or account_id) and SYSTEM$CLUSTERING_INFORMATION confirms poor overlap ratios. Clustering is a background maintenance cost - do not add it speculatively.
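The check-before-clustering workflow looks roughly like this — only run the ALTER if the report shows genuinely poor pruning:

```sql
-- Inspect overlap/depth for a hot filter column before committing to a clustering key.
SELECT SYSTEM$CLUSTERING_INFORMATION('conversation_signals', '(signal_name)');

-- Only if average_overlap / average_depth are high for your time-insensitive filters:
ALTER TABLE conversation_signals CLUSTER BY (signal_name);
```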

3. Time travel retention

Default time travel retention is 1 day. For the conversation_signals table - where an accidental DELETE or a bad scoring pipeline run can wipe production data - extend it to 90 days (retention beyond 1 day requires Enterprise Edition or higher). This lets you recover any point-in-time state within that window using AT or BEFORE syntax. For raw transcript tables, 1 to 7 days usually suffices since transcripts are immutable after initial load. Know that extended time travel retention increases storage costs proportionally - every changed micropartition is retained for the full retention period.
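A sketch of the retention change and the recovery queries it enables (the timestamp is a placeholder):

```sql
-- Extend retention on the signals table (beyond 1 day requires Enterprise Edition).
ALTER TABLE conversation_signals SET DATA_RETENTION_TIME_IN_DAYS = 90;

-- Query the table as it existed at a specific moment:
SELECT COUNT(*)
FROM conversation_signals
  AT (TIMESTAMP => '2025-01-15 09:00:00'::TIMESTAMP_NTZ);  -- placeholder timestamp

-- Or relative to now - here, one hour ago:
SELECT COUNT(*) FROM conversation_signals AT (OFFSET => -3600);
```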

Schema design

Schema Design

A two-table model keeps raw conversation data separate from extracted signals. This means you can re-evaluate conversations with updated scoring logic without touching the source data. Use Snowflake-native types throughout.

conversations

One row per call

Columns

  • call_id (STRING) - Primary key. Unique identifier from the source platform.
  • source_platform (STRING) - Where the call originated - gong, zoom, teams, etc.
  • call_timestamp (TIMESTAMP_NTZ) - When the call started. NTZ avoids timezone conversion surprises in queries.
  • duration_seconds (INTEGER) - Call duration in seconds.
  • participants (VARIANT) - JSON array of participant objects - names, roles, emails. Query with lateral flatten.
  • deal_id (STRING) - Foreign key to CRM deal/opportunity.
  • account_id (STRING) - Foreign key to CRM account/company.
  • transcript_text (STRING) - Full transcript text. Large - consider a separate table if transcripts approach the 16MB STRING limit.
  • metadata (VARIANT) - Flexible JSON for source-specific fields. Query paths with dot notation.
  • ingested_at (TIMESTAMP_NTZ) - When this row was loaded into Snowflake. Useful for pipeline debugging.
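Assembled as DDL, the table above might look like this sketch (key constraints in Snowflake are informational, not enforced):

```sql
CREATE TABLE IF NOT EXISTS conversations (
    call_id          STRING NOT NULL,
    source_platform  STRING,
    call_timestamp   TIMESTAMP_NTZ,
    duration_seconds INTEGER,
    participants     VARIANT,        -- JSON array of participant objects
    deal_id          STRING,
    account_id       STRING,
    transcript_text  STRING,         -- STRING holds up to 16MB per value
    metadata         VARIANT,        -- source-specific JSON
    ingested_at      TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    PRIMARY KEY (call_id)            -- informational only; Snowflake does not enforce it
);
```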

conversation_signals

One row per extracted signal

Columns

  • signal_id (STRING) - Primary key. Unique identifier for this signal row.
  • call_id (STRING) - Foreign key to conversations table.
  • signal_name (STRING) - Name of the extracted signal - overall_score, risk_flag, pain_point, budget_mentioned.
  • signal_value (STRING) - The extracted value. Cast to appropriate type in queries (e.g. signal_value::FLOAT).
  • signal_type (STRING) - Type hint - score, boolean, category, extraction, count.
  • confidence (FLOAT) - Confidence score from the evaluation (0.0 to 1.0).
  • scoring_version (STRING) - Version of the evaluation framework that produced this signal.
  • evaluated_at (TIMESTAMP_NTZ) - When this signal was generated.

For Semarize-specific modeling, versioning, and grounding lineage best practices, see the canonical modeling guide.
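A corresponding DDL sketch for the signals table:

```sql
CREATE TABLE IF NOT EXISTS conversation_signals (
    signal_id       STRING NOT NULL,
    call_id         STRING NOT NULL,  -- joins to conversations.call_id
    signal_name     STRING,           -- e.g. overall_score, risk_flag, pain_point
    signal_value    STRING,           -- cast at query time: signal_value::FLOAT
    signal_type     STRING,           -- score, boolean, category, extraction, count
    confidence      FLOAT,            -- 0.0 to 1.0
    scoring_version STRING,
    evaluated_at    TIMESTAMP_NTZ,
    PRIMARY KEY (signal_id)           -- informational only
);
```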

Data loading

Loading Strategy

Stage + COPY INTO is the backbone of Snowflake data loading. It is the pattern most teams should start with and the one most teams should stick with unless they have a specific latency requirement that demands something else.

Recommended: Stage + COPY INTO

Scheduled, predictable, idempotent

1. Land files in a stage

Write JSONL or Parquet files to a named internal stage or external stage (S3, GCS, Azure Blob). External stages are preferable when your upstream pipeline already writes to cloud storage.

2. Run COPY INTO on schedule

Start with daily, tighten to every 15 minutes as needs grow. Use Snowflake Tasks or an external scheduler (Airflow, dbt Cloud, cron). COPY INTO is idempotent - it tracks loaded file names and skips duplicates automatically on re-run.

3. Use FILE_FORMAT for JSON handling

Define a named FILE_FORMAT - STRIP_OUTER_ARRAY = TRUE so each element of a JSON array loads as its own row, or TYPE = 'PARQUET' for columnar files. This handles the container structure at ingest time without custom transformation code.

4. Validate with VALIDATION_MODE

On first load or after schema changes, run COPY INTO with VALIDATION_MODE = 'RETURN_ERRORS' to preview failures without actually loading. Catches type mismatches and malformed records before they hit your table.
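The steps above might be sketched as follows - the stage URL, storage integration, and object names are placeholders for your own:

```sql
-- Named file format for JSONL with any outer array stripped.
CREATE FILE FORMAT IF NOT EXISTS conv_jsonl_fmt
  TYPE = 'JSON'
  STRIP_OUTER_ARRAY = TRUE;

-- External stage over cloud storage your pipeline already writes to.
CREATE STAGE IF NOT EXISTS conv_stage
  URL = 's3://your-bucket/conversations/'      -- placeholder
  STORAGE_INTEGRATION = your_s3_integration    -- placeholder
  FILE_FORMAT = conv_jsonl_fmt;

-- Dry run: report parse and type errors without loading any rows.
COPY INTO conversations
FROM @conv_stage
VALIDATION_MODE = 'RETURN_ERRORS';

-- The actual load; previously loaded file names are skipped on re-run.
COPY INTO conversations
FROM @conv_stage
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```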

When to use Snowpipe

Event-driven, sub-minute latency

  • You need sub-minute latency from file landing to queryable row - operational dashboards, real-time alerting on risk signals
  • Your upstream system already writes to S3/GCS/Azure Blob and can fire SQS, SNS, or Event Grid notifications on new files
  • You accept the per-file serverless compute charge - Snowpipe uses its own compute pool, billed per file processed, which adds up faster than scheduled warehouse compute for high-volume streams

Also worth knowing: The Snowflake Kafka connector writes directly from Kafka topics to Snowflake tables - useful if your event bus is already Kafka-based. For CDC patterns where you need to react to row-level changes in your conversation tables, Snowflake Streams + Tasks provide a native change data capture mechanism without external tooling.
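A minimal Snowpipe sketch, assuming an external stage and a named file format like those used in the scheduled COPY INTO pattern (all names illustrative):

```sql
-- Auto-ingest pipe: cloud storage notifications (SQS/SNS/Event Grid) trigger loads.
CREATE PIPE IF NOT EXISTS conv_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO conversations
  FROM @conv_stage                                 -- illustrative stage name
  FILE_FORMAT = (FORMAT_NAME = 'conv_jsonl_fmt')   -- illustrative format name
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

-- Check pipe health and pending file counts:
SELECT SYSTEM$PIPE_STATUS('conv_pipe');
```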

Operations

Operational Playbook

Snowflake-specific operational concerns that matter for conversation data pipelines. These are the things that separate a working prototype from a production system that does not surprise you with a $10k credit bill.

Credit consumption monitoring

Query ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY to track credit spend per warehouse per day. Set resource monitors at the account or warehouse level with email alerts at 80% and hard suspend at 100% of your monthly budget. Credits burn whether queries succeed or fail - a runaway query on a Large warehouse costs the same as a productive one.
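Both pieces can be sketched as follows - the quota, monitor, and warehouse names are placeholders for your own budget:

```sql
-- Daily credit burn per warehouse, last 30 days (ACCOUNT_USAGE views lag by up to ~3 hours).
SELECT
    warehouse_name,
    DATE_TRUNC('day', start_time) AS usage_day,
    SUM(credits_used)             AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY usage_day DESC, credits DESC;

-- Placeholder quota: notify at 80%, hard-suspend at 100%.
CREATE RESOURCE MONITOR conv_monthly_rm
  WITH CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80  PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE conv_dev_wh SET RESOURCE_MONITOR = conv_monthly_rm;  -- illustrative warehouse
```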

Warehouse auto-suspend

Set AUTO_SUSPEND to 60 seconds for development and staging warehouses. Every second of idle compute burns credits. For production warehouses serving dashboards, 300 seconds (5 minutes) avoids frequent cold starts that add latency to the first query after idle. Never leave AUTO_SUSPEND at the default 10 minutes for dev - it is the single easiest cost saving.

Micropartition pruning

Check pruning efficiency with SYSTEM$CLUSTERING_INFORMATION('conversations', '(call_timestamp)'). If the average_overlap ratio is high or more than 50% of micropartitions are scanned for time-filtered queries, your natural ordering may have degraded from heavy UPDATE or DELETE operations. Recluster or consider adding an explicit clustering key on call_timestamp.

VARIANT query performance

Querying VARIANT paths (e.g. metadata:source_type::STRING) is slower than querying typed columns because Snowflake must deserialise the JSON at query time. If you notice repeated lateral flatten operations in slow queries or your dashboards are hammering the same VARIANT paths, create a scheduled Task that materialises those paths into typed columns in a downstream table.
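One sketch of that materialisation pattern - the task, warehouse, target table, and VARIANT path are hypothetical:

```sql
-- Hourly task that rebuilds a typed projection of hot VARIANT paths.
CREATE OR REPLACE TASK refresh_conversations_flat
  WAREHOUSE = conv_dev_wh           -- hypothetical warehouse
  SCHEDULE = '60 MINUTE'
AS
  CREATE OR REPLACE TABLE conversations_flat AS
  SELECT
      call_id,
      call_timestamp,
      metadata:source_type::STRING AS source_type   -- hypothetical hot path
  FROM conversations;

-- Tasks are created suspended; resume to start the schedule.
ALTER TASK refresh_conversations_flat RESUME;
```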

Zero-copy clones for testing

Clone the entire database (CREATE DATABASE dev_clone CLONE production) for pipeline testing. The clone shares storage with the source until data diverges - so it costs nothing until you start writing to it. Test schema migrations, new COPY INTO patterns, or dbt model changes against real data. Drop the clone when done.

Data retention and compliance

Use time travel + fail-safe for recovery. Implement dynamic data masking policies on transcript_text and participants columns to redact PII for roles that should not see it. Use row access policies to restrict which accounts or teams a given role can query. For GDPR right-to-deletion, DELETE the rows and let time travel retention expire - after the retention period plus the 7-day fail-safe window, the data is permanently gone.
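A dynamic masking sketch for transcript_text - the role name is a placeholder for whichever roles are cleared to see PII:

```sql
-- Redact transcripts for every role except the placeholder PII-cleared role.
CREATE MASKING POLICY transcript_mask AS (val STRING)
RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('ANALYST_PII') THEN val   -- placeholder role
    ELSE '***REDACTED***'
  END;

ALTER TABLE conversations
  MODIFY COLUMN transcript_text SET MASKING POLICY transcript_mask;
```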

BI patterns

Analysis Examples

Three patterns that work well with Snowflake's native BI ecosystem. Each one connects conversation signals to business outcomes using tools that integrate natively with Snowflake.

Signal to win-rate correlation (Sigma / Tableau)

Join conversation_signals to your CRM opportunity table on deal_id. Pivot signal names into columns using Snowflake's PIVOT or conditional aggregation. Calculate win rates by signal quartile - e.g. deals where the discovery_score signal was in the top 25% closed at 3x the rate of the bottom 25%. Sigma connects natively via Snowflake Partner Connect. Tableau uses the Snowflake JDBC driver - point it at a view or a dbt mart and let analysts self-serve.
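Sketched with conditional aggregation - the opportunities table and its is_won column stand in for your CRM sync target:

```sql
-- Win rate by discovery_score quartile; opportunities(deal_id, is_won) is assumed.
WITH deal_scores AS (
    SELECT c.deal_id,
           AVG(s.signal_value::FLOAT) AS discovery_score
    FROM conversation_signals s
    JOIN conversations c ON c.call_id = s.call_id
    WHERE s.signal_name = 'discovery_score'
    GROUP BY c.deal_id
),
ranked AS (
    SELECT deal_id,
           NTILE(4) OVER (ORDER BY discovery_score) AS quartile
    FROM deal_scores
)
SELECT r.quartile,
       COUNT(*)                 AS deals,
       AVG(IFF(o.is_won, 1, 0)) AS win_rate
FROM ranked r
JOIN opportunities o ON o.deal_id = r.deal_id
GROUP BY r.quartile
ORDER BY r.quartile;
```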

Rep cohort scoring trend (Snowsight dashboards)

Use Snowflake's built-in Snowsight dashboarding to plot average signal scores by rep and month. Filter by scoring_version to compare framework iterations side by side - did the new discovery rubric actually shift scores? Snowsight SQL worksheets with chart tiles make this a 10-minute setup. No external BI tool needed for internal team reviews.

Deal risk dashboard (dbt + Looker)

dbt models in your transform layer aggregate risk signals per deal - count of risk_flag = true, average confidence, latest scoring_version. A final mart table joins these to CRM stage and close date. Looker connects to the gold layer and provides self-service filtering by team, pipeline stage, and time period. Sales leadership gets a live view of which deals have the most risk signals without writing SQL.

Structured signals

Why VARIANT and Time Travel Matter for Signal Storage

Snowflake's semi-structured data handling gives you a unique advantage when storing conversation intelligence outputs — you can keep the full payload and the extracted typed columns side by side.

Store the full API response as VARIANT in the conversations table. Snowflake compresses VARIANT efficiently and lets you query nested fields with dot notation (e.g., raw_response:signals[0].confidence::FLOAT) without needing to pre-extract every field.

Extract the fields you query frequently into typed columns in conversation_signals. Snowflake can prune micropartitions on typed FLOAT and VARCHAR columns far more effectively than scanning VARIANT, so your BI queries stay under a second even at scale.

Use time travel (AT / BEFORE syntax) to compare signal distributions before and after a scoring framework change. Instead of maintaining separate version tables, query the same table at two timestamps and diff the results directly in SQL.
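That comparison can be written as a single query - the timestamp is a placeholder for the moment just before the framework change shipped:

```sql
WITH before_change AS (
    SELECT signal_name, AVG(signal_value::FLOAT) AS avg_score
    FROM conversation_signals
      AT (TIMESTAMP => '2025-01-15 00:00:00'::TIMESTAMP_NTZ)  -- placeholder
    GROUP BY signal_name
),
current_state AS (
    SELECT signal_name, AVG(signal_value::FLOAT) AS avg_score
    FROM conversation_signals
    GROUP BY signal_name
)
SELECT b.signal_name,
       b.avg_score               AS avg_before,
       c.avg_score               AS avg_after,
       c.avg_score - b.avg_score AS delta
FROM before_change b
JOIN current_state c ON c.signal_name = b.signal_name
ORDER BY ABS(c.avg_score - b.avg_score) DESC;
```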

Zero-copy cloning lets you create a full test environment of your signal tables without doubling storage costs. Clone the production schema, re-run evaluation with updated logic, and compare outputs before promoting to production.

Modeling guide: For the full schema, versioning strategy, and grounding lineage patterns, see Modeling Semarize Outputs.
CRM sync: For pushing signals back to your CRM, see the HubSpot guide or Salesforce guide.
