BigQuery — Serverless Analytics for Conversation Intelligence

BigQuery is fully serverless. There is no infrastructure to provision, no clusters to size, and no concurrency knobs to tune. You go from zero to querying terabytes of conversation data without touching a single server. Pay-per-query economics mean you only spend money when you actually run queries — and native integration with Looker, Looker Studio, Connected Sheets, and BigQuery ML means your BI layer is one click away from your warehouse.

Recommended default ingestion pattern

GCS load jobs + scheduled queries. Land your files in Cloud Storage, run free load jobs into BigQuery, then use scheduled queries for downstream transforms. This is the cheapest path — load jobs have zero cost. No streaming surcharge, no per-row fees. Only move to streaming inserts when you genuinely need sub-second latency.

When to Choose BigQuery

BigQuery is the right warehouse for conversation intelligence when these conditions match your team.

You're on GCP or want a serverless warehouse with zero infrastructure management — no clusters to start, stop, resize, or monitor.

Pay-per-query (on-demand) pricing fits your usage. You don't query 24/7, so you only pay when you actually run queries — ~$6.25 per TB scanned, nothing when idle.

You want native integration with Looker, Looker Studio, Connected Sheets, and BigQuery ML (BQML) for in-warehouse machine learning — no drivers, no connectors, no middleware.

Nested and repeated fields (STRUCT and ARRAY) let you model participants and signal arrays directly in columns without joins — one row per conversation, everything inline.

You need petabyte-scale scanning without worrying about cluster sizing or concurrency limits. BigQuery separates storage and compute completely.

Slot-based execution means no warehouse sizing decisions. Google handles autoscaling — you never wait for a warehouse to resume or queue behind other queries.

Three Decisions Before Your First Load Job

Get these three right early. Everything else is adjustable later.

1. On-demand vs capacity (Editions) pricing

On-demand charges per TB scanned (~$6.25/TB). Editions provide slot-based capacity pricing: autoscaling slots billed per slot-hour, with optional one- or three-year commitments for discounted, predictable spend. Start on-demand until you understand your query patterns. Evaluate Editions once you're consistently spending $500+/month; reserved slots give you predictable billing and dedicated capacity. Don't pre-optimize pricing before you know what your workload looks like.

2. Require partition filters

Enable require_partition_filter on the conversations table from day one. This forces every query to include a WHERE clause on call_timestamp, which prevents accidental full-table scans that cost real money. Partition by call_timestamp (DATE granularity). Without this guard, a single SELECT * from a BI tool or ad hoc exploration can scan your entire table history.
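As a sketch, the guard can be applied to an existing table like this (the dataset name `semarize` is an assumption for illustration):

```sql
-- Turn on the guard for an existing table
-- (dataset name `semarize` is illustrative)
ALTER TABLE semarize.conversations
SET OPTIONS (require_partition_filter = TRUE);

-- Every query must now bound the partition column, e.g.:
SELECT call_id, duration_seconds
FROM semarize.conversations
WHERE DATE(call_timestamp) BETWEEN '2024-01-01' AND '2024-01-31';
```

A query that omits the `call_timestamp` filter will be rejected with an error instead of silently scanning the full table.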

3. Dataset location

Choose US or EU multi-region, or a specific single region (us-central1, europe-west2, etc.). This decision is permanent — data does not move after dataset creation, and cross-region queries are not supported. Pick the region closest to your data sources and BI tools. If your GCS buckets are in us-central1, your BigQuery dataset should be too.
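For example, the location is fixed once at dataset creation (dataset name and region are illustrative):

```sql
-- Location cannot be changed after this statement runs
CREATE SCHEMA semarize
OPTIONS (location = 'us-central1');
```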

Schema Design

Two tables — conversations and conversation_signals — using BigQuery-native types. Partition conversations by call_timestamp (DATE). Cluster on the columns you filter most.

conversations

One row per call · Partition by call_timestamp · Cluster on source_platform, deal_id

Columns

call_id · STRING · Primary key. Unique identifier from the source platform.
source_platform · STRING · Where the call originated — gong, zoom, teams, etc. Cluster column.
call_timestamp · TIMESTAMP · When the call started. Partition key (DATE granularity).
duration_seconds · INT64 · Call duration in seconds.
participants · ARRAY&lt;STRUCT&gt; · Nested repeated field: name STRING, email STRING, role STRING. No joins needed.
deal_id · STRING · Foreign key to CRM deal/opportunity. Cluster column.
account_id · STRING · Foreign key to CRM account/company.
transcript_text · STRING · Full transcript text. Large — avoid selecting in aggregate queries.
metadata · JSON · Source-specific metadata that varies between platforms. Use JSON for truly variable shapes.
ingested_at · TIMESTAMP · When this row was loaded into BigQuery.
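A possible DDL for the table above, assuming a dataset named `semarize`:

```sql
CREATE TABLE semarize.conversations (
  call_id          STRING NOT NULL,
  source_platform  STRING,
  call_timestamp   TIMESTAMP NOT NULL,
  duration_seconds INT64,
  -- Participants live inline with the call row; no join table
  participants     ARRAY<STRUCT<name STRING, email STRING, role STRING>>,
  deal_id          STRING,
  account_id       STRING,
  transcript_text  STRING,
  metadata         JSON,
  ingested_at      TIMESTAMP
)
PARTITION BY DATE(call_timestamp)
CLUSTER BY source_platform, deal_id
OPTIONS (require_partition_filter = TRUE);
```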

conversation_signals

One row per extracted signal · Cluster on call_id, signal_name

Columns

signal_id · STRING · Primary key. Unique identifier for this signal row.
call_id · STRING · Foreign key to conversations. Cluster column.
signal_name · STRING · Name of the extracted signal — overall_score, risk_flag, pain_point. Cluster column.
signal_value · STRING · The extracted value. Cast to the appropriate type in queries (SAFE_CAST).
signal_type · STRING · Type hint — score, boolean, category, extraction, count.
confidence · FLOAT64 · Confidence score from the evaluation (0.0 to 1.0).
scoring_version · STRING · Version of the evaluation framework that produced this signal.
evaluated_at · TIMESTAMP · When this signal was generated.
The two-table model keeps raw conversation data separate from extracted signals. Re-evaluate conversations with updated scoring logic without modifying the source data — just append new signal rows with a new scoring_version.
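One way to declare the signals table and read typed values back out (dataset name and the `scoring_version` value are illustrative):

```sql
CREATE TABLE semarize.conversation_signals (
  signal_id       STRING NOT NULL,
  call_id         STRING NOT NULL,
  signal_name     STRING,
  signal_value    STRING,
  signal_type     STRING,
  confidence      FLOAT64,
  scoring_version STRING,
  evaluated_at    TIMESTAMP
)
CLUSTER BY call_id, signal_name;

-- Read a numeric signal back out, pinned to one scoring version
SELECT call_id, SAFE_CAST(signal_value AS FLOAT64) AS overall_score
FROM semarize.conversation_signals
WHERE signal_name = 'overall_score'
  AND scoring_version = 'v2';  -- version string is illustrative
```

SAFE_CAST returns NULL instead of failing on malformed values, which matters when signal_value holds mixed shapes across signal types.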

Modeling depth: For Semarize-specific modeling, versioning, and grounding lineage best practices, see the canonical modeling guide.

Loading Strategy

Start with GCS load jobs. They are free, reliable, and handle every format you need. Only add complexity when your latency requirements demand it.

GCS load jobs (recommended)

Land files (JSONL, Parquet, Avro) in a Cloud Storage bucket

Run bq load or use the BigQuery client library to create load jobs

Load jobs are FREE — no per-row cost, no streaming surcharge, no compute charges

Schedule with Cloud Scheduler → Cloud Function, or use BigQuery scheduled queries for downstream transforms

Recommended starting point for every team.
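The load-job path can also be expressed entirely in SQL with the LOAD DATA statement (bucket path and dataset name are assumptions):

```sql
-- Free load job in SQL; appends matching JSONL files from GCS
LOAD DATA INTO semarize.conversations
FROM FILES (
  format = 'JSON',
  uris = ['gs://my-landing-bucket/conversations/*.jsonl']
);
```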

When to use streaming

BigQuery Storage Write API for sub-second ingestion latency

Exactly-once semantics available in committed mode — no dedup logic needed

Costs ~$0.025/GB beyond a monthly free tier — significantly more expensive than free load jobs

Use only when downstream systems require real-time data availability.

Other ingestion paths

Pub/Sub → Dataflow → BigQuery for fully managed streaming pipelines with Apache Beam

BigQuery Data Transfer Service for scheduled imports from SaaS sources (Google Ads, YouTube, etc.)

Operational Playbook

BigQuery-specific operational concerns. These are the things that catch teams off guard after deployment.

Bytes scanned = cost

On-demand pricing charges per TB scanned. Always use column selection — never SELECT *. Preview cost with --dry-run before executing ad hoc queries. A single unfiltered query against a year of transcripts can cost more than a month of well-structured queries.

Partition filter enforcement

Set require_partition_filter = true on conversation tables. This forces WHERE clauses on call_timestamp and prevents accidental full scans. Anyone querying the table must specify a date range. This is the single most effective cost control.

Clustering maintenance

BigQuery auto-reclusters — no manual maintenance, no VACUUM, no OPTIMIZE. But choose cluster columns wisely: pick 1–4 columns you frequently filter on (source_platform, deal_id, signal_name). Cluster order matters — put the most selective column first.

Slot utilization

On Editions, monitor slot usage in INFORMATION_SCHEMA.JOBS_BY_PROJECT. If you're consistently hitting slot limits, increase capacity or set up autoscaling reservations. On-demand users don't need to worry about slots — Google allocates them dynamically.
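A sketch of a slot-utilization query (the `region-us` qualifier is an assumption; use your dataset's region):

```sql
-- Approximate slots consumed per hour over the past 7 days
SELECT
  TIMESTAMP_TRUNC(creation_time, HOUR) AS hour,
  SUM(total_slot_ms) / (1000 * 60 * 60) AS avg_slots_used
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY hour
ORDER BY hour;
```

Dividing total_slot_ms by the milliseconds in an hour gives the average number of slots in use during that hour, which you can compare against your reservation size.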

Query quotas

On-demand projects are governed by default quotas, most notably a cap on concurrent interactive queries (100 per project by default). Monitor usage with INFORMATION_SCHEMA job views. Most teams querying conversation data never come close, but automated dashboards with frequent refreshes can accumulate fast.

Materialized views

Use materialized views for expensive aggregations that get queried repeatedly — like daily signal summaries or rep performance rollups. BigQuery auto-refreshes them when base data changes. Cost-effective alternative to running the same aggregation query multiple times a day.
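A sketch of a daily signal summary (dataset name is illustrative; the view sticks to simple aggregates, which materialized views support, and leaves averages to be computed downstream as sum / count):

```sql
CREATE MATERIALIZED VIEW semarize.daily_signal_summary AS
SELECT
  DATE(evaluated_at) AS signal_date,
  signal_name,
  COUNT(*) AS signal_count,
  SUM(SAFE_CAST(signal_value AS FLOAT64)) AS value_sum,
  SUM(confidence) AS confidence_sum
FROM semarize.conversation_signals
GROUP BY signal_date, signal_name;
```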

Analysis Examples

Three ways to turn conversation signals into decisions, all using BigQuery's native BI integrations.

Signal → win rate correlation (Looker)

Build a Looker Explore on conversations joined to signals. Use Looker's measure types (count, average, ratio) to calculate win rates by signal quartile. Looker connects natively to BigQuery with zero driver configuration — point it at your dataset, define your LookML model, and every dimension in your schema is immediately explorable. Slice win rates by signal_name, scoring_version, time period, and team.
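The SQL underneath such an Explore might look like the sketch below. Note that `crm_deals` (with `deal_id` and a boolean `won`) is a hypothetical CRM export, not part of the schema above:

```sql
-- Win rate by overall_score quartile
WITH scored AS (
  SELECT c.deal_id, SAFE_CAST(s.signal_value AS FLOAT64) AS score
  FROM semarize.conversations AS c
  JOIN semarize.conversation_signals AS s USING (call_id)
  WHERE s.signal_name = 'overall_score'
    AND DATE(c.call_timestamp) >= '2024-01-01'  -- satisfies the partition filter
),
quartiled AS (
  SELECT deal_id, NTILE(4) OVER (ORDER BY score) AS quartile
  FROM scored
)
SELECT q.quartile, AVG(IF(d.won, 1, 0)) AS win_rate
FROM quartiled AS q
JOIN crm_deals AS d USING (deal_id)  -- hypothetical CRM export
GROUP BY q.quartile
ORDER BY q.quartile;
```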

Rep cohort trend (Connected Sheets)

Use BigQuery Connected Sheets to pull signal trends directly into Google Sheets. Sales managers get a self-service pivot table that queries live warehouse data — no Looker license needed, no exports, no stale CSVs. Filter by rep, time period, and signal type. The sheet refreshes on demand and can be scheduled. This is the fastest path to getting structured conversation data in front of non-technical stakeholders.

Signal correlation with BQML

Use BigQuery ML's CREATE MODEL to build a logistic regression predicting deal outcomes from signal values. Run ML entirely in SQL — no Python, no Jupyter, no data export to a separate ML platform. Pivot signal_name into feature columns, cast signal_value to FLOAT64, and train against a won/lost label. BQML handles feature encoding, train/test splits, and model evaluation. Iterate on features without leaving the warehouse.
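A minimal sketch of that pattern, assuming a hypothetical `crm_deals` table with `deal_id` and a boolean `won` label (dataset name and signal encodings are also illustrative):

```sql
CREATE OR REPLACE MODEL semarize.deal_outcome_model
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['won']) AS
SELECT
  -- Pivot signal rows into feature columns, one row per call
  MAX(IF(s.signal_name = 'overall_score',
         SAFE_CAST(s.signal_value AS FLOAT64), NULL)) AS overall_score,
  MAX(IF(s.signal_name = 'risk_flag',
         IF(s.signal_value = 'true', 1.0, 0.0), NULL)) AS risk_flag,
  ANY_VALUE(d.won) AS won
FROM semarize.conversation_signals AS s
JOIN semarize.conversations AS c USING (call_id)
JOIN crm_deals AS d ON d.deal_id = c.deal_id  -- hypothetical label source
WHERE DATE(c.call_timestamp) >= '2024-01-01'
GROUP BY call_id;
```

After training, ML.EVALUATE and ML.PREDICT run against the same model name, still entirely in SQL.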

Leveraging Serverless Compute for Signal Analytics

BigQuery's serverless architecture changes how you think about signal storage — you optimise for scan efficiency rather than cluster sizing.

Partition conversation_signals by ingestion date and cluster on call_id. This combination means a query like "show me all signals for calls in the last 7 days" only scans the relevant partitions, and within those partitions BigQuery skips blocks that don't match the call_id filter. On a 10M-row table, this typically reduces bytes scanned by 90%+.
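Assuming the signals table uses ingestion-time partitioning as described here, the `_PARTITIONDATE` pseudo-column drives the pruning (call IDs are illustrative):

```sql
-- Scans only the last 7 partitions, then skips blocks by call_id
SELECT signal_name, signal_value, confidence
FROM semarize.conversation_signals
WHERE _PARTITIONDATE >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
  AND call_id IN ('call_001', 'call_002');  -- illustrative IDs
```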

Use ARRAY<STRUCT> for participants in the conversations table instead of a separate join table. BigQuery handles nested/repeated fields natively and they co-locate with the parent row, eliminating shuffle-heavy JOINs. This is a BigQuery-specific advantage over traditional normalised schemas.
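Querying into the nested field needs only UNNEST, no join table (dataset name and the role value are illustrative):

```sql
-- Flatten participants inline; each call row expands to one row per participant
SELECT c.call_id, p.name, p.email
FROM semarize.conversations AS c,
     UNNEST(c.participants) AS p
WHERE DATE(c.call_timestamp) >= '2024-01-01'
  AND p.role = 'sales_rep';  -- role value is illustrative
```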

BQML lets you build signal-to-outcome correlation models without moving data out of BigQuery. CREATE MODEL with a logistic regression on signal values, confidence scores, and deal outcomes to find which conversation patterns predict closed-won deals.

Connected Sheets lets non-technical stakeholders query signal tables directly from Google Sheets, with BigQuery handling the compute. This bridges the gap between the data warehouse and the people who need to review conversation quality scores without learning SQL.

Full modeling guide: For the complete schema, versioning strategy, and grounding lineage patterns, see Modeling Semarize Outputs.
CRM sync: For pushing signals back to your CRM, see the HubSpot guide or Salesforce guide.
