CRM & Data
BigQuery — Serverless Analytics for Conversation Intelligence
BigQuery is fully serverless. There is no infrastructure to provision, no clusters to size, and no concurrency knobs to tune. You go from zero to querying terabytes of conversation data without touching a single server. Pay-per-query economics mean you only spend money when you actually run queries — and native integration with Looker, Looker Studio, Connected Sheets, and BigQuery ML means your BI layer is one click away from your warehouse.
Recommended default ingestion pattern
GCS load jobs + scheduled queries. Land your files in Cloud Storage, run free load jobs into BigQuery, then use scheduled queries for downstream transforms. This is the cheapest path — load jobs have zero cost. No streaming surcharge, no per-row fees. Only move to streaming inserts when you genuinely need sub-second latency.
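A downstream transform of this kind can be registered as a BigQuery scheduled query. The sketch below assumes hypothetical analytics.conversations and analytics.conversation_signals tables with illustrative column names, not a prescribed schema:

```sql
-- Hypothetical downstream transform, run daily as a scheduled query.
-- Table and column names are illustrative.
CREATE OR REPLACE TABLE analytics.daily_signal_summary AS
SELECT
  DATE(c.call_timestamp) AS call_date,
  s.signal_name,
  COUNT(*) AS signal_count,
  AVG(SAFE_CAST(s.signal_value AS FLOAT64)) AS avg_value
FROM analytics.conversations AS c
JOIN analytics.conversation_signals AS s
  ON s.call_id = c.call_id
WHERE c.call_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY call_date, signal_name;
```

Registering this as a scheduled query keeps the summary fresh with no orchestration infrastructure of your own.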
Platform fit
When to Choose BigQuery
BigQuery is the right warehouse for conversation intelligence when these conditions match your team.
You're on GCP or want a serverless warehouse with zero infrastructure management — no clusters to start, stop, resize, or monitor.
Pay-per-query (on-demand) pricing fits your usage. You don't query 24/7, so you only pay when you actually run queries — ~$6.25 per TiB scanned, nothing when idle.
You want native integration with Looker, Looker Studio, Connected Sheets, and BigQuery ML (BQML) for in-warehouse machine learning — no drivers, no connectors, no middleware.
Nested and repeated fields (STRUCT and ARRAY) let you model participants and signal arrays directly in columns without joins — one row per conversation, everything inline.
You need petabyte-scale scanning without worrying about cluster sizing or concurrency limits. BigQuery separates storage and compute completely.
Slot-based execution means no warehouse sizing decisions. Google handles autoscaling — you never wait for a warehouse to resume or queue behind other queries.
Architecture choices
Three Decisions Before Your First Load Job
Get these three right early. Everything else is adjustable later.
On-demand vs capacity (Editions) pricing
On-demand charges per TiB scanned (~$6.25/TiB). Editions give you committed slots at a fixed monthly cost with autoscaling. Start on-demand until you understand your query patterns. Evaluate Editions once you're consistently spending $500+/month — committed slots give you predictable billing and priority execution. Don't pre-optimize pricing before you know what your workload looks like.
Require partition filters
Enable require_partition_filter on the conversations table from day one. This forces every query to include a WHERE clause on call_timestamp, which prevents accidental full-table scans that cost real money. Partition by call_timestamp (DATE granularity). Without this guard, a single SELECT * from a BI tool or ad hoc exploration can scan your entire table history.
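If the table already exists, the guard is a one-line ALTER TABLE (dataset and table names here are illustrative):

```sql
-- Force every query to filter on the partitioning column.
ALTER TABLE analytics.conversations
SET OPTIONS (require_partition_filter = TRUE);
```

Once set, a query with no predicate on call_timestamp fails immediately with an error instead of scanning the full table.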
Dataset location
Choose US or EU multi-region, or a specific single region (us-central1, europe-west2, etc.). This decision is permanent — data does not move after dataset creation, and cross-region queries are not supported. Pick the region closest to your data sources and BI tools. If your GCS buckets are in us-central1, your BigQuery dataset should be too.
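The location is fixed at dataset creation time (the dataset name below is illustrative):

```sql
-- Location cannot be changed after the dataset exists.
CREATE SCHEMA IF NOT EXISTS analytics
OPTIONS (location = 'us-central1');
```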
Schema design
Schema Design
Two tables — conversations and conversation_signals — using BigQuery-native types. Partition conversations by call_timestamp (DATE). Cluster on the columns you filter most.
conversations
One row per call · Partition by call_timestamp · Cluster on source_platform, deal_id
conversation_signals
One row per extracted signal · Cluster on call_id, signal_name
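As a sketch, the two-table layout might be declared as follows. Only call_timestamp, source_platform, deal_id, call_id, signal_name, signal_value, and scoring_version are named elsewhere in this guide; every other column and type here is an illustrative assumption, not the canonical Semarize schema:

```sql
-- One possible DDL for the two-table layout described above.
-- Columns beyond those named in the text are illustrative assumptions.
CREATE TABLE analytics.conversations (
  call_id          STRING NOT NULL,
  call_timestamp   TIMESTAMP NOT NULL,
  source_platform  STRING,
  deal_id          STRING,
  participants     ARRAY<STRUCT<name STRING, email STRING, role STRING>>,
  transcript       STRING
)
PARTITION BY DATE(call_timestamp)
CLUSTER BY source_platform, deal_id
OPTIONS (require_partition_filter = TRUE);

CREATE TABLE analytics.conversation_signals (
  call_id          STRING NOT NULL,
  signal_name      STRING NOT NULL,
  signal_value     STRING,
  confidence       FLOAT64,
  scoring_version  STRING
)
CLUSTER BY call_id, signal_name;
```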
Modeling depth: For Semarize-specific modeling, versioning, and grounding lineage best practices, see the canonical modeling guide.
Data loading
Loading Strategy
Start with GCS load jobs. They are free, reliable, and handle every format you need. Only add complexity when your latency requirements demand it.
GCS load jobs
Land files (JSONL, Parquet, Avro) in a Cloud Storage bucket
Run bq load or use the BigQuery client library to create load jobs
Load jobs are FREE — no per-row cost, no streaming surcharge, no compute charges
Schedule with Cloud Scheduler → Cloud Function, or use BigQuery scheduled queries for downstream transforms
Recommended starting point for every team.
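If you prefer to stay in SQL, a load job can also be expressed with the LOAD DATA statement; the bucket path and table name below are illustrative:

```sql
-- Equivalent to a `bq load` job; runs as a free load job, not a billed query.
LOAD DATA INTO analytics.conversations
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://your-bucket/conversations/*.parquet']
);
```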
When to use streaming
BigQuery Storage Write API for sub-second ingestion latency
Exactly-once semantics available in committed mode — no dedup logic needed
Storage Write API ingestion costs ~$0.025/GiB (the first 2 TiB per month are free); legacy streaming inserts cost ~$0.05/GB. Either way, notably more expensive than free load jobs
Use only when downstream systems require real-time data availability.
Other ingestion paths
Pub/Sub → Dataflow → BigQuery for fully managed streaming pipelines with Apache Beam
BigQuery Data Transfer Service for scheduled imports from SaaS sources (Google Ads, YouTube, etc.)
Production operations
Operational Playbook
BigQuery-specific operational concerns. These are the things that catch teams off guard after deployment.
Bytes scanned = cost
On-demand pricing charges per TiB scanned. Always use column selection — never SELECT *. Preview cost with --dry_run before executing ad hoc queries. A single unfiltered query against a year of transcripts can cost more than a month of well-structured queries.
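In practice, ad hoc queries should name their columns and bound the partition column; for example (table and column names illustrative):

```sql
-- Scans only one week of partitions and three columns.
SELECT call_id, deal_id, source_platform
FROM analytics.conversations
WHERE call_timestamp >= TIMESTAMP('2024-06-01')
  AND call_timestamp <  TIMESTAMP('2024-06-08');
```

Running the same statement with bq query --dry_run reports the bytes that would be scanned without executing it.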
Partition filter enforcement
Set require_partition_filter = true on conversation tables. This forces WHERE clauses on call_timestamp and prevents accidental full scans. Anyone querying the table must specify a date range. This is the single most effective cost control.
Clustering maintenance
BigQuery auto-reclusters — no manual maintenance, no VACUUM, no OPTIMIZE. But choose cluster columns wisely: pick 1–4 columns you frequently filter on (source_platform, deal_id, signal_name). Cluster order matters — put the most selective column first.
Slot utilization
On Editions, monitor slot usage in INFORMATION_SCHEMA.JOBS_BY_PROJECT. If you're consistently hitting slot limits, increase capacity or set up autoscaling reservations. On-demand users don't need to worry about slots — Google allocates them dynamically.
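A starting point for that monitoring, assuming a US-region dataset (adjust the region qualifier to match your dataset location):

```sql
-- Approximate average slot usage per query job over the last day.
SELECT
  job_id,
  total_slot_ms
    / NULLIF(TIMESTAMP_DIFF(end_time, start_time, MILLISECOND), 0) AS avg_slots,
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY avg_slots DESC
LIMIT 20;
```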
Query quotas
On-demand query usage is unlimited by default, but rate limits apply (for example, a default cap of 100 concurrent interactive queries per project), and you can set custom per-day bytes-scanned quotas to cap spend. Monitor usage with INFORMATION_SCHEMA job views. Most teams querying conversation data never come close, but automated dashboards with frequent refreshes can accumulate fast.
Materialized views
Use materialized views for expensive aggregations that get queried repeatedly — like daily signal summaries or rep performance rollups. BigQuery auto-refreshes them when base data changes. Cost-effective alternative to running the same aggregation query multiple times a day.
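A minimal sketch, assuming a hypothetical confidence column on the signals table (all names illustrative):

```sql
-- BigQuery refreshes this incrementally as new signals land.
CREATE MATERIALIZED VIEW analytics.mv_signal_summary AS
SELECT
  signal_name,
  COUNT(*) AS signal_count,
  AVG(confidence) AS avg_confidence
FROM analytics.conversation_signals
GROUP BY signal_name;
```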
BI ecosystem
Analysis Examples
Three ways to turn conversation signals into decisions, all using BigQuery's native BI integrations.
Signal → win rate correlation (Looker)
Build a Looker Explore on conversations joined to signals. Use Looker's measure types (count, average, ratio) to calculate win rates by signal quartile. Looker connects natively to BigQuery with zero driver configuration — point it at your dataset, define your LookML model, and every dimension in your schema is immediately explorable. Slice win rates by signal_name, scoring_version, time period, and team.
Rep cohort trend (Connected Sheets)
Use BigQuery Connected Sheets to pull signal trends directly into Google Sheets. Sales managers get a self-service pivot table that queries live warehouse data — no Looker license needed, no exports, no stale CSVs. Filter by rep, time period, and signal type. The sheet refreshes on demand and can be scheduled. This is the fastest path to getting structured conversation data in front of non-technical stakeholders.
Signal correlation with BQML
Use BigQuery ML's CREATE MODEL to build a logistic regression predicting deal outcomes from signal values. Run ML entirely in SQL — no Python, no Jupyter, no data export to a separate ML platform. Pivot signal_name into feature columns, cast signal_value to FLOAT64, and train against a won/lost label. BQML handles feature encoding, train/test splits, and model evaluation. Iterate on features without leaving the warehouse.
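A sketch of that flow, with illustrative signal names (pricing_discussed, next_steps_set) and a hypothetical boolean won label on the conversations table:

```sql
-- Train a logistic regression on pivoted signal features, entirely in SQL.
CREATE OR REPLACE MODEL analytics.deal_outcome_model
OPTIONS (model_type = 'LOGISTIC_REG', input_label_cols = ['won']) AS
SELECT
  IFNULL(MAX(IF(s.signal_name = 'pricing_discussed',
                SAFE_CAST(s.signal_value AS FLOAT64), NULL)), 0) AS pricing_discussed,
  IFNULL(MAX(IF(s.signal_name = 'next_steps_set',
                SAFE_CAST(s.signal_value AS FLOAT64), NULL)), 0) AS next_steps_set,
  ANY_VALUE(c.won) AS won
FROM analytics.conversations AS c
JOIN analytics.conversation_signals AS s USING (call_id)
WHERE c.call_timestamp >= TIMESTAMP('2024-01-01')
GROUP BY c.deal_id;
```

ML.EVALUATE and ML.PREDICT then run against the trained model in the same SQL-only workflow.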
Semarize integration
Leveraging Serverless Compute for Signal Analytics
BigQuery's serverless architecture changes how you think about signal storage — you optimize for scan efficiency rather than cluster sizing.
Partition conversation_signals by ingestion date and cluster on call_id. This combination means a query like "show me all signals for calls in the last 7 days" only scans the relevant partitions, and within those partitions BigQuery skips blocks that don't match the call_id filter. On a 10M-row table, this typically reduces bytes scanned by 90%+.
Use ARRAY<STRUCT> for participants in the conversations table instead of a separate join table. BigQuery handles nested/repeated fields natively and they co-locate with the parent row, eliminating shuffle-heavy JOINs. This is a BigQuery-specific advantage over traditional normalized schemas.
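Querying the nested field is a plain UNNEST, with no join table required (table and field names are illustrative):

```sql
-- Flatten the nested participants array inline, no separate table needed.
SELECT
  c.call_id,
  p.email AS participant_email
FROM analytics.conversations AS c,
  UNNEST(c.participants) AS p
WHERE c.call_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND p.role = 'rep';
```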
BQML lets you build signal-to-outcome correlation models without moving data out of BigQuery. CREATE MODEL with a logistic regression on signal values, confidence scores, and deal outcomes to find which conversation patterns predict closed-won deals.
Connected Sheets lets non-technical stakeholders query signal tables directly from Google Sheets, with BigQuery handling the compute. This bridges the gap between the data warehouse and the people who need to review conversation quality scores without learning SQL.