
Redshift — Conversation Analytics in the AWS Data Stack

Redshift is AWS's columnar warehouse. It is fast when you design tables correctly — distribution keys, sort keys, compression — and painful when you don't. This guide covers the Redshift-specific patterns for loading conversation data via S3, designing tables for performant queries, and operating the cluster in production.

What you'll learn

  • When Redshift is the right warehouse for conversation data — and when a different tool fits better
  • The three architectural decisions that determine query performance: distribution keys, sort keys, and VACUUM cadence
  • Schema design with Redshift-native types, SUPER columns, and distribution/sort annotations
  • S3 + COPY loading, manifest files, streaming ingestion, and the staging-table dedup pattern
  • Operational playbook — VACUUM/ANALYZE, WLM queues, concurrency scaling, and skew monitoring
  • Analysis patterns with QuickSight, Redshift ML, and federated queries to Aurora

Fit check

When to Choose Redshift

Redshift is not the only columnar warehouse, but it is the one that makes sense when your infrastructure is already deeply embedded in AWS. Choose Redshift when:

Your data stack is deeply AWS-native

S3, Glue, EMR, Lambda, Kinesis are already running. Redshift slots in without cross-cloud networking, IAM bridging, or data egress costs. Everything speaks the same IAM language.

Spectrum lets you query S3 without loading

Redshift Spectrum creates external tables over S3 data. Query raw transcripts in Parquet or JSONL sitting in S3 without COPY-ing them into the cluster. Pay only for the S3 scan - useful for ad-hoc analysis on data you do not need in hot storage.

Redshift ML brings SageMaker into SQL

CREATE MODEL trains a SageMaker model directly from a SQL query. Predict deal outcomes, score risk, or classify conversations without moving data out of Redshift. Inference runs as a SQL function in your SELECT statement.

Provisioned clusters give predictable pricing

For steady, high-volume workloads, provisioned clusters cost less per query than pay-per-query alternatives. No surprises from a runaway dashboard refresh. You know exactly what the cluster costs per hour.

Serverless is available for bursty workloads

Redshift Serverless offers pay-per-query economics for variable workloads. Simpler operations, no node management. Use it when query volume is unpredictable or you are still experimenting with your data model.

Federated queries JOIN across Redshift and RDS/Aurora

Query Redshift conversation signals and Aurora PostgreSQL CRM data in a single SQL statement. No ETL pipeline to replicate CRM tables into Redshift - the federated query engine reads from both sources at execution time.

Recommended default ingestion pattern: S3 staging + COPY command. This is the only performant way to load data into Redshift. Land files in S3 (Parquet preferred, JSONL acceptable), then run COPY. Individual INSERTs are catastrophically slow in Redshift — 10 to 100x slower than COPY for any non-trivial volume.

Architecture

Three Decisions Before Writing Your First COPY

Most Redshift performance problems and operational headaches trace back to getting one of these three things wrong early. Nail them upfront and you avoid the most common pain points that send teams scrambling to redesign tables under production load.

1

Distribution key choice

DISTSTYLE KEY on call_id for the conversation_signals table - this co-locates all signals for a conversation on the same node, making JOINs to the conversations table fast because both sides of the join are on the same slice. DISTSTYLE EVEN for the conversations table - uniform distribution across nodes. Wrong distribution keys cause expensive cross-node data movement (DS_BCAST_INNER or DS_DIST_BOTH in your EXPLAIN plan) during every JOIN. Check SVV_TABLE_INFO.skew_rows after loading - values above 4.0 mean your distribution is uneven.

2

Sort key strategy

COMPOUND sort key on (call_timestamp) for the conversations table. Redshift stores data in sorted 1MB blocks - a sort key on timestamp gives excellent range query performance for time-filtered queries (WHERE call_timestamp BETWEEN ...). For the signals table, COMPOUND sort on (call_id, evaluated_at) - this means queries that filter on a specific conversation or a time range of evaluations skip entire blocks. Interleaved sort keys are rarely worth the VACUUM overhead for conversation data. Stick with COMPOUND.

3

VACUUM and ANALYZE cadence

Redshift's automatic background vacuum is far more conservative than PostgreSQL's autovacuum. Schedule VACUUM DELETE ONLY (reclaims space from deleted rows) and VACUUM SORT ONLY (re-sorts rows added since the last sort operation) weekly, or immediately after large loads that use the staging-table DELETE/INSERT pattern. Run ANALYZE on all tables after significant data changes - without it, the query planner uses stale statistics and may choose terrible execution plans. Use SVV_TABLE_INFO to monitor unsorted percentage and act when it exceeds 20%.
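A minimal maintenance pass for the two tables in this guide might look like the following sketch (adjust the cadence to your load schedule):

```sql
-- Reclaim space left behind by the staging-table DELETE/INSERT pattern
VACUUM DELETE ONLY conversations;
VACUUM DELETE ONLY conversation_signals;

-- Re-sort rows appended since the last sort; by default VACUUM skips
-- tables that are already at least 95% sorted
VACUUM SORT ONLY conversations;
VACUUM SORT ONLY conversation_signals;

-- Refresh planner statistics so the optimizer sees current row counts
ANALYZE conversations;
ANALYZE conversation_signals;
```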

Schema design

Schema Design

A two-table model keeps raw conversation data separate from extracted signals. This means you can re-evaluate conversations with updated scoring logic without touching the source data. Use Redshift-native types and distribution/sort annotations throughout.

conversations

One row per call · DISTSTYLE EVEN · SORTKEY(call_timestamp)

Columns

call_id · VARCHAR(256) · Primary key (informational only - Redshift does not enforce). Unique identifier from the source platform.
source_platform · VARCHAR(64) · Where the call originated - gong, zoom, teams, etc.
call_timestamp · TIMESTAMP · When the call started. Sort key - enables fast range scans for time-filtered queries.
duration_seconds · INTEGER · Call duration in seconds.
participants · SUPER · JSON array of participant objects - names, roles, emails. Query with PartiQL syntax.
deal_id · VARCHAR(256) · Foreign key to CRM deal/opportunity.
account_id · VARCHAR(256) · Foreign key to CRM account/company.
transcript_text · VARCHAR(65535) · Full transcript text. For transcripts exceeding 65535 characters, store in S3 and query via Spectrum.
metadata · SUPER · Flexible JSON for source-specific fields. Query nested paths with PartiQL dot notation.
ingested_at · TIMESTAMP · DEFAULT GETDATE(). When this row was loaded into Redshift. Useful for pipeline debugging.
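Put together, the table above can be declared as follows - a sketch, with NOT NULL constraints as assumptions to adapt:

```sql
CREATE TABLE conversations (
    call_id          VARCHAR(256)  NOT NULL,   -- informational PK, not enforced
    source_platform  VARCHAR(64),
    call_timestamp   TIMESTAMP     NOT NULL,
    duration_seconds INTEGER,
    participants     SUPER,
    deal_id          VARCHAR(256),
    account_id       VARCHAR(256),
    transcript_text  VARCHAR(65535),
    metadata         SUPER,
    ingested_at      TIMESTAMP     DEFAULT GETDATE(),
    PRIMARY KEY (call_id)
)
DISTSTYLE EVEN
SORTKEY (call_timestamp);
```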

conversation_signals

One row per extracted signal · DISTSTYLE KEY DISTKEY(call_id) · SORTKEY(call_id, evaluated_at)

Columns

signal_id · VARCHAR(256) · Primary key (informational). Unique identifier for this signal row.
call_id · VARCHAR(256) · Foreign key to conversations. Distribution key - co-locates signals with their parent conversation on the same node.
signal_name · VARCHAR(128) · Name of the extracted signal - overall_score, risk_flag, pain_point, budget_mentioned.
signal_value · VARCHAR(1024) · The extracted value. Cast in queries as needed (e.g. signal_value::FLOAT4).
signal_type · VARCHAR(64) · Type hint - score, boolean, category, extraction, count.
confidence · FLOAT4 · Confidence score from the evaluation (0.0 to 1.0).
scoring_version · VARCHAR(64) · Version of the evaluation framework that produced this signal.
evaluated_at · TIMESTAMP · When this signal was generated. Part of the compound sort key.
Important: Redshift does NOT enforce PRIMARY KEY or UNIQUE constraints — they are informational only, used by the query planner for optimization hints. Deduplication must be handled in your loading logic (staging table + DELETE/INSERT pattern).
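The signals table follows the same pattern, with the informational keys declared for the planner's benefit only (a sketch to adapt):

```sql
CREATE TABLE conversation_signals (
    signal_id       VARCHAR(256) NOT NULL,
    call_id         VARCHAR(256) NOT NULL,
    signal_name     VARCHAR(128),
    signal_value    VARCHAR(1024),
    signal_type     VARCHAR(64),
    confidence      FLOAT4,
    scoring_version VARCHAR(64),
    evaluated_at    TIMESTAMP,
    PRIMARY KEY (signal_id),                             -- not enforced
    FOREIGN KEY (call_id) REFERENCES conversations (call_id)  -- not enforced
)
DISTSTYLE KEY DISTKEY (call_id)
SORTKEY (call_id, evaluated_at);
```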
For Semarize-specific modeling, versioning, and grounding lineage best practices, see the canonical modeling guide.

Data loading

Loading Strategy

S3 + COPY is the backbone of Redshift data loading. It is the only performant path. Individual INSERTs are catastrophically slow — Redshift is designed for bulk operations, and COPY parallelizes reads across all slices in your cluster.

Recommended: S3 + COPY

Bulk, parallel, the only way that scales

1

Land files in S3

Write Parquet files (preferred - columnar, compressed, fastest COPY) or JSONL files to an S3 staging prefix. Parquet eliminates the need for Redshift to parse text at load time and compresses 3-5x better than JSONL.

2

Use manifest files

A manifest file is a JSON document that lists exactly which S3 files to load. It prevents partial loads (only some files picked up), enables retries (re-run the same manifest safely), and gives you an audit trail of what was loaded when.

3

Size files for parallel loading

Optimal file size: 1MB to 1GB each. Split your data into a number of files that is a multiple of your cluster's slice count - this ensures every slice gets work and no slice sits idle while others finish. A 4-node dc2.large cluster has 8 slices, so aim for 8, 16, or 24 files.

4

Dedup with staging tables

COPY into a staging table (identical schema, no sort/dist keys needed). DELETE matching rows from the target table WHERE call_id IN (SELECT call_id FROM staging). INSERT INTO target SELECT * FROM staging. Wrap in a single transaction and COMMIT. This is the standard Redshift upsert pattern.
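The four steps above combine into a single transaction. This is a sketch - the bucket, manifest path, and IAM role ARN are placeholders, and the manifest shape in the comment follows the documented COPY manifest format (Parquet entries must include content_length):

```sql
BEGIN;

-- 1. Staging table mirrors the target; temp tables vanish at session end
CREATE TEMP TABLE conversations_staging (LIKE conversations);

-- 2. Bulk-load from S3 via a manifest. Example manifest contents:
--    {"entries": [
--      {"url": "s3://my-bucket/staging/conversations/part-0000.parquet",
--       "mandatory": true, "meta": {"content_length": 52428800}}
--    ]}
COPY conversations_staging
FROM 's3://my-bucket/staging/conversations/batch.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS PARQUET
MANIFEST;

-- 3. Drop any rows the batch re-delivers (late arrivals, retries)
DELETE FROM conversations
USING conversations_staging
WHERE conversations.call_id = conversations_staging.call_id;

-- 4. Append the fresh versions
INSERT INTO conversations
SELECT * FROM conversations_staging;

COMMIT;
```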

When to use Streaming Ingestion

Kinesis or MSK (Kafka) direct to Redshift

You already have a Kinesis Data Streams or Amazon MSK (Kafka) stream of conversation events and want low-latency ingestion without landing in S3 first

Redshift streaming ingestion creates a materialized view over the stream - data appears in Redshift within seconds of arriving in Kinesis/MSK

Best for operational dashboards that need near-real-time signal updates, not for historical batch loads
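For Kinesis, the setup is an external schema plus a materialized view over the stream. A sketch - the stream name and IAM role are assumptions:

```sql
-- Map the Kinesis stream into Redshift
CREATE EXTERNAL SCHEMA kinesis_events
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftStreamingRole';

-- Materialized view over the stream; AUTO REFRESH pulls records within seconds
CREATE MATERIALIZED VIEW conversation_events_mv
AUTO REFRESH YES
AS SELECT
    approximate_arrival_timestamp,
    JSON_PARSE(kinesis_data) AS event_payload   -- SUPER column, query with PartiQL
FROM kinesis_events."conversation-events"
WHERE CAN_JSON_PARSE(kinesis_data);             -- skip malformed records
```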

Also worth knowing: AWS Glue provides managed ETL for transforming and loading data into Redshift. Zero-ETL integration with Aurora replicates CRM data from Aurora PostgreSQL/MySQL into Redshift without building a pipeline. The Redshift Data API enables serverless query execution from Lambda, Step Functions, or any HTTP client — useful for triggering loads without maintaining a persistent connection.

Operations

Operational Playbook

Redshift-specific operational concerns that matter for conversation data pipelines. These are the things that separate a working prototype from a production system that doesn't wake you up at 3am with a stuck WLM queue.

VACUUM and ANALYZE

Schedule weekly VACUUM DELETE ONLY and VACUUM SORT ONLY after bulk loads. Run ANALYZE on all tables after significant data changes. Use SVV_TABLE_INFO to check the unsorted column for sort degradation and skew_rows for distribution skew. Without VACUUM, deleted rows waste disk space and sort order degrades - queries scan more blocks than necessary. Without ANALYZE, the query planner uses stale statistics and produces bad execution plans.

Distribution skew monitoring

Check SVV_TABLE_INFO.skew_rows regularly. Values above 4.0 indicate data skew - one or more nodes hold significantly more data than others. If the signals table skews badly on call_id (some conversations produce hundreds of signals while others produce a handful), consider switching to DISTSTYLE EVEN and accepting the cross-node JOIN overhead. The trade-off is usually worth it when skew is severe.
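One query covers both health checks above:

```sql
SELECT "table",
       diststyle,   -- e.g. KEY(call_id) or EVEN
       skew_rows,   -- ratio of most- to least-populated slice; investigate above 4.0
       unsorted,    -- % of rows in the unsorted region; VACUUM SORT above 20%
       tbl_rows
FROM svv_table_info
WHERE "table" IN ('conversations', 'conversation_signals');
```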

WLM queue configuration

Separate queues for ETL loads (high memory allocation, low concurrency - 2 to 3 slots) and BI queries (lower memory per query, higher concurrency - 10 to 15 slots). Auto WLM works for most teams - Redshift dynamically allocates memory based on query needs. Switch to manual WLM only when you need guaranteed memory for specific workloads. Always set query timeouts (the queue's max_execution_time, or statement_timeout at the session level) to prevent runaway queries from blocking the entire queue.

Concurrency scaling

Enable concurrency scaling for read-heavy BI workloads. Redshift spins up transient clusters to handle burst read traffic when your main cluster's WLM queues are full. The first hour per day of concurrency scaling is free. Set concurrency_scaling_mode to auto on the WLM queues that serve dashboards and analyst queries. ETL queues generally do not benefit from concurrency scaling.

Late-arriving data

Conversation data often arrives hours or days after the call - transcript processing takes time, metadata gets backfilled, and some platforms batch their API exports. Design your loading for idempotent upserts (the staging + DELETE/INSERT pattern) rather than append-only INSERTs. Late arrivals should update existing rows, not create duplicates. Use ingested_at as an audit column to track when data actually landed versus when the call happened.

Node type and cluster sizing

Start with 2x dc2.large for development - fast local SSD storage, enough compute for prototyping. Move to ra3.xlplus or ra3.4xlarge when you need managed storage (Redshift Managed Storage keeps hot data on local SSD and warm data on S3 transparently). RA3 separates storage and compute - resize compute without re-loading data. Never start big and scale down - start small and add nodes when you see query queue times or COPY jobs exceeding your SLA.

BI patterns

Analysis Examples

Three patterns that leverage AWS-native BI and ML tools. Each one connects conversation signals to business outcomes using services that integrate natively with Redshift.

Signal to win-rate correlation (QuickSight)

Amazon QuickSight connects natively to Redshift - no JDBC driver setup, no credential management outside of AWS. Build a dataset that joins conversation_signals to your CRM opportunity table on deal_id. Create a calculated field for win rate by signal quartile - deals where discovery_score was in the top 25% closed at what rate versus the bottom 25%? Publish the dashboard with filters for team, time period, and signal type. QuickSight's SPICE engine caches the data so dashboard refreshes do not hit the Redshift cluster on every page load.
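The quartile calculation can also live in the Redshift dataset query itself. A sketch - crm_opportunities with is_won/is_closed columns is a hypothetical CRM table to swap for your own:

```sql
WITH deal_discovery AS (
    -- Average discovery score per deal across its conversations
    SELECT c.deal_id,
           AVG(s.signal_value::FLOAT4) AS discovery_score
    FROM conversation_signals s
    JOIN conversations c ON c.call_id = s.call_id
    WHERE s.signal_name = 'discovery_score'
    GROUP BY c.deal_id
),
quartiled AS (
    SELECT d.deal_id,
           NTILE(4) OVER (ORDER BY d.discovery_score) AS score_quartile,
           o.is_won
    FROM deal_discovery d
    JOIN crm_opportunities o ON o.deal_id = d.deal_id   -- hypothetical CRM table
    WHERE o.is_closed
)
SELECT score_quartile,
       COUNT(*) AS closed_deals,
       SUM(CASE WHEN is_won THEN 1 ELSE 0 END)::FLOAT4 / COUNT(*) AS win_rate
FROM quartiled
GROUP BY score_quartile
ORDER BY score_quartile;
```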

Predictive deal scoring (Redshift ML)

Use CREATE MODEL to train a SageMaker model directly from a SQL query - SELECT signal features and deal outcomes from your tables, and Redshift handles the SageMaker integration automatically. Train on historical signals (discovery_score, risk_flag count, objection_count) plus deal outcomes (won/lost). Score active deals with a simple SELECT that includes the ML function as a column. No data movement out of Redshift, no separate ML infrastructure, no Python notebooks to maintain. Retrain monthly as your signal library grows.
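In outline, the train-then-score flow looks like this. The feature view, column names, bucket, and role are assumptions for illustration:

```sql
-- Train on closed deals; deal_training_features is a hypothetical view of
-- per-deal signal aggregates plus the won/lost outcome
CREATE MODEL deal_outcome
FROM (
    SELECT discovery_score, objection_count, risk_flag_count, won
    FROM deal_training_features
    WHERE is_closed
)
TARGET won
FUNCTION predict_deal_outcome
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (S3_BUCKET 'my-redshift-ml-staging');

-- Score open deals: the trained model is just a SQL function
SELECT deal_id,
       predict_deal_outcome(discovery_score, objection_count, risk_flag_count)
           AS likely_to_win
FROM deal_training_features
WHERE NOT is_closed;
```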

Cross-source deal view (federated query)

Use federated queries to JOIN Redshift conversation_signals with Aurora PostgreSQL CRM data in a single SQL statement. No ETL pipeline to replicate CRM tables into Redshift. The query engine reads from both sources at execution time. Best for ad-hoc analysis where building a full replication pipeline is not justified - e.g. sales leadership wants a one-off report correlating conversation risk signals with pipeline stage and close date from Aurora. For regular dashboards, replicate the CRM data into Redshift for better performance.
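A sketch of the setup and one cross-source query - the Aurora endpoint, secret ARN, role, and the crm.opportunities table shape are all assumptions:

```sql
-- One-time setup: map the Aurora PostgreSQL schema into Redshift
CREATE EXTERNAL SCHEMA crm
FROM POSTGRES
DATABASE 'crm' SCHEMA 'public'
URI 'aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:crm-creds';

-- Correlate live CRM pipeline fields with Redshift risk signals
SELECT o.deal_id, o.stage, o.close_date,
       COUNT(*) AS risk_flag_count
FROM crm.opportunities o                        -- read from Aurora at execution time
JOIN conversations c        ON c.deal_id = o.deal_id
JOIN conversation_signals s ON s.call_id = c.call_id
WHERE s.signal_name = 'risk_flag'
GROUP BY o.deal_id, o.stage, o.close_date
ORDER BY risk_flag_count DESC;
```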

Structured signals

Columnar Compression and Distribution for Signal Tables

Redshift's columnar architecture means signal tables compress aggressively and scan efficiently — but only if you choose the right distribution and sort keys.

DISTSTYLE KEY on call_id for conversation_signals co-locates all signals for a single conversation on the same slice. When you JOIN conversations and conversation_signals on call_id, Redshift performs a local join with zero data movement across slices. Check that SVV_TABLE_INFO.skew_rows stays below 4.0 to confirm even distribution.

COMPOUND SORTKEY (call_id, evaluated_at) on conversation_signals means per-conversation lookups and evaluation-time ranges benefit from zone-map elimination, while SORTKEY (call_timestamp) on conversations serves time-range BI queries (the most common pattern) the same way. Redshift skips entire 1MB blocks that fall outside your filter, keeping scans fast even on ra3 managed storage with data in S3.

Use SUPER for the raw API response payload but extract high-cardinality signal fields into typed VARCHAR and FLOAT4 columns. Redshift applies AZ64 or LZO compression per column automatically — typed columns compress 5-10x better than the same data nested inside SUPER, and queries against them run against zone maps.
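PartiQL navigation over the SUPER columns looks like this - the participant attributes and metadata field names are illustrative, not a fixed schema:

```sql
-- Unnest the SUPER participants array into one row per participant
SELECT c.call_id,
       p.name::VARCHAR  AS participant_name,
       p.email::VARCHAR AS participant_email
FROM conversations c, c.participants p;

-- Dot notation into nested metadata
SELECT call_id, metadata.recording_url
FROM conversations
WHERE metadata.recording_url IS NOT NULL;
```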

Spectrum lets you query archived signal data in S3 directly alongside current data in local storage. Define an external table over partitioned Parquet files in S3 and UNION ALL with your local conversation_signals table to query across years of history without keeping it all on expensive local disks.
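A sketch of the archive pattern - the Glue database, bucket path, and external table name are placeholders:

```sql
-- Point an external schema at a Glue Data Catalog database
CREATE EXTERNAL SCHEMA archive
FROM DATA CATALOG
DATABASE 'conversation_archive'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table over archived Parquet in S3 (no data loaded into the cluster)
CREATE EXTERNAL TABLE archive.conversation_signals_2022 (
    signal_id    VARCHAR(256),
    call_id      VARCHAR(256),
    signal_name  VARCHAR(128),
    signal_value VARCHAR(1024),
    evaluated_at TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-bucket/archive/conversation_signals/2022/';

-- Query hot and archived signals together
SELECT signal_name, COUNT(*) AS n
FROM (
    SELECT signal_name FROM conversation_signals
    UNION ALL
    SELECT signal_name FROM archive.conversation_signals_2022
) all_signals
GROUP BY signal_name;
```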

Modeling guide: For the full schema, versioning strategy, and grounding lineage patterns, see Modeling Semarize Outputs.
CRM sync: For pushing signals back to your CRM, see the HubSpot guide or Salesforce guide.

