
PostgreSQL — The Operational Database for Conversation Signals

PostgreSQL is your application-layer database. It's where your backend reads and writes. ACID transactions, JSONB for flexible schemas, GIN indexes for fast containment queries, INSERT ON CONFLICT for idempotent writes, and 30 years of battle-tested reliability. This guide covers self-managed PostgreSQL, Amazon RDS, Google Cloud SQL, and any managed provider where you control the database directly — not Postgres-as-a-platform services like Supabase, which have their own guide.

Recommended default ingestion pattern

Application-layer upserts via your client library (psycopg, node-postgres, pgx). INSERT ... ON CONFLICT DO UPDATE keyed on call_id. Run from a cron job, serverless function, or webhook handler. Use COPY for initial bulk loads.
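In SQL, the pattern looks like this — a minimal sketch; the values and the column subset are illustrative, matching the conversations schema later in this guide:

```sql
-- Idempotent upsert keyed on call_id. Safe to re-run: a retry
-- updates the existing row instead of failing or duplicating.
INSERT INTO conversations (call_id, source_platform, call_timestamp, duration_seconds)
VALUES ('gong-12345', 'gong', '2025-01-15T14:30:00Z', 1800)  -- illustrative values
ON CONFLICT (call_id) DO UPDATE
SET source_platform  = EXCLUDED.source_platform,
    call_timestamp   = EXCLUDED.call_timestamp,
    duration_seconds = EXCLUDED.duration_seconds;
```

Your client library (psycopg, node-postgres, pgx) sends this with bound parameters in place of the literals.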

Right tool for the job

When to Choose PostgreSQL

PostgreSQL is not always the answer. But for operational conversation data, it usually is. Choose it when these conditions match your situation.

Your application already uses PostgreSQL — adding conversation data tables is the simplest path with no new infrastructure, no new credentials, no new failure modes.

You need transactional writes with ACID guarantees — inserts, updates, and deletes are atomic and consistent. No eventual consistency surprises when your pipeline retries.

JSONB + GIN indexes give you document-store flexibility inside a relational database — no separate NoSQL layer needed. Store variable-structure metadata alongside typed columns.

INSERT ... ON CONFLICT (upsert) makes idempotent pipeline writes trivial — no staging tables, no merge statements, no dedup logic outside the database.

You need low-latency reads for application features — dashboards, API responses, real-time alerts — not just batch analytics on yesterday’s data.

You’re at a scale where millions (not billions) of rows are the norm — PostgreSQL handles this comfortably with proper indexing. Most conversation intelligence deployments live in this range.

Architecture choices

Three Architecture Trade-offs for Postgres

Most PostgreSQL deployments for conversation data come down to three architectural decisions. Get these right early and everything else follows.

1. When to add table partitioning

Native RANGE partitioning on call_timestamp makes sense above 50–100M rows. Below that, B-tree indexes on call_timestamp are sufficient. Partitioning adds real complexity — partition maintenance, cross-partition query planning, and migration tooling that needs to understand partition boundaries. Don’t add it prematurely. When you do need it, partition by month or quarter and automate new partition creation.
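A sketch of monthly RANGE partitioning, assuming the conversations schema from this guide. Note that on a partitioned table the partition key must be part of the primary key, so the key widens to (call_id, call_timestamp):

```sql
-- Parent table, partitioned by month on call_timestamp (sketch).
CREATE TABLE conversations (
    call_id        TEXT NOT NULL,
    call_timestamp TIMESTAMPTZ NOT NULL,
    -- ...remaining columns from the schema section...
    PRIMARY KEY (call_id, call_timestamp)
) PARTITION BY RANGE (call_timestamp);

-- One partition per month; automate creation of future partitions
-- (pg_partman or a pg_cron job are common choices).
CREATE TABLE conversations_2025_01 PARTITION OF conversations
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
```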

2. JSONB vs dedicated columns

Use JSONB for variable-structure data (participants, source-specific metadata, raw API responses). Extract frequently-queried values into typed columns (signal_value, signal_name, confidence). GIN indexes on JSONB enable @> containment queries; B-tree indexes on typed columns are faster for equality and range lookups. The mistake is going all-JSONB or all-columns — the right answer is both.
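The two query shapes side by side — a sketch against the schema below; the call_type metadata field is illustrative:

```sql
-- Containment query on JSONB (served by a GIN index on metadata):
SELECT call_id
FROM conversations
WHERE metadata @> '{"call_type": "demo"}';  -- call_type is an assumed field

-- Equality on typed columns (served by a B-tree index):
SELECT call_id
FROM conversation_signals
WHERE signal_name = 'risk_flag'
  AND signal_value = 'high';
```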

3. When to outgrow to a warehouse

PostgreSQL is your operational database. When analytical queries start competing with application queries for CPU and connections, or you need cross-source joins at billions of rows, add a warehouse (Snowflake, BigQuery) alongside — don’t replace PostgreSQL. Use it for writes and real-time reads; use the warehouse for analytics. Logical replication or a lightweight ETL can keep them in sync.
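For the logical-replication route, the Postgres side is a publication; a replica or a CDC tool reads it and forwards changes toward the warehouse. A sketch (the subscription connection string is illustrative):

```sql
-- Publish the two conversation tables for downstream consumers.
CREATE PUBLICATION conversation_pub
    FOR TABLE conversations, conversation_signals;

-- On a Postgres read replica (not on Snowflake/BigQuery directly):
-- CREATE SUBSCRIPTION conversation_sub
--     CONNECTION 'host=primary dbname=app user=replicator'
--     PUBLICATION conversation_pub;
```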

Schema design

Schema Design

Two tables with PostgreSQL-native types. The conversations table holds raw data; the conversation_signals table holds structured outputs. Foreign keys and unique constraints are enforced at the database level, so referential integrity is guaranteed at write time — something most cloud warehouses declare but do not enforce.

conversations

One row per call

Columns

call_id | TEXT PRIMARY KEY | Unique identifier from the source platform. Natural key for upserts.
source_platform | TEXT | Where the call originated — gong, zoom, teams, etc.
call_timestamp | TIMESTAMPTZ NOT NULL | When the call started. Timezone-aware. B-tree index recommended. Partition key for large datasets.
duration_seconds | INTEGER | Call duration in seconds.
participants | JSONB | Array of participant objects — names, roles, email addresses. GIN index for @> containment queries.
deal_id | TEXT | Foreign key to CRM deal/opportunity.
account_id | TEXT | Foreign key to CRM account/company.
transcript_text | TEXT | Full transcript text. PostgreSQL handles large text natively.
metadata | JSONB | Flexible storage for source-specific metadata. GIN index for containment queries.
ingested_at | TIMESTAMPTZ DEFAULT now() | When this row was inserted into the database.

conversation_signals

One row per extracted signal

Columns

signal_id | TEXT PRIMARY KEY | Unique identifier for this signal row.
call_id | TEXT REFERENCES conversations(call_id) | Foreign key to conversations table. Database-enforced referential integrity.
signal_name | TEXT NOT NULL | Name of the extracted signal — overall_score, risk_flag, pain_point, etc.
signal_value | TEXT | The extracted value. Cast to appropriate type in queries.
signal_type | TEXT | Type hint — score, boolean, category, extraction, count.
confidence | REAL | Confidence score from the evaluation (0.0 to 1.0).
scoring_version | TEXT | Version of the evaluation framework that produced this signal.
evaluated_at | TIMESTAMPTZ | When this signal was generated.
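The two tables above, written out as DDL — a sketch that matches the column lists; adapt names and constraints to your own conventions:

```sql
CREATE TABLE conversations (
    call_id          TEXT PRIMARY KEY,
    source_platform  TEXT,
    call_timestamp   TIMESTAMPTZ NOT NULL,
    duration_seconds INTEGER,
    participants     JSONB,
    deal_id          TEXT,
    account_id       TEXT,
    transcript_text  TEXT,
    metadata         JSONB,
    ingested_at      TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE conversation_signals (
    signal_id       TEXT PRIMARY KEY,
    call_id         TEXT REFERENCES conversations(call_id),
    signal_name     TEXT NOT NULL,
    signal_value    TEXT,
    signal_type     TEXT,
    confidence      REAL,
    scoring_version TEXT,
    evaluated_at    TIMESTAMPTZ
);
```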

Recommended indexes

Start with these; add more based on query patterns

B-tree on call_timestamp | Fast range queries for time-based filtering and partitioning.
Composite B-tree on (call_id, signal_name) | Fast lookups for specific signals on specific calls. Covers the most common join pattern.
GIN on participants | Enables @> containment queries to find calls by participant.
GIN on metadata | Enables containment queries on flexible metadata fields.
For Semarize-specific modeling, versioning, and grounding lineage best practices, see the canonical modeling guide.
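As DDL, the starting indexes look like this (index names are illustrative):

```sql
CREATE INDEX idx_conversations_ts
    ON conversations (call_timestamp);
CREATE INDEX idx_signals_call_name
    ON conversation_signals (call_id, signal_name);
CREATE INDEX idx_conversations_participants
    ON conversations USING GIN (participants);
CREATE INDEX idx_conversations_metadata
    ON conversations USING GIN (metadata);
```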

Data loading

Loading Strategy

Start with application-layer upserts. Scale up with connection pooling and batching. Reach for bulk tools only when you need them.

Recommended starting pattern: Application-layer upserts

Use INSERT ... ON CONFLICT (call_id) DO UPDATE for idempotent writes — retries and re-runs are safe without dedup logic

Your application code (Python, Node, Go) handles normalization and writes via client library — psycopg, node-postgres, pgx

Use COPY FROM for initial historical backfills — orders of magnitude faster than row-by-row INSERT for loading thousands of records at once

Wrap batch inserts in transactions for atomicity — either all rows in a batch commit or none do
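A backfill sketch combining COPY with an explicit transaction; the file path and column subset are illustrative:

```sql
BEGIN;
-- Server-side COPY needs file access on the database host;
-- from psql, use \copy instead to read a client-side file.
COPY conversations (call_id, source_platform, call_timestamp, duration_seconds)
    FROM '/tmp/conversations_backfill.csv'
    WITH (FORMAT csv, HEADER true);
COMMIT;
```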

Scaling up

Connection pooling

PgBouncer, pgpool-II, or built-in pooling in your client library. Mandatory for any production deployment. PostgreSQL forks a process per connection — without pooling, 100+ connections degrade performance.

Prepared statements

Reduce parse overhead for repeated inserts. Most client libraries support this natively. Significant improvement when inserting thousands of rows per batch.

Queue-based batching

For high-throughput webhooks, buffer in a queue (Redis, SQS) and batch-insert every few seconds. Avoids per-event connection overhead and smooths out traffic spikes.

Also useful: Logical replication to a read replica for analytics queries that shouldn't compete with application writes. pg_cron for scheduled jobs that run inside PostgreSQL itself — materialized view refreshes, partition maintenance, stale data cleanup.

Production concerns

Operational Playbook

PostgreSQL rewards teams who understand its internals. These are the operational areas that matter most for conversation data workloads.

Bloat and VACUUM

PostgreSQL’s MVCC creates dead tuples on updates and deletes. Autovacuum handles most cases, but monitor pg_stat_user_tables for dead tuple counts. For heavily-updated signal tables, tune autovacuum_vacuum_scale_factor down (0.01 instead of default 0.2). Set autovacuum_analyze_scale_factor to 0.005. If you see sequential scans slowing down, dead tuples are likely the cause.
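The per-table tuning and the monitoring query from the paragraph above, as SQL:

```sql
-- Tighter autovacuum thresholds for the heavily-updated signals table.
ALTER TABLE conversation_signals SET (
    autovacuum_vacuum_scale_factor  = 0.01,
    autovacuum_analyze_scale_factor = 0.005
);

-- Watch dead tuple counts and last autovacuum runs.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```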

Index maintenance

Monitor index bloat with pgstattuple or pg_stat_user_indexes. REINDEX CONCURRENTLY when bloat exceeds 30–40%. Don’t over-index — every index slows writes and consumes disk. If a query doesn’t use an index, drop it.

Connection pooling

PostgreSQL forks a process per connection. Without pooling, 100+ connections degrade performance significantly. PgBouncer in transaction mode is the standard solution. Most managed providers (RDS, Cloud SQL) have built-in pooling options. Size your pool to match your workload, not your maximum possible connections.

EXPLAIN ANALYZE everything

Before deploying a new query, check its execution plan. Sequential scans on large tables are your enemy. Look for Index Scan or Index Only Scan. If you see a Seq Scan with a filter that discards most rows, you need an index. Make this a habit for every query that hits production.
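A typical pre-deploy check against the schema in this guide (the query itself is illustrative):

```sql
EXPLAIN ANALYZE
SELECT c.call_id, s.signal_value
FROM conversations c
JOIN conversation_signals s ON s.call_id = c.call_id
WHERE s.signal_name = 'overall_score'
  AND c.call_timestamp >= now() - interval '7 days';
-- Look for "Index Scan" / "Index Only Scan" nodes. A "Seq Scan"
-- whose filter discards most rows means a missing index.
```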

Backup and recovery

pg_dump for logical backups, pg_basebackup + WAL archiving for point-in-time recovery. Managed providers handle this automatically. Test restores periodically — a backup you’ve never tested is not a backup.

Schema migrations

Use a migration tool (Flyway, Alembic, Prisma Migrate, golang-migrate). Never ALTER TABLE in production without testing on a staging copy first. Additive changes (new nullable columns) are safe — they don’t lock the table. Column type changes and NOT NULL additions lock the table and require careful planning.

Practical examples

Analysis Examples

Three patterns that use PostgreSQL-native tools. No external compute needed.

1. Signal → deal outcome dashboard (Metabase)

Metabase connects directly to PostgreSQL. Build a question that joins conversation_signals to your deals table, groups by signal quartile, and shows win rates. Pin it to a dashboard. Metabase auto-refreshes on schedule. No ETL, no data warehouse, no data transformation layer — just a direct connection to the database your application already writes to.

2. Rep performance trend (materialized view + Grafana)

Create a materialized view that aggregates average scores by rep and week. Refresh it nightly with pg_cron. Connect Grafana to the materialized view for time-series charts. Faster than running the aggregation on every dashboard load because the materialized view stores pre-computed results. The refresh takes seconds; the dashboard query takes milliseconds.
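A sketch of that view and its nightly refresh. It assumes pg_cron is installed and that a rep_email field lives in metadata; adapt the grouping to wherever your deployment stores the rep identity:

```sql
CREATE MATERIALIZED VIEW rep_weekly_scores AS
SELECT
    c.metadata->>'rep_email'             AS rep_email,  -- assumed field
    date_trunc('week', c.call_timestamp) AS week,
    avg((s.signal_value)::numeric)       AS avg_score
FROM conversations c
JOIN conversation_signals s ON s.call_id = c.call_id
WHERE s.signal_name = 'overall_score'
GROUP BY 1, 2;

-- A unique index is required for REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX ON rep_weekly_scores (rep_email, week);

-- Nightly refresh at 03:00 via pg_cron.
SELECT cron.schedule('refresh-rep-scores', '0 3 * * *',
    $$REFRESH MATERIALIZED VIEW CONCURRENTLY rep_weekly_scores$$);
```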

3. Real-time risk alerts (LISTEN/NOTIFY)

When a signal with signal_name = 'risk_flag' and signal_value = 'high' is inserted, fire a NOTIFY event from a trigger. A listener process picks it up and posts to Slack or creates a CRM task. No polling needed. LISTEN/NOTIFY is built into PostgreSQL — it's the simplest possible pub/sub for database events.
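A sketch of the trigger side; the channel name risk_alerts is illustrative, and the listener process simply runs LISTEN risk_alerts on its own connection:

```sql
CREATE OR REPLACE FUNCTION notify_high_risk() RETURNS trigger AS $$
BEGIN
    IF NEW.signal_name = 'risk_flag' AND NEW.signal_value = 'high' THEN
        -- Payload carries just the identifiers; the listener
        -- fetches details and posts to Slack / creates the task.
        PERFORM pg_notify('risk_alerts',
            json_build_object('call_id',  NEW.call_id,
                              'signal_id', NEW.signal_id)::text);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_high_risk
    AFTER INSERT ON conversation_signals
    FOR EACH ROW EXECUTE FUNCTION notify_high_risk();
```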

Structured signals

Using JSONB and Upserts for Signal Ingestion

PostgreSQL's JSONB type and INSERT ... ON CONFLICT support make it uniquely suited to ingesting conversation signals directly from your application layer.

Use INSERT ... ON CONFLICT (call_id, signal_name, scoring_version) DO UPDATE to make signal ingestion fully idempotent. If the same evaluation runs twice for the same call, the second write updates in place rather than creating duplicates. This eliminates the need for a staging-table dedup pattern that warehouses require.
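A sketch of that write. The conflict target needs a unique constraint on those three columns (shown commented out), and the inserted values are illustrative:

```sql
-- One-time setup: the conflict target must be backed by a
-- unique constraint or index.
-- ALTER TABLE conversation_signals ADD CONSTRAINT uq_signal
--     UNIQUE (call_id, signal_name, scoring_version);

INSERT INTO conversation_signals
    (signal_id, call_id, signal_name, signal_value,
     confidence, scoring_version, evaluated_at)
VALUES
    ('sig-001', 'gong-12345', 'risk_flag', 'high',
     0.92, 'v2', now())  -- illustrative values
ON CONFLICT (call_id, signal_name, scoring_version) DO UPDATE
SET signal_value = EXCLUDED.signal_value,
    confidence   = EXCLUDED.confidence,
    evaluated_at = EXCLUDED.evaluated_at;
```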

Store the full API response as JSONB in the conversations table with a GIN index. This lets you run containment queries like raw_response @> '{"signals": [{"name": "competitor_mentioned"}]}' without extracting every possible field into columns upfront. Add typed columns for fields you query frequently.

Create a materialized view that joins conversations and conversation_signals for your most common dashboard queries. Refresh it on a schedule (REFRESH MATERIALIZED VIEW CONCURRENTLY) to keep dashboards fast without the overhead of maintaining separate summary tables. This bridges the gap between Postgres as an operational DB and the analytical queries your BI tool needs.

Full modeling guide: Modeling Semarize Outputs covers versioning strategies, grounding lineage schemas, and multi-kit deployment patterns.
CRM sync: For pushing signals back to your CRM, see the HubSpot guide or Salesforce guide.
