CRM & Data
PostgreSQL — The Operational Database for Conversation Signals
PostgreSQL is your application-layer database. It's where your backend reads and writes. ACID transactions, JSONB for flexible schemas, GIN indexes for fast containment queries, INSERT ON CONFLICT for idempotent writes, and 30 years of battle-tested reliability. This guide covers self-managed PostgreSQL, Amazon RDS, Google Cloud SQL, and any managed provider where you control the database directly — not Postgres-as-a-platform services like Supabase, which have their own guide.
Recommended default ingestion pattern
Application-layer upserts via your client library (psycopg, node-postgres, pgx). INSERT ... ON CONFLICT DO UPDATE keyed on call_id. Run from a cron job, serverless function, or webhook handler. Use COPY for initial bulk loads.
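A minimal sketch of that upsert, assuming a conversations table keyed on call_id (column names here are illustrative, not prescribed by this guide):

```sql
-- Idempotent write: re-running the same payload updates in place
-- instead of creating a duplicate row.
INSERT INTO conversations (call_id, call_timestamp, raw_response)
VALUES ($1, $2, $3)
ON CONFLICT (call_id) DO UPDATE
SET call_timestamp = EXCLUDED.call_timestamp,
    raw_response   = EXCLUDED.raw_response;
```

Your client library (psycopg, node-postgres, pgx) supplies the `$1..$3` parameters; the EXCLUDED pseudo-table refers to the row that failed to insert.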
Right tool for the job
When to Choose PostgreSQL
PostgreSQL is not always the answer. But for operational conversation data, it usually is. Choose it when these conditions match your situation.
Your application already uses PostgreSQL — adding conversation data tables is the simplest path with no new infrastructure, no new credentials, no new failure modes.
You need transactional writes with ACID guarantees — inserts, updates, and deletes are atomic and consistent. No eventual consistency surprises when your pipeline retries.
JSONB + GIN indexes give you document-store flexibility inside a relational database — no separate NoSQL layer needed. Store variable-structure metadata alongside typed columns.
INSERT ... ON CONFLICT (upsert) makes idempotent pipeline writes trivial — no staging tables, no merge statements, no dedup logic outside the database.
You need low-latency reads for application features — dashboards, API responses, real-time alerts — not just batch analytics on yesterday’s data.
You’re at a scale where millions (not billions) of rows are the norm — PostgreSQL handles this comfortably with proper indexing. Most conversation intelligence deployments live in this range.
Architecture choices
Three Architecture Trade-offs for Postgres
Most PostgreSQL deployments for conversation data come down to three architectural decisions. Get these right early and everything else follows.
When to add table partitioning
Native RANGE partitioning on call_timestamp makes sense above 50–100M rows. Below that, B-tree indexes on call_timestamp are sufficient. Partitioning adds real complexity — partition maintenance, cross-partition query planning, and migration tooling that needs to understand partition boundaries. Don’t add it prematurely. When you do need it, partition by month or quarter and automate new partition creation.
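When you do cross that threshold, monthly RANGE partitioning looks roughly like this (table and column names are illustrative; note that the partition key must be part of the primary key):

```sql
-- Parent table declares the partitioning scheme but holds no rows.
CREATE TABLE conversations (
    call_id        text        NOT NULL,
    call_timestamp timestamptz NOT NULL,
    raw_response   jsonb,
    PRIMARY KEY (call_id, call_timestamp)  -- must include the partition key
) PARTITION BY RANGE (call_timestamp);

-- One partition per month; automate creating the next one ahead of time.
CREATE TABLE conversations_2025_01 PARTITION OF conversations
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
```

Queries that filter on call_timestamp prune to the relevant partitions automatically.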
JSONB vs dedicated columns
Use JSONB for variable-structure data (participants, source-specific metadata, raw API responses). Extract frequently-queried values into typed columns (signal_value, signal_name, confidence). GIN indexes on JSONB enable @> containment queries; B-tree indexes on typed columns are faster for equality and range lookups. The mistake is going all-JSONB or all-columns — the right answer is both.
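A sketch of the both-worlds setup, assuming a raw_response JSONB column and a typed signal_name column (names illustrative):

```sql
-- GIN index on the JSONB column enables @> containment queries.
CREATE INDEX idx_conversations_raw_gin
    ON conversations USING GIN (raw_response);

-- Containment query served by the GIN index:
SELECT call_id
FROM conversations
WHERE raw_response @> '{"signals": [{"name": "competitor_mentioned"}]}';

-- B-tree on a typed column is faster for equality and range lookups:
CREATE INDEX idx_signals_name ON conversation_signals (signal_name);
SELECT * FROM conversation_signals WHERE signal_name = 'risk_flag';
```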
When to outgrow to a warehouse
PostgreSQL is your operational database. When analytical queries start competing with application queries for CPU and connections, or you need cross-source joins at billions of rows, add a warehouse (Snowflake, BigQuery) alongside — don’t replace PostgreSQL. Use it for writes and real-time reads; use the warehouse for analytics. Logical replication or a lightweight ETL can keep them in sync.
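If you go the logical-replication route, the PostgreSQL side is a publication that a downstream ETL or CDC consumer subscribes to (publication name is illustrative; the server must run with wal_level = logical):

```sql
-- Expose the two conversation tables to downstream consumers
-- (e.g. a warehouse loader) via logical replication.
CREATE PUBLICATION conversation_feed
    FOR TABLE conversations, conversation_signals;
```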
Schema design
Schema Design
Two tables with PostgreSQL-native types. The conversations table holds raw data; the conversation_signals table holds structured outputs. Foreign keys and unique constraints are enforced at the database level — unlike warehouses, where such constraints are typically declarative and unenforced, PostgreSQL guarantees your data integrity.
conversations
One row per call
Columns
conversation_signals
One row per extracted signal
Columns
Recommended indexes
Start with these; add more based on query patterns
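Exact columns depend on your sources, but one plausible shape of the two tables and starting indexes, assembled from the fields referenced throughout this guide (all names and types are illustrative):

```sql
CREATE TABLE conversations (
    call_id        text PRIMARY KEY,
    call_timestamp timestamptz NOT NULL,
    source         text,
    raw_response   jsonb,                 -- full API payload, queried via GIN
    created_at     timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE conversation_signals (
    id              bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    call_id         text NOT NULL REFERENCES conversations (call_id),
    signal_name     text NOT NULL,
    signal_value    text,
    confidence      numeric,
    scoring_version text NOT NULL,
    UNIQUE (call_id, signal_name, scoring_version)  -- upsert target
);

-- Starting indexes; add more as query patterns emerge.
CREATE INDEX idx_conversations_ts  ON conversations (call_timestamp);
CREATE INDEX idx_conversations_gin ON conversations USING GIN (raw_response);
CREATE INDEX idx_signals_call     ON conversation_signals (call_id);
```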
Data loading
Loading Strategy
Start with application-layer upserts. Scale up with connection pooling and batching. Reach for bulk tools only when you need them.
Recommended starting pattern: Application-layer upserts
Use INSERT ... ON CONFLICT (call_id) DO UPDATE for idempotent writes — retries and re-runs are safe without dedup logic
Your application code (Python, Node, Go) handles normalization and writes via client library — psycopg, node-postgres, pgx
Use COPY FROM for initial historical backfills — orders of magnitude faster than row-by-row INSERT for loading thousands of records at once
Wrap batch inserts in transactions for atomicity — either all rows in a batch commit or none do
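The backfill and batching points above can be sketched as (file path and column list are illustrative):

```sql
-- Initial historical backfill: COPY is orders of magnitude faster
-- than row-by-row INSERT. Wrapping it in a transaction means the
-- whole file loads or none of it does.
BEGIN;
COPY conversations (call_id, call_timestamp, raw_response)
    FROM '/tmp/backfill.csv' WITH (FORMAT csv, HEADER true);
COMMIT;
```

Note that server-side COPY reads a path on the database host; from a client machine, use psql's \copy or your driver's COPY support (e.g. psycopg's copy API) instead.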
Scaling up
Connection pooling
PgBouncer, pgpool-II, or built-in pooling in your client library. Mandatory for any production deployment. PostgreSQL forks a process per connection — without pooling, 100+ connections degrade performance.
Prepared statements
Reduce parse overhead for repeated inserts. Most client libraries support this natively. Significant improvement when inserting thousands of rows per batch.
Queue-based batching
For high-throughput webhooks, buffer in a queue (Redis, SQS) and batch-insert every few seconds. Avoids per-event connection overhead and smooths out traffic spikes.
pg_cron for scheduled jobs that run inside PostgreSQL itself — materialized view refreshes, partition maintenance, stale data cleanup.
Production concerns
Operational Playbook
PostgreSQL rewards teams who understand its internals. These are the operational areas that matter most for conversation data workloads.
Bloat and VACUUM
PostgreSQL’s MVCC creates dead tuples on updates and deletes. Autovacuum handles most cases, but monitor pg_stat_user_tables for dead tuple counts. For heavily-updated signal tables, tune autovacuum_vacuum_scale_factor down (0.01 instead of default 0.2). Set autovacuum_analyze_scale_factor to 0.005. If you see sequential scans slowing down, dead tuples are likely the cause.
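The per-table tuning and monitoring described above, as SQL (table name illustrative):

```sql
-- Vacuum/analyze the signals table far more aggressively than the
-- 0.2 / 0.1 defaults, since it is heavily updated.
ALTER TABLE conversation_signals SET (
    autovacuum_vacuum_scale_factor  = 0.01,
    autovacuum_analyze_scale_factor = 0.005
);

-- Watch dead tuple counts to verify autovacuum is keeping up.
SELECT relname, n_dead_tup, n_live_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```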
Index maintenance
Monitor index bloat with pgstattuple or pg_stat_user_indexes. REINDEX CONCURRENTLY when bloat exceeds 30–40%. Don’t over-index — every index slows writes and consumes disk. If a query doesn’t use an index, drop it.
Connection pooling
PostgreSQL forks a process per connection. Without pooling, 100+ connections degrade performance significantly. PgBouncer in transaction mode is the standard solution. Most managed providers (RDS, Cloud SQL) have built-in pooling options. Size your pool to match your workload, not your maximum possible connections.
EXPLAIN ANALYZE everything
Before deploying a new query, check its execution plan. Sequential scans on large tables are your enemy. Look for Index Scan or Index Only Scan. If you see a Seq Scan with a filter that discards most rows, you need an index. Make this a habit for every query that hits production.
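A typical check, using an illustrative query against the signals table:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT call_id, signal_value
FROM conversation_signals
WHERE signal_name = 'risk_flag';
-- Good: "Index Scan using idx_signals_name ..."
-- Bad:  "Seq Scan ... Rows Removed by Filter: <large number>"
--       => the filter discards most rows; add an index on signal_name.
```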
Backup and recovery
pg_dump for logical backups, pg_basebackup + WAL archiving for point-in-time recovery. Managed providers handle this automatically. Test restores periodically — a backup you’ve never tested is not a backup.
Schema migrations
Use a migration tool (Flyway, Alembic, Prisma Migrate, golang-migrate). Never ALTER TABLE in production without testing on a staging copy first. Additive changes (new nullable columns) are safe — they don’t lock the table. Column type changes and NOT NULL additions lock the table and require careful planning.
Practical examples
Analysis Examples
Three patterns that use PostgreSQL-native tools. No external compute needed.
Signal → deal outcome dashboard (Metabase)
Metabase connects directly to PostgreSQL. Build a question that joins conversation_signals to your deals table, groups by signal quartile, and shows win rates. Pin it to a dashboard. Metabase auto-refreshes on schedule. No ETL, no data warehouse, no data transformation layer — just a direct connection to the database your application already writes to.
Rep performance trend (materialized view + Grafana)
Create a materialized view that aggregates average scores by rep and week. Refresh it nightly with pg_cron. Connect Grafana to the materialized view for time-series charts. Faster than running the aggregation on every dashboard load because the materialized view stores pre-computed results. The refresh takes seconds; the dashboard query takes milliseconds.
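One way this could look, assuming conversations carries a rep_id column and signal values are numeric-as-text (all names are illustrative, and the pg_cron extension must be installed):

```sql
-- Pre-computed weekly average score per rep.
CREATE MATERIALIZED VIEW rep_weekly_scores AS
SELECT c.rep_id,
       date_trunc('week', c.call_timestamp) AS week,
       avg(s.signal_value::numeric)         AS avg_score
FROM conversation_signals s
JOIN conversations c USING (call_id)
GROUP BY 1, 2;

-- A unique index is required for CONCURRENTLY refreshes,
-- which avoid locking out dashboard reads.
CREATE UNIQUE INDEX ON rep_weekly_scores (rep_id, week);

-- Nightly refresh at 03:00 via pg_cron.
SELECT cron.schedule('refresh-rep-scores', '0 3 * * *',
    $$REFRESH MATERIALIZED VIEW CONCURRENTLY rep_weekly_scores$$);
```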
Real-time risk alerts (LISTEN/NOTIFY)
When a signal with signal_name = 'risk_flag' and signal_value = 'high' is inserted, fire a NOTIFY event from a trigger. A listener process picks it up and posts to Slack or creates a CRM task. No polling needed. LISTEN/NOTIFY is built into PostgreSQL — it’s the simplest possible pub/sub for database events.
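The trigger side might look like this (function, trigger, and channel names are illustrative):

```sql
-- Fire a notification whenever a high-risk signal lands.
CREATE OR REPLACE FUNCTION notify_high_risk() RETURNS trigger AS $$
BEGIN
    IF NEW.signal_name = 'risk_flag' AND NEW.signal_value = 'high' THEN
        PERFORM pg_notify('risk_alerts',
            json_build_object('call_id', NEW.call_id)::text);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_risk_alert
    AFTER INSERT ON conversation_signals
    FOR EACH ROW EXECUTE FUNCTION notify_high_risk();
```

The listener process simply runs `LISTEN risk_alerts;` on a long-lived connection and reacts to each payload.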
Structured signals
Using JSONB and Upserts for Signal Ingestion
PostgreSQL's JSONB type and INSERT ON CONFLICT make it uniquely suited for ingesting conversation signals directly from your application layer.
Use INSERT ... ON CONFLICT (call_id, signal_name, scoring_version) DO UPDATE to make signal ingestion fully idempotent. If the same evaluation runs twice for the same call, the second write updates in place rather than creating duplicates. This eliminates the need for a staging-table dedup pattern that warehouses require.
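A sketch of that composite-key upsert, assuming the unique constraint covers (call_id, signal_name, scoring_version):

```sql
-- Re-running the same evaluation for the same call updates in place;
-- a new scoring_version creates a new row instead.
INSERT INTO conversation_signals
    (call_id, signal_name, signal_value, confidence, scoring_version)
VALUES ($1, $2, $3, $4, $5)
ON CONFLICT (call_id, signal_name, scoring_version) DO UPDATE
SET signal_value = EXCLUDED.signal_value,
    confidence   = EXCLUDED.confidence;
```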
Store the full API response as JSONB in the conversations table with a GIN index. This lets you run containment queries like raw_response @> '{"signals": [{"name": "competitor_mentioned"}]}' without extracting every possible field into columns upfront. Add typed columns for fields you query frequently.
Create a materialized view that joins conversations and conversation_signals for your most common dashboard queries. Refresh it on a schedule (REFRESH MATERIALIZED VIEW CONCURRENTLY) to keep dashboards fast without the overhead of maintaining separate summary tables. This bridges the gap between Postgres as an operational DB and the analytical queries your BI tool needs.
FAQ
Frequently Asked Questions
Explore