Azure Synapse — Conversation Analytics in the Microsoft Stack

Azure Synapse Analytics unifies data warehousing, data lake analytics, and data integration in one service. If your organization runs on Microsoft — Dynamics 365, Power BI, Teams, ADLS — Synapse is where conversation intelligence slots into the existing stack without adding another vendor.

Recommended default pattern

ADLS Gen2 staging + COPY INTO dedicated SQL pool. Land files in Azure Data Lake Storage Gen2, run COPY INTO to load into dedicated SQL pool tables. Use Data Factory pipelines to orchestrate the end-to-end flow: fetch from source, stage in ADLS, load into the pool, notify downstream consumers.

Fit check

When to Choose Synapse

Synapse is the strongest choice when your analytics stack is already rooted in the Microsoft ecosystem. Here are the signals that confirm Synapse is the right fit.

Your organization is a Microsoft shop — Dynamics 365, Power BI, Azure AD, and Teams are already in production

Synapse Link for Dataverse gives you near-real-time CRM data in your warehouse without building separate ETL pipelines

Power BI DirectQuery on dedicated SQL pools means dashboards query live warehouse data — no extracts, no stale data

Serverless SQL pools let you query Parquet/JSON files in ADLS without loading them — pay only for data processed

Dedicated SQL pools give you predictable performance for production dashboards and scheduled reporting

Azure Data Factory (integrated into Synapse Studio) provides visual pipeline building for non-engineering teams

Architecture choices

Three Synapse Decisions That Drive Cost and Performance

These three choices determine how well your Synapse deployment handles conversation data at scale. Get them right early to avoid expensive migrations later.

1. Dedicated vs serverless SQL pool

Dedicated pools give guaranteed compute for production dashboards (DWU-based pricing, always running). Serverless pools charge per TB scanned — ideal for ad-hoc exploration and querying data lake files without loading. Many teams use both: serverless for exploration, dedicated for production reporting.

Start with serverless for development, add a dedicated pool when Power BI dashboards go live.

2. Distribution strategy

HASH distribution on call_id for the signals table co-locates signals for a conversation on the same distribution. ROUND_ROBIN for the conversations table spreads data evenly. REPLICATE for small lookup tables (scoring_versions, teams) avoids data movement during JOINs entirely. Wrong distribution causes expensive data movement — the query plan will show BroadcastMoveOperation or ShuffleMoveOperation when distributions are mismatched.

Always check distribution skew after initial loads.
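
As a sketch, the distribution choices above translate into WITH clauses like these (column lists abbreviated; full definitions appear in the schema section below):

```sql
-- Signals co-located by conversation: joins on call_id stay distribution-local
CREATE TABLE dbo.conversation_signals (
    signal_id   NVARCHAR(256) NOT NULL,
    call_id     NVARCHAR(256) NOT NULL,
    signal_name NVARCHAR(128)
)
WITH (
    DISTRIBUTION = HASH(call_id),
    CLUSTERED COLUMNSTORE INDEX
);

-- Small lookup table: replicated to every distribution, no movement on JOIN
CREATE TABLE dbo.scoring_versions (
    scoring_version NVARCHAR(64) NOT NULL,
    description     NVARCHAR(1024)
)
WITH (
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);
```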

3. Synapse Link for Dataverse

If you use Dynamics 365, enable Synapse Link to automatically replicate CRM data to ADLS. Query it from Synapse alongside conversation signals without building a separate CRM data pipeline. Join opportunity data with call signals in a single SQL query. The trade-off is a slight replication lag (minutes, not seconds).

Enable for every Dataverse table your BI team queries regularly.
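
A hedged sketch of the join described above. The Dataverse database name (dataverse_crm) is a placeholder, and the exact table and column names (opportunity, opportunityid, estimatedvalue) depend on your Dataverse schema; this assumes both the replicated CRM data and your signal tables are reachable from the same pool (for example, via external tables in serverless).

```sql
-- Illustrative: CRM opportunities joined with conversation signals in one query
SELECT
    o.name           AS opportunity_name,
    o.estimatedvalue AS estimated_value,
    s.signal_name,
    s.signal_value
FROM dataverse_crm.dbo.opportunity AS o
JOIN dbo.conversations             AS c ON c.deal_id = o.opportunityid
JOIN dbo.conversation_signals      AS s ON s.call_id = c.call_id
WHERE s.signal_name = 'risk_flag';
```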

Schema design

Schema Design

A two-table model with Synapse-native types, distribution annotations, and clustered columnstore indexes. Partition by call_timestamp on monthly ranges for efficient time-range queries.

conversations

DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX, PARTITION (call_timestamp monthly)

Columns

call_id | NVARCHAR(256) | NOT NULL. Primary key. Unique identifier from the source platform.
source_platform | NVARCHAR(64) | Where the call originated — gong, zoom, teams, etc.
call_timestamp | DATETIME2 | When the call started. Partition column (monthly ranges).
duration_seconds | INT | Call duration in seconds.
participants | NVARCHAR(MAX) | JSON string containing participant names, roles, and email addresses.
deal_id | NVARCHAR(256) | Foreign key to CRM deal/opportunity.
account_id | NVARCHAR(256) | Foreign key to CRM account/company.
transcript_text | NVARCHAR(MAX) | Full transcript text.
metadata | NVARCHAR(MAX) | JSON string for source-specific metadata. Parse with OPENJSON.
ingested_at | DATETIME2 | DEFAULT GETUTCDATE(). When this row was inserted.
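
The column list above can be expressed as DDL roughly like this (the partition boundary dates are illustrative; extend them monthly to match your retention window):

```sql
CREATE TABLE dbo.conversations (
    call_id          NVARCHAR(256) NOT NULL,
    source_platform  NVARCHAR(64),
    call_timestamp   DATETIME2,
    duration_seconds INT,
    participants     NVARCHAR(MAX),
    deal_id          NVARCHAR(256),
    account_id       NVARCHAR(256),
    transcript_text  NVARCHAR(MAX),
    metadata         NVARCHAR(MAX),
    ingested_at      DATETIME2 DEFAULT GETUTCDATE()
)
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX,
    -- Monthly range partitions on the call timestamp; boundaries are examples
    PARTITION (call_timestamp RANGE RIGHT FOR VALUES
        ('2024-01-01', '2024-02-01', '2024-03-01'))
);
```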

conversation_signals

DISTRIBUTION = HASH(call_id), CLUSTERED COLUMNSTORE INDEX

Columns

signal_id | NVARCHAR(256) | NOT NULL. Unique identifier for this signal row.
call_id | NVARCHAR(256) | NOT NULL. Foreign key to conversations. HASH distribution key.
signal_name | NVARCHAR(128) | Name of the extracted signal — overall_score, risk_flag, pain_point.
signal_value | NVARCHAR(1024) | The extracted value. Cast to appropriate type in queries.
signal_type | NVARCHAR(64) | Type hint — score, boolean, category, extraction, count.
confidence | FLOAT | Confidence score from the evaluation (0.0 to 1.0).
scoring_version | NVARCHAR(64) | Version of the evaluation framework that produced this signal.
evaluated_at | DATETIME2 | When this signal was generated.

JSON handling: Use NVARCHAR(MAX) for JSON data — Synapse dedicated SQL pools don't have a native JSON type. Parse with OPENJSON for arrays and JSON_VALUE for scalar extractions. For frequently-queried fields, extract into typed columns rather than parsing at query time.
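
A sketch of both parsing patterns; the JSON field paths ($.recording_url, $.name, $.role, $.email) are illustrative and depend on what your source platform actually emits:

```sql
-- Scalar extraction: pull one field out of the metadata JSON string
SELECT call_id,
       JSON_VALUE(metadata, '$.recording_url') AS recording_url
FROM dbo.conversations;

-- Array expansion: one row per participant from the participants JSON
SELECT c.call_id, p.*
FROM dbo.conversations AS c
CROSS APPLY OPENJSON(c.participants)
    WITH (
        name  NVARCHAR(256) '$.name',
        role  NVARCHAR(64)  '$.role',
        email NVARCHAR(256) '$.email'
    ) AS p;
```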
Modeling guide: For Semarize-specific modeling, versioning, and grounding lineage best practices, see the canonical modeling guide.

Data loading

Loading Strategy

ADLS Gen2 staging with COPY INTO is the recommended starting point. It gives you the fastest load path with minimal infrastructure.

ADLS Gen2 + COPY INTO

Recommended starting pattern

1. Land files in ADLS Gen2

Stage Parquet files (preferred), CSV, or JSON in your Azure Data Lake Storage Gen2 container. Organize by date or source for easy partition management.

2. COPY INTO for the fastest load path

COPY INTO reads directly from ADLS with columnar-aware optimization. Parquet format loads significantly faster than CSV and preserves types. COPY INTO handles file-format quirks (delimiters, encoding, compression) without external staging objects.

3. Deduplication pattern

COPY INTO a staging table, then use a DELETE/INSERT pattern within a transaction to merge into the target table. Match on call_id for conversations, signal_id for signals. This prevents duplicate rows from retries or overlapping file drops.
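
The steps above can be sketched as follows. The storage account name, container path, and staging table are placeholders, and this assumes the workspace's managed identity has Storage Blob Data Reader on the account:

```sql
-- 1) Load new Parquet files into a staging table
COPY INTO dbo.conversations_stage
FROM 'https://<storage-account>.dfs.core.windows.net/landing/conversations/*.parquet'
WITH (
    FILE_TYPE  = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);

-- 2) Merge into the target keyed on call_id, inside one transaction,
--    so retried or overlapping file drops cannot create duplicates
BEGIN TRANSACTION;

DELETE FROM dbo.conversations
WHERE call_id IN (SELECT call_id FROM dbo.conversations_stage);

INSERT INTO dbo.conversations
SELECT * FROM dbo.conversations_stage;

COMMIT;
```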

Orchestrating with Data Factory

Data Factory pipelines trigger on schedule or event (new file in ADLS). Copy Activity moves data from external sources to ADLS. Stored Procedure Activity runs COPY INTO and dedup logic. Monitor pipeline runs directly in Synapse Studio — no need to leave the workspace.

Ideal for teams without dedicated data engineering resources.

Serverless SQL pool for ad-hoc querying

Query Parquet files directly in ADLS without loading them. Use OPENROWSET to read files on demand. Create external tables for repeated queries against the same file paths. Good for exploring raw transcripts before deciding what to load into dedicated pools — you only pay for bytes scanned.

Use for exploration and development before committing to dedicated pool schema.
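
A minimal serverless exploration query, assuming the serverless pool has access to the storage account (the path is a placeholder):

```sql
-- Ad-hoc: read staged Parquet directly from ADLS; billed per TB scanned
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/landing/conversations/*.parquet',
    FORMAT = 'PARQUET'
) AS conversations;
```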

Production operations

Operational Playbook

Synapse-specific practices for keeping your conversation data warehouse performant and cost-efficient in production.

DWU scaling

Dedicated SQL pools scale in DWU increments (DW100c to DW30000c). Start at DW100c for development. Scale up for production loads and reporting hours. Scale down overnight. Pause the pool entirely during off-hours to stop billing — a paused pool costs nothing.
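
Scaling is a single statement run against the master database (pool and objective names here are examples):

```sql
-- Scale the dedicated pool up for reporting hours; run from master
ALTER DATABASE conversation_dw
MODIFY (SERVICE_OBJECTIVE = 'DW400c');

-- Pausing is done outside T-SQL, e.g. via the Azure CLI:
--   az synapse sql pool pause --name conversation_dw
--       --workspace-name <workspace> --resource-group <rg>
```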

Distribution skew monitoring

Check DBCC PDW_SHOWSPACEUSED to identify distribution skew. If one distribution has significantly more data than others, your HASH key may be poorly chosen. High-cardinality, evenly-distributed columns make the best HASH keys. Re-evaluate distribution strategy and use CTAS to rebuild if needed.
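
The skew check is one DBCC command per table:

```sql
-- One row per distribution (60 total): compare row counts and reserved space.
-- A distribution far above the mean indicates hot values in the HASH key.
DBCC PDW_SHOWSPACEUSED ("dbo.conversation_signals");
```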

Statistics maintenance

Dedicated SQL pools can auto-create single-column statistics (AUTO_CREATE_STATISTICS is on by default on newer workspaces) but never auto-update them. CREATE STATISTICS on columns used in WHERE, JOIN, GROUP BY, and ORDER BY clauses, and update statistics after significant data loads (more than 10% change). Stale statistics cause the query optimizer to generate poor execution plans.
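
For example (statistics names are arbitrary):

```sql
-- Single-column statistics on common predicate and join columns
CREATE STATISTICS st_signals_call_id ON dbo.conversation_signals (call_id);
CREATE STATISTICS st_signals_name    ON dbo.conversation_signals (signal_name);

-- Refresh all statistics on the table after a large load
UPDATE STATISTICS dbo.conversation_signals;
```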

Result set caching

Enable result set caching for frequently-repeated dashboard queries. Cached results return instantly without consuming DWUs. Caches are invalidated automatically when underlying data changes. Particularly effective for Power BI dashboards with predictable query patterns.

CTAS for table rebuilds

Use CREATE TABLE AS SELECT (CTAS) to rebuild tables with better distribution, partitioning, or index strategies. CTAS is faster than ALTER TABLE for structural changes in Synapse. Rename the old table aside, rename the new one into place, then drop the old table.
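
A sketch of the rebuild-and-swap, here changing the distribution of the signals table:

```sql
-- Rebuild with the desired distribution and index
CREATE TABLE dbo.conversation_signals_new
WITH (DISTRIBUTION = HASH(call_id), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.conversation_signals;

-- Swap names, then drop the old table once verified
RENAME OBJECT dbo.conversation_signals     TO conversation_signals_old;
RENAME OBJECT dbo.conversation_signals_new TO conversation_signals;
DROP TABLE dbo.conversation_signals_old;
```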

Monitoring and cost management

Use Synapse Studio to monitor query performance, DWU utilization, and data movement operations. Set Azure cost alerts at the resource group level. Review the DMV sys.dm_pdw_exec_requests for slow queries and sys.dm_pdw_nodes_db_column_store_row_group_physical_stats for columnstore health.
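
A starting point for the slow-query review mentioned above:

```sql
-- Ten slowest recently completed requests in the dedicated pool
SELECT TOP 10
    request_id,
    submit_time,
    total_elapsed_time,  -- milliseconds
    command
FROM sys.dm_pdw_exec_requests
WHERE status = 'Completed'
ORDER BY total_elapsed_time DESC;
```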

Microsoft BI ecosystem

Analysis Examples

Three patterns that leverage the Microsoft integration stack to turn conversation data into actionable intelligence.

1. Signal to deal health dashboard (Power BI DirectQuery)

Power BI connects directly to the dedicated SQL pool via DirectQuery. Build a report with a deal-level view showing latest signal scores, risk flags, and trend sparklines. DirectQuery means the report always shows current data — no scheduled refreshes, no stale extracts. Publish to the Power BI service for organization-wide access with row-level security mapped to Azure AD groups.

2. CRM + conversation signals in one view (Synapse Link)

Enable Synapse Link for Dynamics 365 Dataverse. Dynamics opportunity tables replicate automatically to ADLS. Join opportunity data with conversation signals in a single SQL query inside Synapse — no ETL pipeline needed. Build a unified view of pipeline health powered by both CRM metadata and conversation intelligence. Filter by sales stage, territory, or rep.

3. Pipeline monitoring and alerting (Data Factory + Logic Apps)

Data Factory pipelines log success/failure metrics into Azure Monitor. Connect Logic Apps to send Teams notifications when a load fails or when signal counts drop below expected thresholds. Build operational visibility into your conversation data pipeline without custom monitoring infrastructure.

Structured signals

Distribution, Synapse Link, and the Microsoft BI Stack

Synapse sits at the center of the Microsoft data stack — its value for conversation intelligence comes from native integration with Power BI, Dynamics 365, and Data Factory rather than raw compute speed.

HASH distribution on call_id for conversation_signals ensures all signals for a conversation land on the same distribution. When Power BI issues a DirectQuery that joins conversations and signals on call_id, the join happens locally on each distribution with zero data movement. Monitor distribution skew with DBCC PDW_SHOWSPACEUSED — if one distribution is 3x larger than the mean, your key has hot values.

Synapse Link for Dataverse mirrors Dynamics 365 CRM records (accounts, contacts, opportunities) into a serverless SQL pool in near-real-time. You can JOIN conversation signals in your dedicated pool with live CRM data in the serverless pool using cross-database queries — no ETL pipeline needed for the CRM side. The replication lag is typically under 10 minutes.

NVARCHAR columns in Synapse don’t have a JSONB equivalent with native operators. Use OPENJSON() and JSON_VALUE() to extract fields from stored JSON payloads, but extract frequently-queried signal fields into typed FLOAT and NVARCHAR columns during the COPY INTO stage. Columnar compression on typed columns is 5-8x more efficient than re-parsing JSON on every query.

Result set caching (enabled at the database level and toggled per session) stores the first execution of a query and returns cached results for subsequent identical queries. For BI dashboards where multiple users view the same signal summary, this avoids re-scanning the table entirely. DWU charges only apply to the first execution.

Full modeling guide: Modeling Semarize Outputs
CRM sync: For pushing signals back to your CRM, see the HubSpot guide or Salesforce guide.
