AssemblyAI - How to Get Your Transcript Data

A practical guide to getting your transcript data out of AssemblyAI - covering API access, historical backfill, incremental polling, webhook-triggered flows, and how to route structured data into your downstream systems.

What you'll learn

  • What transcript data you can extract from AssemblyAI - full text, word-level timestamps, speaker labels, and analysis features
  • How to access data via the AssemblyAI API - authentication, endpoints, and polling patterns
  • Three extraction patterns: historical backfill, incremental polling, and webhook-triggered
  • How to connect AssemblyAI data pipelines to Zapier, n8n, and Make
  • Advanced use cases - data science pipelines, compliance monitoring, QA automation, and custom analytics dashboards

Data

What Data You Can Extract From AssemblyAI

AssemblyAI is a speech-to-text API - you send it audio, and it returns structured transcript data. Unlike recording platforms, AssemblyAI doesn't capture audio itself. The value is in the rich structured output it produces from your audio files: transcripts, speaker labels, confidence scores, and a range of audio intelligence features.

Common fields teams care about

Full transcript text
Word-level timestamps and confidence scores
Speaker labels (via diarization)
Sentiment analysis per utterance
Entity detection (names, locations, etc.)
Auto chapters with summaries
PII redaction results
Content safety labels
Custom vocabulary / keyword boosting
Transcript ID and processing status

API Access

How to Get Transcripts via the AssemblyAI API

AssemblyAI exposes a REST API for transcription. The workflow is: authenticate with an API key, submit audio for transcription, poll for completion (or use a webhook), then retrieve the finished transcript.

1. Authenticate

AssemblyAI uses API key authentication. Pass your key in the authorization header on every request. Generate your API key from the AssemblyAI dashboard.

authorization: <your_api_key>
Content-Type: application/json
Your API key has full access to your account. Store it securely and never expose it in client-side code. Rotate your key from the AssemblyAI dashboard if compromised.

2. Submit audio for transcription

Send a POST /v2/transcript request with the audio_url of your file. Enable optional features like speaker diarization, sentiment analysis, or PII redaction in the same request.

POST https://api.assemblyai.com/v2/transcript

{
  "audio_url": "https://storage.example.com/call-recording.mp3",
  "speaker_labels": true,
  "sentiment_analysis": true,
  "entity_detection": true,
  "auto_chapters": true
}

The response returns a transcript object with an id and status: "queued". Use this ID to poll for completion or include a webhook_url to be notified when processing finishes.
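
As a rough illustration of the submit step, the request can be built in Python with only the standard library. This is a sketch rather than an official client: the helper names `build_submit_request` and `submit` are hypothetical, and the feature flags are passed through exactly as they appear in the JSON body above.

```python
import json
import urllib.request

API_BASE = "https://api.assemblyai.com/v2"


def build_submit_request(audio_url, api_key, webhook_url=None, **features):
    """Build (but do not send) a POST /v2/transcript request.

    `features` carries optional flags such as speaker_labels=True.
    """
    payload = {"audio_url": audio_url, **features}
    if webhook_url:
        # Per-transcript webhook: only set when a notification is wanted.
        payload["webhook_url"] = webhook_url
    return urllib.request.Request(
        f"{API_BASE}/transcript",
        data=json.dumps(payload).encode("utf-8"),
        headers={"authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def submit(request):
    """Send the request and return the parsed transcript object (id, status)."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Building the request separately from sending it keeps the payload easy to inspect and unit-test. In practice the key would typically be read from an environment variable rather than hard-coded.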

3. Retrieve the completed transcript

Poll GET /v2/transcript/{id} until the status field changes to "completed". The response contains the full transcript text, words array, utterances, and any analysis features you enabled.

GET https://api.assemblyai.com/v2/transcript/abc123def456

// Response (status: "completed")
{
  "id": "abc123def456",
  "status": "completed",
  "text": "Full transcript text here...",
  "words": [{ "text": "Full", "start": 0, "end": 300, "confidence": 0.97 }],
  "utterances": [{ "speaker": "A", "text": "...", "start": 0, "end": 5000 }]
}

The text field contains the full concatenated transcript. The words array provides per-word timestamps and confidence. The utterances array (when diarization is enabled) groups text by speaker.
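
The polling step can be sketched as a small loop. This is a minimal example under stated assumptions: `fetch` is a hypothetical callable that wraps GET /v2/transcript/{id} and returns the parsed JSON, and a failed job is assumed to report status "error" with an "error" message field (check the API reference for the exact failure shape).

```python
import time


def wait_for_transcript(fetch, transcript_id, interval_s=3.0, timeout_s=600.0,
                        sleep=time.sleep):
    """Poll until the transcript is completed, failed, or the timeout elapses."""
    waited = 0.0
    while True:
        transcript = fetch(transcript_id)
        status = transcript.get("status")
        if status == "completed":
            return transcript
        if status == "error":  # assumed failure status
            raise RuntimeError(transcript.get("error", "transcription failed"))
        if waited >= timeout_s:
            raise TimeoutError(f"transcript {transcript_id} still {status!r}")
        sleep(interval_s)
        waited += interval_s
```

Passing `sleep` as a parameter makes the loop testable without real delays; the interval and timeout defaults are illustrative, not recommended values.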

4. List transcripts and handle rate limits

Listing transcripts

Use GET /v2/transcript to list all your transcripts. Results are paginated - use the limit and after_id parameters to page through results. Transcripts are retained for 90 days.

Rate limits

AssemblyAI enforces rate limits on both API requests and concurrent transcription jobs. When you hit a limit, back off and retry. For bulk operations, queue submissions and process completions as they arrive rather than submitting all at once.
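
One common way to "back off and retry" is exponential backoff with jitter. The sketch below assumes your HTTP wrapper raises a hypothetical `RateLimited` exception when the API signals a rate limit (typically an HTTP 429 response); the delay values are placeholders to tune for your account.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical exception raised by your HTTP wrapper on a rate-limit response."""


def with_backoff(call, max_attempts=5, base_delay_s=1.0, max_delay_s=30.0,
                 sleep=time.sleep):
    """Run `call`, retrying with exponential backoff plus jitter on RateLimited."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Jitter spreads retries out so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```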

Patterns

Key Extraction Flows

There are three practical patterns for getting transcripts out of AssemblyAI. The right choice depends on whether you're doing a one-off migration, running ongoing extraction, or need near real-time processing as transcripts complete.

Backfill (Historical Export)

One-off migration of existing transcripts

1. Call GET /v2/transcript to list all existing transcripts. Paginate through the full result set using the after_id parameter, collecting all transcript IDs.

2. For each transcript ID, fetch the full data via GET /v2/transcript/{id}. Pace requests to stay within rate limits.

3. Store each transcript with its metadata (ID, audio URL, status, created date, enabled features) in your data warehouse or object store.

4. Note the 90-day retention window - transcripts older than 90 days are automatically deleted. Run your backfill before data expires.

5. Once the backfill completes, run your analysis pipeline against the stored data in bulk.

Tip: Persist your last-seen transcript ID between batches. If the process is interrupted, you can resume from where you left off using the after_id parameter.
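
The backfill steps and the checkpoint tip can be sketched as a resumable loop. All five callables here are hypothetical: `fetch_page(cursor)` is assumed to wrap GET /v2/transcript with the limit and after_id parameters and return a list of transcript summaries (an empty list when exhausted), while `load_checkpoint` and `save_checkpoint` persist the last-seen ID wherever suits you. Check the API reference for the ordering of list results before relying on the last item of a page as the cursor.

```python
def backfill(fetch_page, fetch_transcript, store, load_checkpoint, save_checkpoint):
    """Page through all transcripts, storing each one; resumable via checkpoint."""
    cursor = load_checkpoint()  # None on the first run
    count = 0
    while True:
        page = fetch_page(cursor)
        if not page:
            return count
        for summary in page:
            store(fetch_transcript(summary["id"]))
            count += 1
        # Persist progress only after the whole page is stored, so an
        # interrupted run repeats at most one page.
        cursor = page[-1]["id"]
        save_checkpoint(cursor)
```

Because a page may be replayed after an interruption, `store` should be an upsert keyed on transcript ID rather than a blind insert.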

Incremental Polling

Ongoing extraction on a schedule

1. Set a cron job or scheduled trigger (hourly, daily, etc.) that runs your extraction script.

2. On each run, call GET /v2/transcript to list recent transcripts. Filter for status 'completed' and IDs newer than your last checkpoint.

3. Fetch the full transcript data for each new ID. Use the transcript ID as a deduplication key to avoid reprocessing.

4. Route each transcript and its metadata to your downstream pipeline - analysis tool, warehouse, or automation platform.

5. Update your stored checkpoint (last transcript ID or timestamp) for the next poll cycle.

Tip: Only fetch transcripts with status "completed". Queued or processing transcripts don't have data yet and will waste API calls.
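
A single poll cycle might look like the following sketch. The callables `list_recent`, `fetch_transcript`, and `route` are hypothetical wrappers around the list endpoint, the fetch endpoint, and your downstream pipeline; `seen_ids` stands in for whatever persistent store holds your deduplication keys.

```python
def poll_once(list_recent, fetch_transcript, route, seen_ids):
    """Process completed transcripts that have not been handled yet.

    Returns the IDs processed in this cycle; `seen_ids` is updated in place.
    """
    processed = []
    for summary in list_recent():
        transcript_id = summary["id"]
        # Skip queued/processing jobs and anything already handled.
        if summary.get("status") != "completed" or transcript_id in seen_ids:
            continue
        route(fetch_transcript(transcript_id))
        seen_ids.add(transcript_id)  # mark as done only after routing succeeds
        processed.append(transcript_id)
    return processed
```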

Webhook-Triggered

Near real-time on transcript completion

1. When submitting a transcription request via POST /v2/transcript, include a webhook_url parameter pointing to your endpoint.

2. AssemblyAI POSTs to your webhook URL when the transcript is completed (or if it fails). The payload includes the transcript ID and status.

3. On webhook receipt, fetch the full transcript via GET /v2/transcript/{id} using the ID from the webhook payload.

4. Route the transcript and metadata downstream - to your analysis pipeline, database, or automation tool.

Note: Webhooks are set per-transcript at submission time, not globally. Every POST /v2/transcript request needs the webhook_url if you want a completion notification for that specific job.
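
A webhook receiver for this flow can be sketched with Python's built-in http.server. The payload is assumed to be a JSON object with transcript_id and status fields, as described above; `fetch_transcript` and `route` are hypothetical callables you supply, and a production endpoint would also need HTTPS and request verification.

```python
import json
from http.server import BaseHTTPRequestHandler


def handle_completion(payload, fetch_transcript, route):
    """Process one webhook payload and return the HTTP status to respond with."""
    if not isinstance(payload, dict) or not payload.get("transcript_id"):
        return 400
    if payload.get("status") != "completed":
        return 200  # acknowledge failed jobs without fetching anything
    route(fetch_transcript(payload["transcript_id"]))
    return 200


def make_handler(fetch_transcript, route):
    """Create a request handler class bound to your fetch and route callables."""
    class WebhookHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            try:
                payload = json.loads(self.rfile.read(length) or b"{}")
            except json.JSONDecodeError:
                payload = None
            self.send_response(handle_completion(payload, fetch_transcript, route))
            self.end_headers()
    return WebhookHandler
```

To serve it locally, pass the class to `http.server.HTTPServer(("", 8000), make_handler(fetch, route))` and call `serve_forever()`; the port is arbitrary.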

Automation

Send AssemblyAI Transcripts to Automation Tools

Once you can extract transcripts from AssemblyAI, the next step is routing them through Semarize for structured analysis and into your downstream systems. Below are end-to-end example flows - each showing the full pipeline from AssemblyAI through Semarize evaluation to CRM, Slack, or database output.

Zapier - No-code automation

AssemblyAI → Zapier → Semarize → CRM

Detect completed AssemblyAI transcriptions via webhook, fetch the transcript, send it to Semarize for structured analysis, then write the scored output - signals, flags, and evidence - directly to your CRM.

Example Zap

1. Trigger: Webhooks by Zapier (Catch Hook) - catches the AssemblyAI completion webhook
   Payload: transcript_id, status

2. Webhooks by Zapier - fetch transcript from AssemblyAI
   Method: GET
   URL: https://api.assemblyai.com/v2/transcript/{{transcript_id}}
   Header: authorization: <api_key>
   Result: transcript returned

3. Webhooks by Zapier - POST /v1/runs (sync) to Semarize
   Method: POST
   URL: https://api.semarize.com/v1/runs
   Auth: Bearer smz_live_...
   Body: { kit_code, mode: "sync", input: { transcript } }
   Result: structured output returned

4. Formatter by Zapier - extract brick values from the Semarize response
   Extract: bricks.sentiment_score.value
   Extract: bricks.risk_flag.value
   Extract: bricks.key_topics.value

5. Salesforce (Update Record) - write scored signals to record
   Object: Contact or Case
   Sentiment: {{sentiment_score}}
   Risk Flag: {{risk_flag}}
   Topics: {{key_topics}}

Setup steps

1. Create a new Zap. Choose "Webhooks by Zapier" as the trigger (Catch Hook). Copy the webhook URL - you'll use this as the webhook_url when submitting transcriptions to AssemblyAI.

2. Add a "Webhooks by Zapier" Action (Custom Request) to fetch the full transcript from AssemblyAI. Set method to GET, URL to https://api.assemblyai.com/v2/transcript/{{transcript_id}}, and add your API key in the authorization header.

3. Add a second "Webhooks by Zapier" Action. Set method to POST, URL to https://api.semarize.com/v1/runs. Add your Semarize API key as a Bearer token. In the body, set kit_code to your Kit, mode to "sync", and map the transcript text into input.transcript.

4. Add a Formatter step to extract individual brick values from the Semarize JSON response - sentiment_score, risk_flag, key_topics, etc.

5. Add a Salesforce (or HubSpot, Sheets, etc.) Action to write the extracted scores and signals to your CRM record.

6. Test each step end-to-end by submitting a test transcription to AssemblyAI with your Zapier webhook URL, then turn on the Zap.

Watch out for: Zapier has step data size limits that can truncate very long transcripts. For recordings over 60 minutes, consider storing the transcript in cloud storage and passing a reference URL instead of inline text. Use mode: "sync" so Semarize returns results inline - Zapier doesn't natively support polling loops.
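
For readers who prefer to see the Semarize step as code, the sync request and brick extraction in the Zap above can be sketched as follows. The endpoint, Bearer auth, and body fields are taken from the example; the response shape (a top-level `bricks` object whose entries carry a `value`) is inferred from the Formatter paths and should be treated as an assumption. The helper names are hypothetical.

```python
import json
import urllib.request

SEMARIZE_RUNS_URL = "https://api.semarize.com/v1/runs"


def build_run_request(api_key, kit_code, transcript_text):
    """Build (but do not send) a sync POST /v1/runs request."""
    body = {
        "kit_code": kit_code,
        "mode": "sync",  # results come back inline, no polling needed
        "input": {"transcript": transcript_text},
    }
    return urllib.request.Request(
        SEMARIZE_RUNS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def extract_bricks(response, names):
    """Pull bricks.<name>.value for each name; missing bricks map to None."""
    bricks = response.get("bricks") or {}
    return {name: (bricks.get(name) or {}).get("value") for name in names}
```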
n8n - Self-hosted workflows

AssemblyAI → n8n → Semarize → Database

Poll AssemblyAI for completed transcripts on a schedule, fetch each one, send to Semarize for analysis, then write the structured scores and signals to your database. n8n's native loop support handles pagination and batch processing.

Example Workflow

1. Cron (Every Hour) - triggers the workflow on schedule
   Mode: Every Hour
   Timezone: UTC

2. HTTP Request (List Transcripts) - GET /v2/transcript (AssemblyAI)
   Method: GET
   URL: https://api.assemblyai.com/v2/transcript
   Header: authorization: <api_key>
   Params: limit=50, status=completed, after_id={{$lastId}}

For each transcript ID:

3. HTTP Request (Fetch Transcript) - GET /v2/transcript/{id} (AssemblyAI)
   URL: https://api.assemblyai.com/v2/transcript/{{$json.id}}

4. Code (Extract Text) - pull transcript text and metadata
   Extract: text, utterances, words

5. HTTP Request (Semarize) - POST /v1/runs (sync)
   URL: https://api.semarize.com/v1/runs
   Auth: Bearer smz_live_...
   Body: { kit_code, mode: "sync", input: { transcript } }
   Result: scores & signals returned

6. Postgres (Insert Row) - write structured output to database
   Table: transcript_evaluations
   Columns: transcript_id, score, risk_flag, topics

Setup steps

1. Add a Cron node as the workflow trigger. Set the interval to your desired polling frequency (hourly works well for most teams).

2. Add an HTTP Request node to list completed transcripts from AssemblyAI. Set method to GET, URL to https://api.assemblyai.com/v2/transcript, add your API key in the authorization header, and filter by status=completed.

3. Add a Split In Batches node to iterate over the returned transcript IDs. Inside the loop, add an HTTP Request node to fetch each full transcript via GET /v2/transcript/{id}.

4. Add a Code node (JavaScript) to extract the transcript text and any analysis features (utterances, sentiment, entities) from the AssemblyAI response.

5. Add another HTTP Request node to send the transcript to Semarize. Set method to POST, URL to https://api.semarize.com/v1/runs. Add your Semarize API key as a Bearer token. Set kit_code, mode to "sync", and map the transcript into input.transcript.

6. Add a Code node to extract the brick values from the Semarize response - overall_score, risk_flag, key_topics, evidence, confidence.

7. Add a Postgres (or MySQL / HTTP Request) node to write the structured output. Use transcript_id as the primary key for upserts.

8. Activate the workflow. Monitor the first few runs to verify Semarize responses are arriving and writing correctly.

Watch out for: Use transcript IDs as deduplication keys to prevent reprocessing. You can also use async mode with n8n's native loop - POST /v1/runs (default async), then poll GET /v1/runs/:runId with a Wait + IF loop until status is "succeeded".
Make - Visual automation with branching

AssemblyAI → Make → Semarize → CRM + Slack

Receive AssemblyAI completion webhooks, fetch full transcripts, send each to Semarize for structured analysis, then use a Router to branch the scored output - alert on risk flags via Slack and write all signals to your CRM.

Example Scenario

1. Webhook (Custom Webhook) - catches the AssemblyAI completion callback
   Payload: transcript_id, status

2. HTTP (Fetch Transcript) - GET /v2/transcript/{id} (AssemblyAI)
   Method: GET
   URL: https://api.assemblyai.com/v2/transcript/{{transcript_id}}
   Header: authorization: <api_key>

3. HTTP (Semarize) - POST /v1/runs (sync)
   URL: https://api.semarize.com/v1/runs
   Auth: Bearer smz_live_...
   Body: { kit_code, mode: "sync", input: { transcript } }
   Result: structured output

4. Router (Branch on Risk Flag) - route by Semarize output
   Branch 1: IF risk_flag.value = true
   Branch 2: ALL (fallthrough)

Branch 1 - Risk detected
   Slack (Alert Channel) - notify team about flagged transcript
   Channel: #transcript-alerts
   Message: Risk on {{transcript_id}}, score: {{score}}

Branch 2 - All transcripts
   Salesforce (Update Record) - write all scored signals to record
   Sentiment: {{sentiment_score}}
   Risk Flag: {{risk_flag}}
   Topics: {{key_topics}}

Setup steps

1. Create a new Scenario. Add a Custom Webhook module as the trigger. Copy the webhook URL - use this as the webhook_url when submitting transcriptions to AssemblyAI.

2. Add an HTTP module to fetch the full transcript from AssemblyAI. Set method to GET, URL to https://api.assemblyai.com/v2/transcript/{{transcript_id}}, and add your API key in the authorization header.

3. Add another HTTP module to send the transcript to Semarize. Set URL to https://api.semarize.com/v1/runs, add your Bearer token, and set kit_code, mode to "sync", and input.transcript from the previous step. Parse the response as JSON.

4. Add a Router module. Define Branch 1 with a filter: bricks.risk_flag.value equals true. Leave Branch 2 as a fallthrough (no filter).

5. On Branch 1, add a Slack module to alert your team when risk is detected. Map the score, risk flag, and transcript ID into the message.

6. On Branch 2, add a Salesforce module to write all brick values (sentiment_score, risk_flag, key_topics) to the appropriate record.

7. Test by submitting a transcription to AssemblyAI with the Make webhook URL. Verify the complete flow end-to-end.

8. Activate the scenario. Monitor the first few runs in Make's execution log.

Watch out for: Each API call counts as an operation. A scenario processing 50 transcripts uses ~150 operations (fetch + Semarize + write per transcript). Use mode: "sync" to avoid needing a polling loop for each run.
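
The Router logic and the operations estimate from this scenario can be expressed in a few lines. This is only an illustration of the branching described above: the destination names are placeholders, and the brick shape (`risk_flag` carrying a `value`) follows the filter in the setup steps.

```python
def route_destinations(bricks):
    """Mirror the Router: Slack only when risk is flagged, CRM for every transcript."""
    destinations = []
    risk = (bricks.get("risk_flag") or {}).get("value")
    if risk is True:
        destinations.append("slack")  # Branch 1: filter on risk_flag.value
    destinations.append("crm")        # Branch 2: fallthrough, no filter
    return destinations


def estimate_operations(transcripts, calls_per_transcript=3):
    """Rough operation count: fetch + Semarize + write per transcript."""
    return transcripts * calls_per_transcript
```

With the default of three calls per transcript, `estimate_operations(50)` gives the roughly 150 operations mentioned above; flagged transcripts that also trigger a Slack message would add to that figure.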

What you can build

What You Can Do With AssemblyAI Data in Semarize

AssemblyAI gives you transcripts. Semarize gives you structure. Together they support custom scoring, cross-source analysis, compliance auditing, and building your own tools on typed conversation data.

Multi-Tenant Compliance Framework Scoring

Per-Tenant Regulatory Evaluation

What Semarize generates

tenant_framework = "HIPAA"
disclosure_compliance = 0.92
policy_violations = 3
evidence_packages = 1,247

Your SaaS platform uses AssemblyAI to transcribe customer calls as a product feature. Each of your tenants has different compliance requirements. A healthcare client needs HIPAA disclosure verification. A financial services client needs suitability language checks. An insurance client needs claims handling procedure audits. Each tenant defines their own evaluation kit in Semarize, grounded against their own regulatory documents. Every call gets scored against the tenant’s specific rubric - with structured evidence packages per violation. A multi-tenant compliance system built on two APIs: AssemblyAI for transcription, Semarize for document-grounded evaluation.

Multi-Tenant QA Overview (3 tenants · 1,247 calls evaluated)

  • Healthcare - HIPAA Disclosure: 487 calls, 92%
  • Financial Services - Suitability Language: 412 calls, 88%
  • Insurance - Claims Handling: 348 calls, 79%

Each tenant scores calls against their own rubric · rubric v3.1

Knowledge-Grounded Agent Accuracy Verification

Factual Verification Against Source Documents

What Semarize generates

product_claim_accurate = false
policy_misstated = true
knowledge_gap_topic = "refund_policy"
agents_with_gap = 18

Your support team handles hundreds of calls daily. When agents quote return windows, warranty terms, or troubleshooting sequences, are they getting it right? Run a knowledge-grounded kit against your product documentation and policy handbook on every call. Semarize checks whether the return policy quoted was accurate, whether warranty terms matched the current document, and whether the troubleshooting sequence followed the approved guide. After scoring 5,000 calls, the data shows 18 agents consistently misstate the refund policy. Training targets the exact knowledge gap instead of running generic refreshers.

Structured Feature Pipeline (10,000 labelled in 1 week)

AssemblyAI Transcript → Semarize Bricks → ML Features

Feature Columns

  • budget_mentioned (bool): true
  • decision_maker (bool): true
  • urgency_level (enum): high
  • competitive_pressure (float): 0.67

Model accuracy (propensity-to-buy): 89.3%

Curriculum Adherence Scoring

Training Content Quality Evaluation

What Semarize generates

curriculum_coverage = 0.85
topic_missed = "objection_handling"
accuracy_vs_playbook = 0.91
engagement_level = "high"

Your enablement team records training sessions and onboarding workshops. Instead of manually reviewing 2-hour recordings, run a curriculum adherence kit grounded against the actual training document. Semarize checks whether the trainer covered all required topics, whether roleplay exercises met the rubric, and flags any statements that contradict the official playbook. New hires get a structured skills report after each session - and trainers get feedback on what they missed. After scoring 50 sessions, enablement discovers that objection handling gets skipped in 40% of new hire training. Targeted intervention improves new rep ramp time by 3 weeks.

Content Quality Scorecard (50 episodes / week)

Episode                           | Depth | Engage | Value     | Downloads
Ep. 142: Deep Dive on AI Ops      | 91    | 0.84   | 4.1/10min | 12.4K
Ep. 143: Quick Takes on Cloud     | 64    | 0.52   | 1.8/10min | 5.1K
Ep. 144: Future of DevEx          | 82    | 0.71   | 3.4/10min | 9.7K

Episodes scoring >80 on value density get 2.4x more downloads
brand_safety: clear

Custom Conversation Intelligence Pipeline

Your Signals, Your Schema, Your Warehouse

Vibe-coded

What Semarize generates

daily_volume = 500+
custom_fields = 8 typed
pipeline_latency = "< 3min"
storage = "Snowflake"

A data engineer vibe-codes an Airflow DAG that processes every call: AssemblyAI for transcription, Semarize for structured evaluation. The DAG handles 500+ calls per day. Each call lands in Snowflake with YOUR custom typed columns - fields that don’t exist in any platform’s native output: playbook_adherence (float), competitive_claim_accuracy (bool), pricing_error_detected (bool), coaching_priority (varchar), deal_qualification_score (float). dbt models build derived tables: weekly accuracy reports, coaching signal dashboards, and competitive intelligence trends. The BI team builds dashboards on conversation intelligence that’s fully custom, fully owned, and fully queryable.

Voice-of-Customer Dashboard (Vibe-coded)

  • API rate limits | P5 | enterprise
  • Bulk export CSV | P3 | mid-market
  • SSO integration | P4 | enterprise
  • Custom webhooks | P2 | SMB

Trending This Week: “API rate limits” (5 enterprise accounts)
Churn Risk: 34% (3 accounts flagged)

Watch out for

Common Challenges & Gotchas

These are the issues that come up most often when teams start extracting transcripts from AssemblyAI at scale.

Transcript processing is asynchronous

Unlike synchronous APIs, AssemblyAI processes audio asynchronously. You submit a job, then must poll for completion or use a webhook. Build your pipeline to handle this two-step pattern - submit, then retrieve.

90-day retention window

Transcripts are automatically deleted after 90 days. If you don't export them within that window, the data is gone. Set up automated extraction early to avoid losing historical transcripts.

Audio URL accessibility

AssemblyAI requires a publicly accessible URL or an uploaded file to transcribe. If your audio is behind authentication or on a private network, you'll need to upload it to AssemblyAI's servers first via their upload endpoint.
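
A sketch of the upload path for private audio follows. It assumes the upload endpoint is POST /v2/upload, accepts the raw audio bytes as the request body, and responds with JSON containing an upload_url that you then pass as audio_url; confirm these details against the API reference. The `opener` parameter is a hypothetical seam so the function can be exercised without a network call.

```python
import json
import urllib.request

UPLOAD_URL = "https://api.assemblyai.com/v2/upload"  # assumed upload endpoint


def build_upload_request(audio_bytes, api_key):
    """Build (but do not send) a request that uploads raw audio bytes."""
    return urllib.request.Request(
        UPLOAD_URL,
        data=audio_bytes,
        headers={"authorization": api_key,
                 "Content-Type": "application/octet-stream"},
        method="POST",
    )


def upload_audio(audio_bytes, api_key, opener=urllib.request.urlopen):
    """Upload audio and return the URL to use as audio_url when submitting."""
    with opener(build_upload_request(audio_bytes, api_key)) as resp:
        return json.load(resp)["upload_url"]  # assumed response field
```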

Feature flags affect output shape

The JSON response structure changes based on which features you enable (diarization, sentiment, entity detection, etc.). Your downstream parser needs to handle different response shapes depending on the transcription configuration.
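
One way to cope with varying response shapes is to normalize every transcript into a fixed record before it reaches downstream code. In this sketch, `words` and `utterances` come from the response example earlier in this guide; the other field names (`sentiment_analysis_results`, `entities`, `chapters`) are assumptions about how the optional features appear in the response and should be checked against the API reference.

```python
# Fields that may be absent or null depending on which features were enabled.
OPTIONAL_LIST_FIELDS = (
    "words",
    "utterances",
    "sentiment_analysis_results",  # assumed field name
    "entities",                    # assumed field name
    "chapters",                    # assumed field name
)


def normalize_transcript(raw):
    """Return a record with the same keys regardless of enabled features."""
    record = {
        "id": raw.get("id"),
        "status": raw.get("status"),
        "text": raw.get("text") or "",
    }
    for field in OPTIONAL_LIST_FIELDS:
        # Missing keys and explicit nulls both become empty lists.
        record[field] = raw.get(field) or []
    return record
```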

Rate limits on concurrent requests

AssemblyAI enforces limits on concurrent transcription jobs and API requests per minute. For bulk operations, queue submissions and process completions as they arrive rather than submitting everything at once.

Confidence varies by audio quality

Low-quality audio (phone recordings, noisy environments, heavy accents) produces lower confidence scores and more transcription errors. Monitor per-word confidence scores and flag transcripts below your quality threshold.
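
A simple quality gate over the words array might look like this. The threshold values are illustrative placeholders, not recommendations; tune them against transcripts you have reviewed by hand.

```python
def low_confidence_ratio(transcript, threshold=0.8):
    """Fraction of words whose confidence falls below `threshold`."""
    words = transcript.get("words") or []
    if not words:
        return 0.0
    low = sum(1 for word in words if word.get("confidence", 0.0) < threshold)
    return low / len(words)


def needs_review(transcript, threshold=0.8, max_low_ratio=0.2):
    """Flag a transcript when too many words are below the confidence threshold."""
    return low_confidence_ratio(transcript, threshold) > max_low_ratio
```

For example, a transcript where one word in three scores below 0.8 has a ratio of about 0.33 and would be flagged under the default 0.2 limit.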

Duplicate processing without idempotency

Submitting the same audio file twice creates two separate transcript jobs. Without deduplication logic, you can end up processing and storing the same content multiple times. Use audio file hashes or source IDs as dedup keys.
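
Hash-based deduplication can be sketched as follows. The `registry` here is a plain dict standing in for whatever persistent store you use, and `submit` is a hypothetical callable that sends the audio for transcription and returns the new transcript ID.

```python
import hashlib


def audio_fingerprint(audio_bytes):
    """Content hash of the audio, used as the deduplication key."""
    return hashlib.sha256(audio_bytes).hexdigest()


def submit_once(audio_bytes, registry, submit):
    """Submit audio only if its fingerprint is new; return the transcript ID."""
    key = audio_fingerprint(audio_bytes)
    if key in registry:
        return registry[key]  # same content already submitted
    transcript_id = submit(audio_bytes)
    registry[key] = transcript_id
    return transcript_id
```

A content hash only catches byte-identical files; re-encoded copies of the same recording will hash differently, which is where source IDs are the better key.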
