Audio Transcribe API Reference
Complete reference for the POST /media/audio/transcribe endpoint — capability gating, multipart upload, settings, and response format.
Overview
The POST /media/audio/transcribe endpoint transcribes audio files to text.
It is a node-scoped runtime endpoint that reuses the platform's existing transcription pipeline, including
usage tracking and cost attribution.
The endpoint requires the media.audio and media.audio.transcribe
capabilities to be enabled on the node. If either is disabled, the endpoint returns 404 Not Found (the capability's existence is not leaked).
Every node is addressable in two ways:
https://YOUR-NODE-ALIAS.interlocute.ai/media/audio/transcribe
Subdomain-based routing. The node is resolved from the subdomain automatically.
https://api.interlocute.ai/nodes/{nodeId}/media/audio/transcribe
Path-based routing. Pass the node ID directly in the URL.
Authentication
The endpoint requires authentication via Bearer token — either a JWT or a tenant/node-scoped API key. Anonymous access is not supported for media endpoints.
Authorization: Bearer YOUR_API_KEY
Capability gating
Two capabilities must be enabled on the node for this endpoint to respond:
media.audio
Base audio media capability — gates all audio sub-features.
media.audio.transcribe
Transcription sub-capability.
If either capability is disabled, the endpoint returns 404 Not Found. This is an intentional security measure: the existence of the capability is not disclosed when it is off.
Request
The request must be multipart/form-data with a required file part named audio.
Form fields
audio
Required — The audio file to transcribe.
Accepted types: audio/mpeg, audio/wav, audio/mp4, audio/webm, audio/ogg, audio/mp3, audio/m4a. Max 25 MB.
language
Optional — Language code hint (e.g. en, fr).
Improves accuracy when the spoken language is known in advance.
format
Optional — Response format. Values: text (default) or segments.
When set to segments, the response includes timed segments with start/end offsets.
maxSeconds
Optional — Maximum audio duration in seconds to process.
Must be ≤ the node's configured limit (default 60s, hard limit 300s). Returns 400 if exceeded.
diarization
Optional — Boolean. Speaker diarization (placeholder, currently ignored).
Response (200 OK)
{
"text": "Hello, this is a test recording.",
"segments": [
{ "id": 0, "start": 0.0, "end": 2.5, "text": "Hello, this is" },
{ "id": 1, "start": 2.5, "end": 4.1, "text": "a test recording." }
],
"language": "en",
"requestId": "a1b2c3d4-...",
"provider": "OpenAI",
"model": "whisper-1",
"durationMs": 1234
}
text
The full transcribed text.
segments
Optional timed segments (only when format=segments).
language
Language code, if provided in the request.
requestId
Correlation ID for tracking and support.
durationMs
Server-side processing time in milliseconds.
Error responses
400
Missing audio file, unsupported content type, maxSeconds exceeded, or malformed form data.
401
Missing or invalid API key / JWT.
403
API key is node-scoped and does not match the target node.
404
Node not found, or media.audio / media.audio.transcribe capability is disabled.
500
Transcription provider error.
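Clients may want to surface these status codes as typed errors. A sketch under the semantics listed above; the exception names are illustrative, not part of the API:

```python
class TranscribeError(Exception):
    """Base error for non-200 responses from the transcribe endpoint."""

class BadRequest(TranscribeError): pass        # 400
class Unauthorized(TranscribeError): pass      # 401
class Forbidden(TranscribeError): pass         # 403
class NotFound(TranscribeError): pass          # 404
class ProviderError(TranscribeError): pass     # 500

_ERRORS = {400: BadRequest, 401: Unauthorized, 403: Forbidden,
           404: NotFound, 500: ProviderError}

def raise_for_status(status: int) -> None:
    """Raise a typed error for non-200 responses.

    404 is deliberately ambiguous: it can mean the node does not exist
    OR that media.audio / media.audio.transcribe is disabled — the API
    does not distinguish the two.
    """
    if status == 200:
        return
    raise _ERRORS.get(status, TranscribeError)(f"HTTP {status}")
```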
cURL example
curl -X POST \
https://YOUR-NODE-ALIAS.interlocute.ai/media/audio/transcribe \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "audio=@recording.mp3" \
-F "language=en" \
-F "format=segments"
Placeholder sub-capabilities
The following audio sub-capabilities are registered in the capability model but are not yet implemented. They have no runtime endpoints in v1, so requests to them return 404 even when the capability is enabled in configuration.
media.audio.sentiment
Sentiment analysis on audio content.
media.audio.summary
Audio content summarization.
media.audio.languageDetect
Spoken language detection.