Audio Transcribe API Reference
Complete reference for the POST /media/audio/transcribe endpoint — capability gating, multipart upload, settings, and response format.
Overview
The POST /media/audio/transcribe endpoint transcribes audio files to text.
It is a node-scoped runtime endpoint that reuses the platform's existing transcription pipeline, including
usage tracking and cost attribution.
The endpoint requires the media.audio and media.audio.transcribe
capabilities to be enabled on the node. If either is disabled, the endpoint returns 404 Not Found (the capability's existence is not leaked).
Every node is addressable in two ways:
https://YOUR-NODE-ALIAS.interlocute.ai/media/audio/transcribe
Subdomain-based routing. The node is resolved from the subdomain automatically.
https://api.interlocute.ai/nodes/{nodeId}/media/audio/transcribe
Path-based routing. Pass the node ID directly in the URL.
Authentication
The endpoint requires authentication via Bearer token — either a JWT or a tenant/node-scoped API key. Anonymous access is not supported for media endpoints.
Authorization: Bearer YOUR_API_KEY
Capability gating
Two capabilities must be enabled on the node for this endpoint to respond:
media.audio
Base audio media capability — gates all audio sub-features.
media.audio.transcribe
Transcription sub-capability.
If either capability is disabled, the endpoint returns 404 Not Found. This is an intentional security measure: the existence of the capability is not disclosed when it is off.
Request
The request must be multipart/form-data with a required file part named audio.
Form fields
audio
Required — The audio file to transcribe.
Accepted types: audio/mpeg, audio/wav, audio/mp4, audio/webm, audio/ogg, audio/mp3, audio/m4a. Max 25 MB.
language
Optional — Language code hint (e.g. en, fr).
Improves accuracy when the spoken language is known in advance.
format
Optional — Response format. Values: text (default) or segments.
When set to segments, the response includes timed segments with start/end offsets.
maxSeconds
Optional — Maximum audio duration in seconds to process.
Must be ≤ the node's configured limit (default 60s, hard limit 300s). Returns 400 if exceeded.
diarization
Optional — Boolean. Speaker diarization (placeholder, currently ignored).
Response (200 OK)
{
"text": "Hello, this is a test recording.",
"segments": [
{ "id": 0, "start": 0.0, "end": 2.5, "text": "Hello, this is" },
{ "id": 1, "start": 2.5, "end": 4.1, "text": "a test recording." }
],
"language": "en",
"requestId": "a1b2c3d4-...",
"provider": "OpenAI",
"model": "whisper-1",
"durationMs": 1234
}
text
The full transcribed text.
segments
Optional timed segments (only when format=segments).
language
Language code, if provided in the request.
requestId
Correlation ID for tracking and support.
durationMs
Server-side processing time in milliseconds.
Error responses
400
Missing audio file, unsupported content type, maxSeconds exceeded, or malformed form data.
401
Missing or invalid API key / JWT.
403
API key is node-scoped and does not match the target node.
404
Node not found, or media.audio / media.audio.transcribe capability is disabled.
500
Transcription provider error.
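Clients may want to surface these status codes as typed errors. A sketch under the semantics listed above; the exception names are illustrative, not part of the API:

```python
class TranscribeError(Exception):
    """Base error for non-200 responses from the transcribe endpoint."""

class BadRequest(TranscribeError): pass        # 400
class Unauthorized(TranscribeError): pass      # 401
class Forbidden(TranscribeError): pass         # 403
class NotFound(TranscribeError): pass          # 404
class ProviderError(TranscribeError): pass     # 500

_ERRORS = {400: BadRequest, 401: Unauthorized, 403: Forbidden,
           404: NotFound, 500: ProviderError}

def raise_for_status(status: int) -> None:
    """Raise a typed error for non-200 responses.

    404 is deliberately ambiguous: it can mean the node does not exist
    OR that media.audio / media.audio.transcribe is disabled — the API
    does not distinguish the two.
    """
    if status == 200:
        return
    raise _ERRORS.get(status, TranscribeError)(f"HTTP {status}")
```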
cURL example
curl -X POST \
https://YOUR-NODE-ALIAS.interlocute.ai/media/audio/transcribe \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "audio=@recording.mp3" \
-F "language=en" \
-F "format=segments"
Placeholder sub-capabilities
The following audio sub-capabilities are registered in the capability model but are not yet implemented. They have no runtime endpoints in v1, so requests to them return 404 even when the capability is enabled in configuration.
media.audio.sentiment
Sentiment analysis on audio content.
media.audio.summary
Audio content summarization.
media.audio.languageDetect
Spoken language detection.