interlocute.ai beta
Coming Soon

Video Intelligence

Submit a video and receive a structured, searchable index. Three composable profiles — Speech, Visual, and Insights — let you pay only for the signals you need. Combine them freely for full-spectrum analysis.

Speech profile — what was said

The Speech profile extracts a time-aligned transcript with speaker diarization, confidence scores, and automatic language detection. Export captions in VTT, SRT, TTML, or plain text. Each segment includes a speaker ID, so your node knows who said what and when.
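For a sense of what caption export looks like downstream, here is a minimal sketch that renders diarized segments as WebVTT. The segment fields (`start`, `end`, `speaker_id`, `text`) are illustrative stand-ins, not the actual Interlocute response schema.

```python
# Sketch: formatting Speech-profile segments as WebVTT captions.
# Field names below are assumptions for illustration only.

def vtt_timestamp(seconds: float) -> str:
    """Format seconds as an HH:MM:SS.mmm WebVTT timestamp."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def to_vtt(segments: list[dict]) -> str:
    """Render diarized segments as a WebVTT document with speaker voice tags."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{vtt_timestamp(seg['start'])} --> {vtt_timestamp(seg['end'])}")
        lines.append(f"<v {seg['speaker_id']}>{seg['text']}")
        lines.append("")
    return "\n".join(lines)

captions = to_vtt([
    {"start": 0.0, "end": 2.5, "speaker_id": "Speaker 1", "text": "Welcome back."},
    {"start": 2.5, "end": 6.0, "speaker_id": "Speaker 2", "text": "Thanks for having me."},
])
```

Because each cue carries a speaker voice tag, a downstream player can style or filter captions per speaker.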

Visual profile — what was seen

The Visual profile detects shots and scenes with keyframe extraction, reads on-screen text via OCR, and identifies labels and objects in the video frames. The result is a structured timeline your node can query, cite, and reason over.
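The kind of point-in-time query a node might run over that timeline can be sketched as follows. The event shape (`kind`, `start`, `end`, `value`) is an assumption for illustration, not the platform's actual index schema.

```python
# Sketch: a queryable timeline built from Visual-profile output.
# The TimelineEvent shape is an illustrative assumption.
from dataclasses import dataclass

@dataclass
class TimelineEvent:
    kind: str     # e.g. "shot", "ocr", "label"
    start: float  # seconds
    end: float    # seconds
    value: str    # keyframe ID, OCR text, or label name

def events_at(timeline: list[TimelineEvent], t: float) -> list[TimelineEvent]:
    """Return every event whose time span covers t."""
    return [e for e in timeline if e.start <= t < e.end]

timeline = [
    TimelineEvent("shot", 0.0, 4.2, "keyframe_001"),
    TimelineEvent("ocr", 1.0, 3.0, "Q3 Revenue"),
    TimelineEvent("label", 0.0, 4.2, "whiteboard"),
]
hits = events_at(timeline, 2.0)
```

At t = 2.0 seconds the shot, the on-screen text, and the label all overlap, so a node can cite all three as evidence for the same moment.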

Insights profile — what it means

The Insights profile extracts named entities, topics, keywords, and people. It adds time-bounded sentiment and emotion segments, flags audio events (applause, silence, music), and runs content-safety analysis — giving your node a semantic understanding of the entire video.
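One simple use of the time-bounded sentiment segments is totaling screen time per sentiment label, sketched below. The segment fields are illustrative, not the documented response format.

```python
# Sketch: aggregating time-bounded Insights sentiment segments into
# per-sentiment screen time. Field names are illustrative assumptions.
from collections import defaultdict

def sentiment_durations(segments: list[dict]) -> dict[str, float]:
    """Sum total seconds per sentiment label across all segments."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg["sentiment"]] += seg["end"] - seg["start"]
    return dict(totals)

totals = sentiment_durations([
    {"sentiment": "Positive", "start": 0.0, "end": 12.0},
    {"sentiment": "Neutral", "start": 12.0, "end": 20.0},
    {"sentiment": "Positive", "start": 20.0, "end": 25.0},
])
dominant = max(totals, key=totals.get)  # sentiment with the most screen time
```

The same aggregation works for emotion segments or audio events, since all three share the time-bounded segment shape.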

Composable profiles

Profiles combine freely. Request Speech + Insights for a meeting transcription with sentiment. Request Visual + Insights for a marketing video analysis. Or combine all three for full-spectrum indexing. The platform merges signal sets automatically — you never pay for duplicate work.
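The merge behavior described above is a set union over signal sets, sketched here. The exact signal names per profile are illustrative, not the official list.

```python
# Sketch: merging profile signal sets as a union, mirroring the
# "never pay for duplicate work" behavior. Signal names are
# illustrative assumptions, not the official signal list.
PROFILE_SIGNALS = {
    "Speech": {"transcript", "captions", "diarization", "language"},
    "Visual": {"shots", "scenes", "keyframes", "ocr", "labels", "objects"},
    "Insights": {"entities", "topics", "sentiment", "emotion", "audio_events", "safety"},
}

def merged_signals(profiles: list[str]) -> set[str]:
    """Union of signal sets: any overlapping signal is computed once."""
    out: set[str] = set()
    for p in profiles:
        out |= PROFILE_SIGNALS[p]
    return out

combo = merged_signals(["Speech", "Insights"])
```

Because the result is a union, requesting a signal through two profiles at once never doubles the work billed.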

AI-generated summaries

Request a textual summary with configurable length (Short, Medium, Long), style (Neutral, Casual, Formal), and custom instructions. Summaries can incorporate keyframe images when a vision-capable model is connected. Submit, poll, retrieve — fully async.

Streaming & playback

Indexed videos produce embeddable player URLs, streaming endpoints, and thumbnail base URLs. Serve preview players to end-users or extract keyframe thumbnails for UI cards — all from a single API call.
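As a sketch of how a UI might consume the thumbnail base URL, the helper below joins keyframe IDs onto it. The URL layout (base URL plus `/{thumbnail_id}`) is an assumed convention for illustration, not the documented endpoint shape.

```python
# Sketch: deriving keyframe thumbnail URLs from a thumbnail base URL.
# The path convention here is an assumption for illustration only.
def thumbnail_urls(base_url: str, thumbnail_ids: list[str]) -> list[str]:
    """Join each keyframe thumbnail ID onto the base URL."""
    return [f"{base_url.rstrip('/')}/{tid}" for tid in thumbnail_ids]

urls = thumbnail_urls(
    "https://cdn.example.com/videos/vid-123/thumbnails/",
    ["kf_001", "kf_002"],
)
```

Each resulting URL can back a UI card or a scrubber preview without a separate API round trip per thumbnail.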

Frequently Asked Questions

Video Intelligence

What video formats are supported?
Interlocute supports common video formats including MP4, WebM, and MOV. Videos can be submitted via URL fetch, blob storage upload, or direct upload depending on your use case.
What are the three indexing profiles?
Speech extracts transcripts and captions. Visual detects shots, scenes, OCR text, labels, and objects. Insights extracts entities, topics, sentiment, emotion, audio events, and safety signals. You can combine any or all of them in a single request.
How does profile combination work?
When you select multiple profiles, the platform takes the union of their signal sets and selects the appropriate provider preset automatically. You are billed for the combined analysis, not per-profile — so combining Speech + Insights costs less than running them separately.
Can I get sentiment or emotion analysis?
Yes, via the Insights profile. The index includes time-bounded sentiment segments (Positive, Negative, Neutral) and emotion segments (Joy, Sadness, Anger, Fear, Surprise, Disgust).
How do AI-generated video summaries work?
You request a textual summary specifying length, style, and optional custom instructions. The summary is generated asynchronously — you submit the request, poll for completion, and retrieve the result. Keyframe images can be included for visual context.
Is video processing metered?
Yes. Video indexing operations are metered by duration and complexity. Every submission and result retrieval is logged in your usage ledger with full attribution to the node and API key.
Can I re-index a video with a different profile?
Yes. The ReIndex operation lets you re-process an existing video with a different profile combination — for example, adding Insights to a video that was originally indexed with Speech only — without re-uploading the source file.
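The decision of when a ReIndex call is actually needed reduces to a subset check, sketched here. The function and its inputs are illustrative, not the actual API.

```python
# Sketch: deciding whether a ReIndex call is needed for a request.
# Names are illustrative assumptions, not the real API surface.
def needs_reindex(indexed: set[str], requested: set[str]) -> bool:
    """ReIndex only when the request includes profiles not yet indexed."""
    return not requested <= indexed

# A Speech-only video needs a ReIndex to add Insights...
add_insights = needs_reindex({"Speech"}, {"Speech", "Insights"})
# ...but requesting Speech against a Speech + Insights index does not.
already_covered = needs_reindex({"Speech", "Insights"}, {"Speech"})
```

Skipping the call when the requested profiles are already covered avoids paying for analysis the index already contains.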

Ready to build with Video Intelligence?

Deploy your node in seconds and start using Video Intelligence today.