In-Video AI

AI video intelligence as one API.

Auto-clipping. Multimodal search. Scene + object detection. Captions and translations in 10+ languages. Built into FastPix Agents: Notes, Clipping, Breaking News.

curl -X POST https://api.fastpix.com/v1/on-demand \
     -H "Content-Type: application/json" \
     -u "<username>:<password>" \
     -d '{
  "inputs": [
    {
      "type": "video",
      "url": "https://static.fastpix.com/fp-sample-video.mp4"
    }
  ],
  "accessPolicy": "public",
  "metadata": {
    "key1": "value1"
  },
  "maxResolution": "1080p",
  "mediaQuality": "standard",
  "moderation": {
    "type": "video"
  },
  "namedEntities": true,
  "chapters": true,
  "generate": true
}'

9 features per asset, one call
Native, not bolt-on, no second pipeline
Same billing as upload + encode

Knovo built lecture search on In-Video AI. MyClassboard uses chapter detection + transcripts for K-12 content. Multiple OTT customers use NSFW moderation as table stakes.

Why In-Video AI exists

Three reasons it ships in the core API, not as a second pipeline.

One API call, not two.

Set in_video_ai flags on the same asset.create call that uploads and encodes. Results are ready when encoding finishes. No second model to call, no second invoice.
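
As a rough illustration of the single-call idea, the sketch below assembles one request body that carries both the encoding options and the AI flags. The field names (`inputs`, `accessPolicy`, `maxResolution`, `namedEntities`, `chapters`, `moderation`) are copied from the curl sample on this page. The helper name `build_create_payload` and its parameters are assumptions for illustration, and nothing is sent over the network.

```python
import json


def build_create_payload(video_url, chapters=False, named_entities=False,
                         moderation=False, max_resolution="1080p"):
    """Assemble one request body with encoding options and AI flags together.

    Field names mirror the curl sample on this page. The helper itself is
    hypothetical: it only builds a dict and never calls the API.
    """
    payload = {
        "inputs": [{"type": "video", "url": video_url}],
        "accessPolicy": "public",
        "maxResolution": max_resolution,
    }
    # Only include the flags that were switched on, so unused features are
    # simply absent from the request.
    if chapters:
        payload["chapters"] = True
    if named_entities:
        payload["namedEntities"] = True
    if moderation:
        payload["moderation"] = {"type": "video"}
    return payload


if __name__ == "__main__":
    body = build_create_payload(
        "https://static.fastpix.com/fp-sample-video.mp4",
        chapters=True,
        named_entities=True,
    )
    print(json.dumps(body, indent=2))
```

The printed JSON could be passed as the `-d` body of the curl request; check the API reference for the authoritative field list before relying on it.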

No model picking required.

We pick and tune the underlying model. Scene analysis, chapters, transcripts, search, summary, NER, moderation. You get features, not a model gateway.

Same billing as the rest.

AI features are priced per-minute alongside encoding. No separate AI invoice. No surprise per-token billing.

How it works

Nine AI features, one toggle.

TL;DR: 3 steps from file to AI output

PATCH AI_flags

Pick which features you want at upload time. Flags live on the same asset.create call.

AI runs inline

AI processing runs as part of the encoding pipeline. No separate model call.

Receive structured output

Webhook fires with structured JSON: chapters, transcript, summary, NER, moderation flags.
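
One possible way to consume these webhooks is to route each event by its `type` field. The sketch below assumes the event envelope matches the moderation sample on this page (`type`, `object.id`, `data`). Only the `video.mediaAI.moderation.ready` type appears on this page; the `on` decorator and `dispatch` function are hypothetical names, and signature verification is left out.

```python
import json

HANDLERS = {}


def on(event_type):
    """Register a handler function for one webhook event type."""
    def register(func):
        HANDLERS[event_type] = func
        return func
    return register


@on("video.mediaAI.moderation.ready")
def handle_moderation(event):
    # Field names follow the sample moderation payload on this page.
    return {
        "media_id": event["object"]["id"],
        "scores": event["data"]["moderationResult"]["categoryScores"],
    }


def dispatch(raw_body):
    """Parse a webhook body and route it by its `type` field.

    Returns the handler's result, or None for event types with no handler.
    """
    event = json.loads(raw_body)
    handler = HANDLERS.get(event.get("type"))
    return handler(event) if handler else None
```

Handlers for chapters, transcripts, or summaries would register the same way once their event names are confirmed in the webhook reference.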

Speaker-attributed diarization + transcripts

Full transcript exported as text, VTT, or SRT. Display via FastPix Player or any HLS player.

30+ languages supported
Full transcript exported as text, VTT, or SRT
Native-language transcripts generated from the original audio
Speaker diarization (who said what, when)

languages: ['en', 'es', 'pt', 'hi']
format: VTT
speaker_diarization: true
avg_accuracy: 97.2%
download_url: /transcripts/abc123.vtt
editable: via dashboard API
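
To make the VTT output concrete, here is a minimal sketch of how diarized segments map onto WebVTT cues, with each speaker carried in a WebVTT voice tag. The segment shape (`start`, `end`, `speaker`, `text`) is an assumption for illustration and is not the documented transcript schema.

```python
def format_timestamp(seconds):
    """Render seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_vtt(segments):
    """Convert diarized segments into a WebVTT document.

    Each segment is assumed to be a dict with `start` and `end` in seconds,
    plus `speaker` and `text`. The speaker goes into a voice tag (<v ...>).
    """
    lines = ["WEBVTT", ""]
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"{start} --> {end}")
        lines.append(f"<v {seg['speaker']}>{seg['text']}")
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)


if __name__ == "__main__":
    print(segments_to_vtt([
        {"start": 0, "end": 2.5, "speaker": "Speaker 1", "text": "Welcome back."},
        {"start": 2.5, "end": 5, "speaker": "Speaker 2", "text": "Thanks for having me."},
    ]))
```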

Auto-chapters + structured entities.

Chapters detected from visual + audio cues. Each chapter is queryable. Named entities (people, places, products, dates) are returned as structured JSON. Use them as preview text, recommendation seeds, or search indexing.

Chapters with title + timestamp
Returns matches with timestamps
Named Entity Recognition (people, orgs, places, products)
Powers in-product video help, lecture review, support libraries

curl -X POST https://api.fastpix.com/v1/on-demand \
     -H "Content-Type: application/json" \
     -u "<username>:<password>" \
     -d '{
  "inputs": [
    {
      "type": "video",
      "url": "https://static.fastpix.com/fp-sample-video.mp4"
    }
  ],
  "accessPolicy": "public",
  "metadata": {
    "key1": "value1"
  },
  "maxResolution": "1080p",
  "mediaQuality": "standard",
  "namedEntities": true,
  "chapters": true
}'
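
Once chapters come back with a title and timestamp, a player can show the current chapter for any playback position. The sketch below assumes a list of chapters with `title` and `start` (in seconds), sorted by start time; that shape is an assumption, not the documented response format.

```python
from bisect import bisect_right


def chapter_at(chapters, position_seconds):
    """Return the chapter that contains a playback position, or None.

    `chapters` is assumed to be a list of dicts with `title` and `start`
    (seconds), sorted by `start`. A chapter runs until the next one begins.
    """
    starts = [c["start"] for c in chapters]
    index = bisect_right(starts, position_seconds) - 1
    return chapters[index] if index >= 0 else None


if __name__ == "__main__":
    chapters = [
        {"title": "Intro", "start": 0},
        {"title": "Setup", "start": 95},
        {"title": "Demo", "start": 310},
    ]
    print(chapter_at(chapters, 100)["title"])
```

Binary search keeps the lookup cheap even when it runs on every player time update.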

AI Video clipping + AI Reframing + Scene analysis

Detect high-engagement moments using scene analysis and make those clips available for social distribution. Reframe clips for vertical, square, and landscape formats.

Generate short clips based on topics, keywords, or speaker activity
Reframe videos for vertical (9:16), square (1:1), and landscape (16:9) formats
Optimized for Shorts, Reels, TikTok, and social distribution
Returned as clips with timestamps and metadata
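
For a sense of what the three target formats mean in pixels, the sketch below computes a plain centered crop for a given aspect ratio. This is only the geometry: it is not the AI reframing described above, which this page does not specify, and the function name `center_crop` is an assumption.

```python
def center_crop(width, height, aspect_w, aspect_h):
    """Return (x, y, crop_w, crop_h) for the largest centered crop of a
    width x height frame that matches aspect_w:aspect_h.

    Crop dimensions are rounded down to even numbers, which encoders using
    4:2:0 chroma subsampling generally require.
    """
    if width * aspect_h > height * aspect_w:
        # Source is wider than the target: keep full height, trim the sides.
        crop_h = height
        crop_w = height * aspect_w // aspect_h
    else:
        # Source is taller or already matches: keep full width.
        crop_w = width
        crop_h = width * aspect_h // aspect_w
    crop_w -= crop_w % 2
    crop_h -= crop_h % 2
    x = (width - crop_w) // 2
    y = (height - crop_h) // 2
    return x, y, crop_w, crop_h


if __name__ == "__main__":
    for name, (aw, ah) in {"vertical": (9, 16), "square": (1, 1), "landscape": (16, 9)}.items():
        print(name, center_crop(1920, 1080, aw, ah))
```

On a 1920x1080 source, the vertical crop keeps a 606x1080 window, which shows how much of a landscape frame is lost without subject-aware reframing.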

summary: 200-word
entities: [{type: 'PERSON', name: 'Sarah'}, ...]
topics: ['live streaming', 'monetization']
nsfw_score: 0.02 (safe)
scene_changes: 12 detected
language_detected: en-US

AI Video Search + Video moderation

NSFW detection, policy violation flags, scene change detection, and content classification on every asset. The threshold is tunable. AI video search returns the exact second a topic was mentioned.

NSFW classifier with threshold
Policy violation flags (configurable per platform)
AI video search via /search endpoint
Webhook fires immediately on flag

{
  "type": "video.mediaAI.moderation.ready",
  "object": {
    "type": "media",
    "id": "c9ed7167-16e8-45a9-a1ad-170489a94785"
  },
  "id": "44ba6038-bb03-4517-b17c-db27c6c10836",
  "workspace": {
    "name": "Dashboard videos",
    "id": "4fa0e115-9209-4b7f-b498-39c750c82bc4"
  },
  "status": "ready",
  "data": {
    "isModerationGenerated": true,
    "moderationResult": {
      "categoryScores": []
    }
  },
  "createdAt": "2026-06-25T10:10:29.052566688Z",
  "attempts": []
}
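
A handler for this event might compare each category score against your own threshold. The sketch below reads `data.moderationResult.categoryScores` as shown in the sample. The shape of each entry (`category`, `score`) is an assumption, since the sample shows an empty list, and the default threshold of 0.8 is arbitrary.

```python
def flagged_categories(event, threshold=0.8):
    """Return the moderation categories whose score meets the threshold.

    `event` is a parsed webhook payload shaped like the sample above. The
    entry shape {"category": ..., "score": ...} is an assumption; the sample
    shows `categoryScores` as an empty list.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    scores = event["data"]["moderationResult"]["categoryScores"]
    return [s["category"] for s in scores if s["score"] >= threshold]


if __name__ == "__main__":
    event = {
        "type": "video.mediaAI.moderation.ready",
        "data": {
            "moderationResult": {
                "categoryScores": [
                    {"category": "nsfw", "score": 0.02},
                    {"category": "violence", "score": 0.91},
                ]
            }
        },
    }
    print(flagged_categories(event))
```

An empty `categoryScores` list, as in the sample payload, simply yields no flags.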

Security, compliance, and partnerships

Partner: NVIDIA Inception

Partner: Google Cloud Partner

Customers

“AI video search across 200+ hours of educational video was a whole separate roadmap for us. With In-Video AI, the search endpoint shipped with our standard upload flow. No second pipeline.”

Knovo product team

Microlearning AI search

“Auto-chapter detection on every lecture meant teachers stopped manually marking chapters. The transcript + summary feed our content completion + share events you can feed into your recommendation engine. One API call, three workflows automated.”

MyClassboard

K-12 EdTech

Separate AI pipeline to maintain

Knovo + MyClassboard

1 API call

For upload + encode + AI

Both customers

Native language

Transcripts built in

30+ languages supported

Inline billing

AI cost on the same invoice

Both customers

Capabilities that ship

Nine AI primitives, one toggle.

Meeting notes agent

Joins the meeting automatically and on time, records it, produces a summary, action items, and key decisions, then sends the structured payload to your webhook.

Notes agent guide

Auto-chapter detection

Chapters detected from visual + audio cues. Chapter timestamps in webhook.

Chapter detection guide

AI Video search

GET /search returns matches with timestamps. Powers in-product help, lecture review.

Request access

Summary + Named entity recognition

Structured JSON entities. Topic tags.

Summary + NER guide

Video moderation

Threshold-tunable classifier. Webhook fires on flag. Audit log per decision.

Moderation guide

Scene analysis + AI Video clipping

Detect high-engagement moments using scene analysis and make those clips available for social distribution.

Request access

Verified counts, In-Video AI

9

AI features per asset, one call

Inline with encoding

30+

Languages for transcripts + captions

EN, ES, PT, FR, DE, HI, AR, etc.

Inline

AI runs in encoding pipeline

No second pipeline

Per-minute

Same billing as encoding

No separate AI invoice

Tech specs

What In-Video AI handles.

Features, languages, output formats, integration patterns.

Transcription languages: English, Spanish, Portuguese, French, German, Hindi, Tamil, Telugu, Bahasa, Vietnamese, Thai, Arabic (30+ total)

Caption formats: WebVTT, SRT, TTML, burn-in subtitles

Chapter detection: visual scene boundaries, audio segment boundaries, title generation, timestamp accuracy ~1 second

Search: conversational queries, returns timestamp + context, cross-asset library search, sub-second response

Summary: Markdown output, 100-300 words (configurable), topic tags included, multi-language summaries

Moderation: NSFW classifier, tunable threshold, per-frame flags, configurable policy categories, webhook on flag, audit log

NER: people, organizations, places, products, dates, custom entity types

Integration: inline (asset.create flag), webhook delivery, editable in dashboard, API for re-processing

Questions developers ask

In-Video AI questions, answered.

How is In-Video AI different from running my own model?
You don't pick or maintain the model. Set a flag at upload time and the feature is part of the asset. We tune the underlying model. Same per-minute billing as encoding.
How many languages do transcripts support?
30+ languages. Common ones (English, Spanish, Portuguese, Kannada, Malayalam, Hindi, Tamil, Telugu, Bahasa, Vietnamese, Thai, Arabic, French, German) are tested daily.
How does AI video search work?
/v1/video/{id}/search?q=... returns matches with timestamps. The result tells you exactly which second of which video matched, and the surrounding transcript context.
Is moderation real-time?
For VOD, moderation flags arrive within minutes of upload completion. For live, moderation runs on the stream itself and a webhook fires on each flag.
How do I migrate from AWS Rekognition + Transcribe?
Move the AI flags to the asset.create call instead of running a second pipeline after encoding. Cost typically drops because there's no second egress + ingest cycle.
Are Notes Agent and Clipping Agent built on top of this?
Yes. Agents are pre-built composite endpoints over In-Video AI. Notes Agent uses transcript + chapter + entity extraction. Clipping Agent uses scene detection + face tracking. Both are available on request.
Can I use just transcription without the rest?
Yes. Toggle individual flags. Pay only for what you turn on.
How accurate are auto-captions?
97%+ accuracy on common languages with clear audio. Lower in noisy environments or strong accents. Always editable.
How is In-Video AI different from Cloudflare Stream?
Cloudflare Stream does not currently offer native In-Video AI features. Adding them means stitching a second model pipeline. FastPix runs AI inline with encoding.
How is In-Video AI different from Cloudinary Video?
Cloudinary's video AI is selective; image AI is their primary focus. In-Video AI is video-first across captions, chapters, search, summary, NER, and moderation.
Can I tune the NSFW threshold?
Yes. Set threshold in the moderation config (0.0 to 1.0). Webhook fires when any frame exceeds threshold.
Does it work on live streams too?
Yes. Live captions and live NSFW moderation are supported.
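
The search answer above gives the path `/v1/video/{id}/search?q=...`. A minimal sketch of building that URL safely is shown below. The host is taken from the curl sample on this page and is assumed to serve the search path as well; `search_url` is a hypothetical helper, and the response format is not shown here because this page does not document it.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://api.fastpix.com"  # host taken from the curl sample


def search_url(video_id, query):
    """Build the search URL for one video, escaping the id and the query."""
    path = f"/v1/video/{quote(video_id, safe='')}/search"
    return f"{API_BASE}{path}?{urlencode({'q': query})}"


if __name__ == "__main__":
    print(search_url("c9ed7167-16e8-45a9-a1ad-170489a94785", "live streaming"))
```

Encoding the query with `urlencode` matters for conversational searches, which often contain spaces, question marks, or ampersands.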

Pricing

Per-minute AI processing.

Inline with encoding. Same per-minute billing model. See full pricing.

TRANSCRIPTION + CAPTIONS

Per minute transcribed.

$0.048 / minute

30+ languages with VTT and SRT export. Same rate for live auto-generated subtitles.

30+ languages
VTT + SRT export
Speaker diarization

SEARCH + SUMMARY + NER

Per minute analyzed.

$0.0035 / minute

Markdown summaries, structured entities, video chapters, and a conversational search endpoint. Same per-minute rate across NER, chapters, and summary.

Conversational search endpoint
Markdown summary
Named-entity recognition + chapters

MODERATION + SCENE DETECTION

Per minute moderated.

$0.10 / minute

Tunable NSFW / profanity classifier with audit-log-grade webhooks.

NSFW + profanity detection
Threshold tunable per workload
Webhook + audit log per decision
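
To see how the three per-minute rates add up, here is a small estimator. The rates are the ones quoted above. Treating each tier as billed once per minute of video, and the search + summary + NER tier as a single bundle, is an assumption about how the tiers combine; the tier keys are made up for this example.

```python
from decimal import Decimal

# Per-minute rates quoted on this page, in US dollars.
RATES = {
    "transcription": Decimal("0.048"),
    "search_summary_ner": Decimal("0.0035"),
    "moderation": Decimal("0.10"),
}


def estimate_cost(minutes, features):
    """Return the estimated AI cost for `minutes` of video.

    `features` is an iterable of keys from RATES. Each enabled tier is
    charged once per minute, which is an assumption about tier stacking.
    """
    minutes = Decimal(str(minutes))
    return sum((RATES[f] * minutes for f in features), Decimal("0"))


if __name__ == "__main__":
    total = estimate_cost(60, ["transcription", "search_summary_ner", "moderation"])
    print(f"${total:.2f}")
```

Under these assumptions, a 60-minute video with all three tiers enabled works out to 60 x (0.048 + 0.0035 + 0.10) = $9.09. `Decimal` is used so the small per-minute rates do not pick up floating-point rounding error.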

Three ways to get unstuck

Whatever kind of help you need, there is a path.

Engineering support

Talk to a video engineer.

Stuck on an API call, a webhook signature, or a player integration? Reach the engineering team directly. Response within hours, not days.

Contact engineering

Integration help

Docs, code samples, video tutorials.

Self-serve resources for the most common integrations. Quickstart guides, SDK examples, and detailed playback logs in your dashboard.

Browse the docs

Solution architect

Plan the rollout with a human.

New integration, migration off another platform, or a complex multi-tenant build. Book a session with a FastPix solution architect.

Join the Slack community

Developer resources

Everything you need to start building.

Five-minute quick-start

Full API reference

Webhook reference

Code samples

Slack community

Service status