In-Video AI

AI video intelligence as one API.

Auto-clipping. Multimodal search. Scene + object detection. Captions and translations in 10+ languages. Built into FastPix Agents: Notes, Clipping, Breaking News.

curl -X POST https://api.fastpix.com/v1/on-demand \
  -H "Content-Type: application/json" \
  -u "<username>:<password>" \
  -d '{
    "inputs": [
      {
        "type": "video",
        "url": "https://static.fastpix.com/fp-sample-video.mp4"
      }
    ],
    "accessPolicy": "public",
    "metadata": {
      "key1": "value1"
    },
    "maxResolution": "1080p",
    "mediaQuality": "standard",
    "moderation": {
      "type": "video"
    },
    "namedEntities": true,
    "chapters": true,
    "generate": true
  }'

  • 9 features per asset, one call
  • Native not bolt-on, no second pipeline
  • Same billing as upload + encode

Knovo built lecture search on In-Video AI. MyClassboard uses chapter detection + transcripts for K-12 content. Multiple OTT customers use NSFW moderation as table stakes.


Why In-Video AI exists

Three reasons it ships in the core API, not as a second pipeline.

01

One API call, not two.

Set in_video_ai flags on the same asset.create call that uploads and encodes. Results are ready when encoding finishes. No second model to call, no second invoice.

02

No model picking required.

We pick and tune the underlying model: scene analysis, chapters, transcripts, search, summary, NER, moderation. You get features, not a model gateway.

03

Same billing as the rest.

AI features are priced per-minute alongside encoding. No separate AI invoice. No surprise per-token billing.
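The single-call pattern from reason 01 can be sketched in Python. This is a minimal sketch, not an official client: the endpoint and field names (`inputs`, `accessPolicy`, `chapters`, `namedEntities`, `moderation`) are copied from the curl example on this page, while the helper names `build_payload` and `build_request` are hypothetical.

```python
import base64
import json
import urllib.request

API_URL = "https://api.fastpix.com/v1/on-demand"  # endpoint from the curl example


def build_payload(video_url, chapters=True, named_entities=True, moderation=True):
    """Build one request body that covers upload, encoding, and AI flags."""
    payload = {
        "inputs": [{"type": "video", "url": video_url}],
        "accessPolicy": "public",
        "maxResolution": "1080p",
        "mediaQuality": "standard",
    }
    # AI features are plain fields on the same body, so there is no second call.
    if chapters:
        payload["chapters"] = True
    if named_entities:
        payload["namedEntities"] = True
    if moderation:
        payload["moderation"] = {"type": "video"}
    return payload


def build_request(payload, username, password):
    """Wrap the payload in a POST request with HTTP Basic auth (curl's -u)."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    body = build_payload("https://static.fastpix.com/fp-sample-video.mp4")
    print(json.dumps(body, indent=2))
    # To send it: urllib.request.urlopen(build_request(body, "<username>", "<password>"))
```

Keeping the payload builder separate from the network call makes it easy to inspect or log the exact body before anything is sent.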

How it works

Nine AI features, one toggle.

TL;DR: 3 steps from file to AI output

01

Set AI flags

Pick which features you want at upload time. Flags live on the same asset.create call.

02

AI runs inline

AI processing runs as part of the encoding pipeline. No separate model call.

03

Receive structured output

Webhook fires with structured JSON: chapters, transcript, summary, NER, moderation flags.
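On the receiving side, step 03 comes down to routing each webhook body by its `type` field. The sketch below assumes only the one event type that appears in this page's sample payload (`video.mediaAI.moderation.ready`); the registry and the `dispatch` helper are hypothetical, and any other event names would need to come from the API reference.

```python
import json

HANDLERS = {}


def on(event_type):
    """Register a handler for one webhook event type."""
    def register(func):
        HANDLERS[event_type] = func
        return func
    return register


@on("video.mediaAI.moderation.ready")  # event type from this page's sample payload
def handle_moderation(event):
    media_id = event["object"]["id"]
    scores = event["data"]["moderationResult"]["categoryScores"]
    return f"moderation ready for {media_id}: {len(scores)} category scores"


def dispatch(raw_body):
    """Parse a webhook body and route it by its "type" field."""
    event = json.loads(raw_body)
    handler = HANDLERS.get(event.get("type"))
    if handler is None:
        # Unknown events are ignored rather than treated as errors.
        return f"ignored event type: {event.get('type')}"
    return handler(event)
```

Unknown event types fall through harmlessly, so new features can start firing webhooks without breaking an existing receiver.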

Speaker-attributed diarization + transcripts

Export the full transcript as text, VTT, or SRT, and display it via FastPix Player or any HLS player.

  • 30+ languages supported
  • Full transcript exported as text, VTT, or SRT
  • Native-language transcripts generated from the original audio
  • Speaker diarization (who said what, when)

languages: ['en', 'es', 'pt', 'hi']
format: VTT
speaker_diarization: true
avg_accuracy: 97.2%
download_url: /transcripts/abc123.vtt
editable: via dashboard API
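Once a VTT transcript is downloaded, "who said what, when" can be read straight from the cues. The sketch below assumes speakers are marked with standard WebVTT voice spans (`<v Name>`); this page does not confirm how diarization is encoded in the file, so treat that as an assumption. The sample text and helper names are hypothetical.

```python
import re

TIMESTAMP = r"((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})"
CUE_TIMING = re.compile(TIMESTAMP + r"\s+-->\s+" + TIMESTAMP)
VOICE = re.compile(r"<v(?:\.[^\s>]+)?\s+([^>]+)>")  # WebVTT voice span, e.g. <v Sarah>
TAG = re.compile(r"<[^>]+>")


def to_seconds(timestamp):
    """Convert "hh:mm:ss.mmm" or "mm:ss.mmm" to seconds."""
    parts = timestamp.split(":")
    hours = int(parts[-3]) if len(parts) == 3 else 0
    return hours * 3600 + int(parts[-2]) * 60 + float(parts[-1])


def parse_cues(vtt_text):
    """Return one dict per cue: start, end, speaker (or None), and plain text."""
    cues = []
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            timing = CUE_TIMING.search(line)
            if timing:
                payload = " ".join(lines[i + 1:])
                voice = VOICE.search(payload)
                cues.append({
                    "start": to_seconds(timing.group(1)),
                    "end": to_seconds(timing.group(2)),
                    "speaker": voice.group(1).strip() if voice else None,
                    "text": TAG.sub("", payload).strip(),
                })
                break
    return cues


SAMPLE = """WEBVTT

00:00:01.000 --> 00:00:04.000
<v Sarah>Welcome to the stream.

00:00:04.500 --> 00:00:07.250
<v Ravi>Thanks for having me.</v>
"""

if __name__ == "__main__":
    for cue in parse_cues(SAMPLE):
        print(cue)
```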

Auto-chapters + structured entities.

Chapters are detected from visual + audio cues, and each chapter is queryable. Named entities (people, places, products, dates) are returned as structured JSON. Use them as preview text, recommendation seeds, or search-index input.

  • Chapters with title + timestamp
  • Returns matches with timestamps
  • Named Entity Recognition (people, orgs, places, products)
  • Powers in-product video help, lecture review, support libraries

curl -X POST https://api.fastpix.com/v1/on-demand \
  -H "Content-Type: application/json" \
  -u "<username>:<password>" \
  -d '{
    "inputs": [
      {
        "type": "video",
        "url": "https://static.fastpix.com/fp-sample-video.mp4"
      }
    ],
    "accessPolicy": "public",
    "metadata": {
      "key1": "value1"
    },
    "maxResolution": "1080p",
    "mediaQuality": "standard",
    "namedEntities": true,
    "chapters": true
  }'
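
With chapters in hand, a common player-side task is finding which chapter covers the current playback position. The sketch below assumes each chapter is a dict with a `title` and a `start` time in seconds; that shape is hypothetical, since this page does not show the chapter response format.

```python
from bisect import bisect_right


def chapter_at(chapters, second):
    """Return the chapter covering `second`, or None before the first chapter."""
    ordered = sorted(chapters, key=lambda chapter: chapter["start"])
    starts = [chapter["start"] for chapter in ordered]
    # bisect_right finds the first chapter starting after `second`;
    # the chapter just before it is the one currently playing.
    index = bisect_right(starts, second) - 1
    return ordered[index] if index >= 0 else None


# Hypothetical chapter list, for illustration only.
CHAPTERS = [
    {"title": "Intro", "start": 0},
    {"title": "Setup", "start": 95},
    {"title": "Demo", "start": 310},
]

if __name__ == "__main__":
    print(chapter_at(CHAPTERS, 120))
```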

AI Video clipping + AI Reframing + Scene analysis

Detect high-engagement moments using scene analysis and make those clips available for social distribution. Reframe clips for vertical, square, and landscape formats.

  • Generate short clips based on topics, keywords, or speaker activity
  • Reframe videos for vertical (9:16), square (1:1), and landscape (16:9) formats
  • Optimized for Shorts, Reels, TikTok, and social distribution
  • Returned as clips with timestamps and metadata
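
The geometry behind the reframing ratios listed above is simple to work through. The sketch below computes a centered crop only; it is not FastPix's reframing logic, which this page does not describe, and a real reframer would likely follow the subject rather than the frame center. The function name is hypothetical.

```python
from fractions import Fraction


def center_crop(width, height, aspect_w, aspect_h):
    """Largest centered crop with the target aspect ratio.

    Returns (x, y, crop_width, crop_height) in pixels.
    """
    target = Fraction(aspect_w, aspect_h)
    if Fraction(width, height) > target:
        # Source is wider than the target: keep full height, trim the sides.
        crop_h = height
        crop_w = int(height * target)
    else:
        # Source is narrower (or equal): keep full width, trim top and bottom.
        crop_w = width
        crop_h = int(width / target)
    # Round down to even dimensions, which most video encoders expect.
    crop_w -= crop_w % 2
    crop_h -= crop_h % 2
    return ((width - crop_w) // 2, (height - crop_h) // 2, crop_w, crop_h)


if __name__ == "__main__":
    for name, ratio in {"vertical": (9, 16), "square": (1, 1), "landscape": (16, 9)}.items():
        print(name, center_crop(1920, 1080, *ratio))
```

For a 1920x1080 source, the vertical crop keeps a 606x1080 window, which is why vertical reframing discards roughly two thirds of a landscape frame.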

summary: 200-word
entities: [{type: 'PERSON', name: 'Sarah'}, ...]
topics: ['live streaming', 'monetization']
nsfw_score: 0.02 (safe)
scene_changes: 12 detected
language_detected: en-US

AI Video Search + Video moderation

NSFW detection, policy violation flags, scene change detection, and content classification run on every asset. The threshold is tunable. AI video search returns the exact second a topic was mentioned.

  • NSFW classifier with threshold
  • Policy violation flags (configurable per platform)
  • AI video search via /search endpoint
  • Webhook fires immediately on flag

{
  "type": "video.mediaAI.moderation.ready",
  "object": {
    "type": "media",
    "id": "c9ed7167-16e8-45a9-a1ad-170489a94785"
  },
  "id": "44ba6038-bb03-4517-b17c-db27c6c10836",
  "workspace": {
    "name": "Dashboard videos",
    "id": "4fa0e115-9209-4b7f-b498-39c750c82bc4"
  },
  "status": "ready",
  "data": {
    "isModerationGenerated": true,
    "moderationResult": {
      "categoryScores": []
    }
  },
  "createdAt": "2026-06-25T10:10:29.052566688Z",
  "attempts": []
}
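
A tunable threshold can also be applied on the receiving side of this webhook. The sketch below walks `data.moderationResult.categoryScores` from the payload above. Because the sample array is empty, the entry shape used here (`category` and `score` keys) is an assumption, as are the default threshold of 0.5 and the function name.

```python
def flagged_categories(event, threshold=0.5):
    """Return score entries at or above the threshold, highest score first.

    Assumes each entry looks like {"category": "nsfw", "score": 0.02};
    the real entry shape is not shown on this page.
    """
    scores = (
        event.get("data", {})
        .get("moderationResult", {})
        .get("categoryScores", [])
    )
    hits = [entry for entry in scores if entry["score"] >= threshold]
    return sorted(hits, key=lambda entry: entry["score"], reverse=True)


if __name__ == "__main__":
    # Hypothetical event with made-up scores, for illustration only.
    event = {
        "type": "video.mediaAI.moderation.ready",
        "data": {
            "moderationResult": {
                "categoryScores": [
                    {"category": "nsfw", "score": 0.02},
                    {"category": "violence", "score": 0.81},
                ]
            }
        },
    }
    print(flagged_categories(event, threshold=0.5))
```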

Security, compliance, and partnerships

Partner: NVIDIA Inception
Partner: Google Cloud Partner

Customers

AI video search across 200+ hours of educational video was a whole separate roadmap for us. With In-Video AI, the search endpoint shipped with our standard upload flow. No second pipeline.

Knovo product team

Microlearning AI search

Auto-chapter detection on every lecture meant teachers stopped manually marking chapters. The transcript and summary feed our recommendation engine. One API call, three workflows automated.

MyClassboard

K-12 EdTech

  • 0: Separate AI pipeline to maintain (Knovo + MyClassboard)
  • 1 API call: For upload + encode + AI (both customers)
  • Native language: Transcripts built in (30+ languages supported)
  • Inline billing: AI cost on the same invoice (both customers)

Verified counts, In-Video AI

  • 9: AI features per asset, one call (inline with encoding)
  • 30+: Languages for transcripts + captions (EN, ES, PT, FR, DE, HI, AR, etc.)
  • Inline: AI runs in encoding pipeline (no second pipeline)
  • Per-minute: Same billing as encoding (no separate AI invoice)

Tech specs

What In-Video AI handles.

Features, languages, output formats, integration patterns.

Transcription languages

  • English
  • Spanish
  • Portuguese
  • French
  • German
  • Hindi
  • Tamil
  • Telugu
  • Bahasa
  • Vietnamese
  • Thai
  • Arabic
  • 30+ total

Caption formats

  • WebVTT
  • SRT
  • TTML
  • Burn-in subtitles

Chapter detection

  • Visual scene boundaries
  • Audio segment boundaries
  • Title generation
  • Timestamp accuracy ~1 second

Search

  • Conversational queries
  • Returns timestamp + context
  • Cross-asset library search
  • Sub-second response

Summary

  • Markdown output
  • 100-300 words configurable
  • Topic tags included
  • Multi-language summaries

Moderation

  • NSFW classifier
  • Tunable threshold
  • Per-frame flags
  • Configurable policy categories
  • Webhook on flag
  • Audit log

NER

  • People
  • Organizations
  • Places
  • Products
  • Dates
  • Custom entity types

Integration

  • Inline (asset.create flag)
  • Webhook delivery
  • Editable in dashboard
  • API for re-processing

Questions developers ask

In-Video AI questions, answered.

  • How is In-Video AI different from running my own model?

    You don't pick or maintain the model. Set a flag at upload time and the feature is part of the asset. We tune the underlying model. Same per-minute billing as encoding.
  • How many languages do transcripts support?

    30+ languages. Common ones (English, Spanish, Portuguese, Kannada, Malayalam, Hindi, Tamil, Telugu, Bahasa, Vietnamese, Thai, Arabic, French, German) are tested daily.
  • How does AI video search work?

    /v1/video/{id}/search?q=... returns matches with timestamps. The result tells you exactly which second of which video matched, and the surrounding transcript context.
  • Is moderation real-time?

    For VOD, moderation flags arrive within minutes of upload completion. For live, moderation runs on the live stream and a webhook fires on each flag.
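
The search call from the FAQ above can be sketched as a URL builder. The path `/v1/video/{id}/search?q=...` comes from this page; the base URL is assumed to be the same host used in the curl examples, and `search_url` is a hypothetical helper. The response shape is not shown here, so the sketch stops at building the request URL.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.fastpix.com"  # assumption: same host as the curl examples


def search_url(video_id, query):
    """Build the search URL for one video, with the query safely encoded."""
    path = f"/v1/video/{quote(video_id, safe='')}/search"
    return f"{BASE_URL}{path}?{urlencode({'q': query})}"


if __name__ == "__main__":
    print(search_url("c9ed7167-16e8-45a9-a1ad-170489a94785", "live streaming monetization"))
```

Encoding the query with `urlencode` keeps conversational queries with spaces or punctuation from breaking the URL.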

Pricing

Per-minute AI processing.

Inline with encoding. Same per-minute billing model. See full pricing.

TRANSCRIPTION + CAPTIONS

Per minute transcribed.

$0.048 / minute

30+ languages with VTT and SRT export. Same rate for live auto-generated subtitles.

  • 30+ languages
  • VTT + SRT export
  • Speaker diarization

SEARCH + SUMMARY + NER

Per minute analyzed.

$0.0035 / minute

Markdown summaries, structured entities, video chapters, and a conversational search endpoint. Same per-minute rate across NER, chapters, and summary.

  • Conversational search endpoint
  • Markdown summary
  • Named-entity recognition + chapters

MODERATION + SCENE DETECTION

Per minute moderated.

$0.10 / minute

Tunable NSFW / profanity classifier with audit-log-grade webhooks.

  • NSFW + profanity detection
  • Threshold tunable per workload
  • Webhook + audit log per decision
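
The per-minute rates above make a rough estimate straightforward. The sketch below uses the three listed rates and assumes that enabled tiers simply add up, that the search + summary + NER tier is billed once per minute rather than per feature, and that there are no minimums or rounding rules; none of that is confirmed on this page. Encoding cost is not included.

```python
from decimal import Decimal

# Per-minute rates as listed in the pricing section above.
RATES_PER_MINUTE = {
    "transcription": Decimal("0.048"),
    "search_summary_ner": Decimal("0.0035"),
    "moderation": Decimal("0.10"),
}


def estimate_cost(minutes, features):
    """Sum the per-minute rates of the chosen tiers over the video length."""
    duration = Decimal(str(minutes))
    return sum((RATES_PER_MINUTE[name] * duration for name in features), Decimal("0"))


if __name__ == "__main__":
    total = estimate_cost(60, ["transcription", "search_summary_ner", "moderation"])
    print(f"${total:.2f} for a 60-minute video")
```

Using `Decimal` avoids floating-point drift when small rates such as $0.0035 are multiplied across long libraries.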