
AI-Powered Video Analyzer

Author: Arash Sajjadi

Offline, privacy-first AI video analysis with local detection, captioning, transcription, audio events, and evidence-grounded summaries. Runs entirely on your machine — no cloud, no data upload, no telemetry.

MIT · Python 3.10+ · Local / offline · VisionServeX D-FINE · Optional Ollama · Structured reports


Figure: chalkboard-style diagram of the local video analysis pipeline (frame sampling, object detection, captioning, transcription, audio events, and LLM summary).

What is AI-Powered Video Analyzer?

AI-Powered Video Analyzer is an open-source Python tool that analyzes a video file through a local AI pipeline and produces structured outputs — without sending any data to a server. Detection, captioning, transcription, audio analysis, and summarization all run on your own hardware.

The CLI entry point is ai-video-analyzer. The primary detection backend is VisionServeX with D-FINE models. Captioning, transcription, and audio event detection are optional and require the [full] install; LLM summarization is also optional and requires a running Ollama server.

Why local video analysis matters

Sending video to a cloud API means that frames, audio, and potentially sensitive content leave your machine. For research, journalism, legal review, or any privacy-sensitive workflow, local processing keeps data under your control. This tool was built to make local video AI accessible without giving up accuracy or structured outputs.

Pipeline overview

Stage	Technology	Status
Adaptive frame sampling	Built-in (scene + motion aware)	Required — always runs
Object detection	VisionServeX D-FINE (COCO-80)	Required — primary backend
Scene captioning	BLIP (Salesforce)	Optional — needs `[full]`
Speech transcription	Whisper (OpenAI)	Optional — needs `[full]`
Audio event detection	PANNs CNN14	Optional — needs `[full]`
LLM summarization	Ollama (any local model)	Optional — needs Ollama running

ASCII pipeline diagram:

text

Video file
    │
    ├─ Adaptive frame sampling  (scene + motion aware)
    │
    ├─ Object detection         → VisionServeX D-FINE (per frame)
    ├─ Scene captioning         → BLIP (optional)
    │
    ├─ Audio extraction
    │       ├─ Transcription    → Whisper (optional)
    │       └─ Audio events     → PANNs CNN14 (optional)
    │
    └─ LLM summarization        → Ollama (optional)
            │
            └─ AnalysisReport
                    ├─ <video>_analysis.json
                    ├─ <video>_analysis.md
                    └─ report.txt
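
The sampler's internals are not described on this page, so the following is only an illustrative sketch of what "scene + motion aware" selection could look like: keep a frame once the change accumulated since the last kept frame crosses a threshold, or once too many frames have passed. The function name, the per-frame change scores, and both parameters are hypothetical and are not taken from the tool.

```python
from typing import List, Sequence


def select_frames(change_scores: Sequence[float],
                  change_threshold: float = 1.0,
                  max_gap: int = 30) -> List[int]:
    """Pick frame indices from per-frame change scores.

    change_scores[i] is a hypothetical, non-negative measure of how much
    frame i differs from frame i - 1 (for example a mean absolute pixel
    difference). Frame 0 is always kept.
    """
    if not change_scores:
        return []
    selected = [0]
    accumulated = 0.0
    for index in range(1, len(change_scores)):
        accumulated += change_scores[index]
        gap = index - selected[-1]
        # Keep the frame on enough accumulated change (motion or a cut),
        # or when the gap since the last kept frame grows too large.
        if accumulated >= change_threshold or gap >= max_gap:
            selected.append(index)
            accumulated = 0.0
    return selected
```

With this rule a static shot is sampled sparsely (every `max_gap` frames), while a cut or a burst of motion triggers a sample right away.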

Quick start

bash

# 1. Clone and install (detection only)
git clone https://github.com/arashsajjadi/ai-powered-video-analyzer.git
cd ai-powered-video-analyzer
python -m pip install -U pip
python -m pip install -e ".[vision]"

# 2. Check your environment
ai-video-analyzer doctor

# 3. Analyze a video
ai-video-analyzer analyze "/path/to/video.mp4" --preset balanced

Doctor command

Before running an analysis, verify that all required and optional dependencies are in place:

bash

ai-video-analyzer doctor

The command checks the Python version, OpenCV, VisionServeX, the D-FINE model registry, ffmpeg, PyTorch/GPU, Whisper, BLIP, PANNs, Ollama, moviepy, and Tesseract. A missing required dependency is marked with ✗; a missing optional dependency shows an install hint. The command exits 0 if all required dependencies are present.
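
For readers who want a similar check in their own scripts, here is a minimal sketch of the required/optional idea using only the standard library. It is not the tool's implementation: the function name, the result keys, and the non-zero exit value of 1 are assumptions made for illustration.

```python
import importlib.util
import shutil
from typing import Dict, Iterable


def check_environment(required_modules: Iterable[str],
                      optional_modules: Iterable[str],
                      required_tools: Iterable[str] = ()) -> Dict[str, object]:
    """Report which top-level Python modules and CLI tools are missing."""
    def has_module(name: str) -> bool:
        # find_spec returns None when a top-level module is not installed.
        return importlib.util.find_spec(name) is not None

    missing_required = [m for m in required_modules if not has_module(m)]
    missing_required += [t for t in required_tools if shutil.which(t) is None]
    missing_optional = [m for m in optional_modules if not has_module(m)]
    return {
        "missing_required": missing_required,
        "missing_optional": missing_optional,
        # 0 when nothing required is missing; the value 1 is an assumption.
        "exit_code": 0 if not missing_required else 1,
    }
```

A call such as `check_environment(["cv2"], ["torch", "whisper"], ["ffmpeg"])` would test OpenCV as required and PyTorch and Whisper as optional. Which modules the real `doctor` command treats as required is decided by the tool, not by this sketch.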

Detection presets

Each preset selects a VisionServeX D-FINE model that detects the COCO-80 classes. Timings were benchmarked on an RTX 5080:

Preset	Model	~ms/frame	Notes
`fast`	dfine-n	17–21	Speed-first; good accuracy
`balanced`	dfine-s	17–27	Default — best all-round
`quality`	dfine-m	21–32	Higher COCO accuracy than `balanced`
`quality+`	dfine-l	higher	Maximum accuracy, slowest

bash

ai-video-analyzer analyze video.mp4 --preset fast
ai-video-analyzer analyze video.mp4 --preset quality

# Override with a specific model ID
ai-video-analyzer analyze video.mp4 --model dfine-s

# List all available models
ai-video-analyzer list-models
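
The preset table can be mirrored in a script as a small lookup. The mapping below is copied from the table. The resolver function is a hypothetical helper, and the rule that an explicit model ID wins over the preset is inferred from the `--model` override comment, not from the tool's source.

```python
from typing import Optional

# Preset-to-model mapping copied from the preset table.
PRESET_MODELS = {
    "fast": "dfine-n",
    "balanced": "dfine-s",
    "quality": "dfine-m",
    "quality+": "dfine-l",
}


def resolve_model(preset: str = "balanced", model: Optional[str] = None) -> str:
    """Return the model ID for a run; an explicit model wins over the preset."""
    if model is not None:
        return model
    try:
        return PRESET_MODELS[preset]
    except KeyError:
        raise ValueError(f"unknown preset: {preset!r}") from None
```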

Optional full pipeline

Install optional dependencies to enable captioning, transcription, and audio events:

bash

# Full pipeline (all optional stages)
python -m pip install -r pip_requirements.txt

# ffmpeg — required for audio stages (Ubuntu/Debian)
sudo apt install ffmpeg

# Ollama — for local LLM summarization
curl -fsSL https://ollama.com/install.sh | sh
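# Start the server if it is not already running. It stays in the foreground,
# so run the pull command from a second terminal.
ollama serve
ollama pull phi4:latest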
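
Before starting a long analysis it can help to confirm that the Ollama server is reachable. The sketch below assumes Ollama's default address, `http://localhost:11434`, and uses its `/api/tags` endpoint, which lists locally pulled models. The function name is hypothetical, and how the analyzer itself connects to Ollama is not documented on this page.

```python
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url."""
    try:
        # /api/tags lists local models; a 200 reply means the server is up.
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and urllib.error.URLError.
        return False
```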

bash

# Full analysis with Ollama summarization
ai-video-analyzer analyze video.mp4 \
    --preset balanced \
    --whisper-model base \
    --ollama-model phi4:latest \
    --output-dir ./results
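
When the analyzer is driven from a script, the same invocation can be assembled as an argument list. This sketch uses only flags that appear in the command above; the helper name is hypothetical. Whether the CLI treats an omitted flag as "stage disabled" is not stated on this page, so treat that behaviour as an assumption.

```python
from typing import List, Optional


def build_analyze_command(video: str,
                          preset: str = "balanced",
                          whisper_model: Optional[str] = None,
                          ollama_model: Optional[str] = None,
                          output_dir: Optional[str] = None) -> List[str]:
    """Assemble an argument list for `ai-video-analyzer analyze`."""
    command = ["ai-video-analyzer", "analyze", video, "--preset", preset]
    # Optional flags are appended only when a value is given.
    if whisper_model:
        command += ["--whisper-model", whisper_model]
    if ollama_model:
        command += ["--ollama-model", ollama_model]
    if output_dir:
        command += ["--output-dir", output_dir]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`. Passing a list avoids shell quoting problems with paths that contain spaces.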

Output files

File	Contents
`<video>_analysis.json`	Full structured data: detections, captions, transcript, timings, preset, limitations
`<video>_analysis.md`	Human-readable Markdown report with tables
`report.txt`	Plain-text legacy report
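
The JSON report is the natural input for downstream scripts. Its exact schema is not shown on this page, so the sketch below assumes a top-level `detections` list whose items each carry a `label` string. Both key names are assumptions and should be checked against a real report file.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Dict


def load_report(path: str) -> Dict[str, Any]:
    """Read an analysis JSON file into a dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def count_labels(report: Dict[str, Any]) -> Counter:
    """Count detections per label.

    ASSUMPTION: report["detections"] is a list of objects that each carry
    a "label" string. Adjust the keys to match the real file.
    """
    detections = report.get("detections", [])
    return Counter(d["label"] for d in detections if "label" in d)
```

For example, `count_labels(load_report("results/video_analysis.json")).most_common(5)` would list the five most frequent labels, provided the file follows the assumed layout.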

Benchmarking and diagnostics

bash

# Single preset benchmark
ai-video-analyzer benchmark "/path/to/video.mp4" --preset balanced

# Compare fast / balanced / quality side-by-side
ai-video-analyzer benchmark "/path/to/video.mp4" --compare

Example benchmark output (RTX 5080, 8.2 s clip, 8 frames selected):

text

  Video duration     : 8.2s
  Frames selected    : 8  (sampling: 0.08s)
  Model load+warmup  : 0.52s
  Detection runtime  : 0.15s  (18.6ms/frame, 53.8fps)
  Total detections   : 48
  Top labels         : dog(9), person(7)
  Model              : dfine-s

Real-world benchmark results are in reports/benchmarks/ in the repository. These numbers reflect a specific GPU and clip; actual results vary by hardware and video content.
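
The per-frame latency and throughput figures in the sample output are related by simple arithmetic, which is handy when estimating detection time for a longer clip. The helper names below are illustrative only.

```python
def frames_per_second(ms_per_frame: float) -> float:
    """Convert per-frame latency in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame


def detection_seconds(frame_count: int, ms_per_frame: float) -> float:
    """Estimate total detection time for a number of sampled frames."""
    return frame_count * ms_per_frame / 1000.0
```

With the sample values (8 frames at 18.6 ms/frame) these give about 53.8 fps and about 0.15 s, matching the reported detection runtime line. The estimate covers detection only, not model load, warmup, or sampling.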

Dependencies

Dependency	Purpose	How to install
VisionServeX (required)	D-FINE object detection	`pip install -e ".[vision]"`
Whisper	Speech transcription	`pip install -r pip_requirements.txt`
BLIP	Scene captioning	`pip install -r pip_requirements.txt`
PANNs CNN14	Audio event detection	`pip install -r pip_requirements.txt`
ffmpeg	Audio extraction (required for audio stages)	`sudo apt install ffmpeg`
Ollama	Local LLM summarization	`curl -fsSL https://ollama.com/install.sh | sh`

Limitations and honest boundaries

COCO-80 class set: D-FINE detects only the 80 COCO object classes, which do not include fire, smoke, or weather phenomena. On such frames D-FINE may return visually similar COCO classes instead (e.g. food textures for fire or smoke). The BLIP captioning stage can describe these scenes in natural language, but this is not equivalent to specialized fire or smoke detection.

Claim	Status
Fire / smoke / weather detection	Not supported as object classes (COCO-80 limitation)
Human-level video understanding	Not claimed — output is structured detection + captioning data
Real-time performance	Not benchmarked for streaming; designed for offline file analysis
Medical / security / compliance certification	No certification — research/development tool
Guaranteed zero false positives	Not claimed — detection accuracy depends on model and scene
Cloud deployment	Not supported — offline-only design by intent
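
A quick way to see whether a label of interest can be detected at all is to compare it against the COCO-80 class names. The list below uses the common COCO spellings. The exact strings emitted by the backend (for example "tv" or "couch") may differ, so treat the spellings and the helper function as assumptions.

```python
from typing import Iterable, List

# The 80 COCO detection class names in their common spelling (assumed to
# match the backend's output; verify against a real report).
COCO80 = frozenset(name.strip() for name in """
person, bicycle, car, motorcycle, airplane, bus, train, truck, boat,
traffic light, fire hydrant, stop sign, parking meter, bench, bird, cat,
dog, horse, sheep, cow, elephant, bear, zebra, giraffe, backpack, umbrella,
handbag, tie, suitcase, frisbee, skis, snowboard, sports ball, kite,
baseball bat, baseball glove, skateboard, surfboard, tennis racket, bottle,
wine glass, cup, fork, knife, spoon, bowl, banana, apple, sandwich, orange,
broccoli, carrot, hot dog, pizza, donut, cake, chair, couch, potted plant,
bed, dining table, toilet, tv, laptop, mouse, remote, keyboard, cell phone,
microwave, oven, toaster, sink, refrigerator, book, clock, vase, scissors,
teddy bear, hair drier, toothbrush
""".split(","))


def unsupported_labels(requested: Iterable[str]) -> List[str]:
    """Return the requested labels that are not COCO-80 class names."""
    return [label for label in requested
            if label.strip().lower() not in COCO80]
```

Note that "fire hydrant" is a COCO class while "fire" is not, so `unsupported_labels(["dog", "fire", "smoke"])` flags only the last two.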

Relationship with VisionServeX

AI-Powered Video Analyzer uses VisionServeX as its primary detection backend. VisionServeX is a separate open-source project that provides a local-first Python framework for serving modern computer vision models — including D-FINE, SAM, DINOv2, RF-DETR, and others — through a unified Python API and HTTP gateway.

VisionServeX is licensed under Apache-2.0. AI-Powered Video Analyzer is MIT. Both are independent packages; VisionServeX has many uses beyond video analysis, and AI-Powered Video Analyzer depends on VisionServeX only for its detection stage.

A legacy YOLO-based detection backend (--backend legacy_yolo) is available via the [legacy-yolo] extra, but it is not recommended for new use. VisionServeX D-FINE is the default and supported backend.

Links and credits

GitHub repository — source code, issue tracker, benchmark results
VisionServeX — local CV model gateway; primary detection backend
D-FINE — Fine-grained Distribution Refined DETR (Peng et al., 2024)
OpenAI Whisper — speech transcription
Salesforce BLIP — image/scene captioning
PANNs — audio event detection (Kong et al., 2020, IEEE/ACM TASLP)
Ollama — local LLM inference
Research mentorship: Dr. Mark Eramian, Image Lab, Department of Computer Science, University of Saskatchewan

Developed by Arash Sajjadi, University of Saskatchewan. D-FINE reference: Peng et al. (2024), arXiv:2410.13842.
