Project · MIT
Offline, privacy-first AI video analysis with local detection, captioning, transcription, audio events, and evidence-grounded summaries. Runs entirely on your machine — no cloud, no data upload, no telemetry.
AI-Powered Video Analyzer is an open-source Python tool that analyzes a video file through a local AI pipeline and produces structured outputs — without sending any data to a server. Detection, captioning, transcription, audio analysis, and summarization all run on your own hardware.
The CLI entry point is ai-video-analyzer. The primary detection backend is
VisionServeX with D-FINE models. Captioning, transcription, and audio event
detection are optional and enabled by the [full] install. LLM summarization is
also optional and requires a running Ollama instance.
Sending video to a cloud API means that frames, audio, and potentially sensitive content leave your machine. For research, journalism, legal review, or any privacy-sensitive workflow, local processing keeps data under your control. This tool was built to make local video AI accessible without giving up accuracy or structured outputs.
| Stage | Technology | Status |
|---|---|---|
| Adaptive frame sampling | Built-in (scene + motion aware) | Required — always runs |
| Object detection | VisionServeX D-FINE (COCO-80) | Required — primary backend |
| Scene captioning | BLIP (Salesforce) | Optional — needs [full] |
| Speech transcription | Whisper (OpenAI) | Optional — needs [full] |
| Audio event detection | PANNs CNN14 | Optional — needs [full] |
| LLM summarization | Ollama (any local model) | Optional — needs Ollama running |
ASCII pipeline diagram:
Video file
  │
  ├─ Adaptive frame sampling (scene + motion aware)
  │
  ├─ Object detection → VisionServeX D-FINE (per frame)
  ├─ Scene captioning → BLIP (optional)
  │
  ├─ Audio extraction
  │    ├─ Transcription → Whisper (optional)
  │    └─ Audio events → PANNs CNN14 (optional)
  │
  └─ LLM summarization → Ollama (optional)
       │
       └─ AnalysisReport
            ├─ <video>_analysis.json
            ├─ <video>_analysis.md
            └─ report.txt
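The sampling stage is described only as "scene + motion aware", so the following is a minimal sketch of one common way such a sampler can work, not the project's implementation. It keeps a frame when it differs enough from the last kept frame, or when too many frames have passed without a keep. The function names, the `threshold` and `max_gap` parameters, and the flat-list frame representation are all assumptions made for illustration.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[int], b: Sequence[int]) -> float:
    """Mean absolute per-pixel difference between two equal-sized grayscale frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_frames(
    frames: Sequence[Sequence[int]],
    threshold: float = 10.0,  # hypothetical change threshold (0-255 scale)
    max_gap: int = 30,        # hypothetical upper bound on frames between keeps
) -> List[int]:
    """Return indices of frames to keep.

    A frame is kept when it differs enough from the last kept frame
    (content changed) or when max_gap frames have passed since the last
    kept frame, so static scenes are still sampled occasionally.
    """
    if not frames:
        return []
    kept = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        changed = mean_abs_diff(frames[kept[-1]], frames[i]) >= threshold
        stale = i - kept[-1] >= max_gap
        if changed or stale:
            kept.append(i)
    return kept
```

Comparing against the last kept frame rather than the immediately preceding one means slow drift eventually triggers a keep, while a static shot costs only one detection pass per `max_gap` frames.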
# 1. Clone and install (detection only)
git clone https://github.com/arashsajjadi/ai-powered-video-analyzer.git
cd ai-powered-video-analyzer
python -m pip install -U pip
python -m pip install -e ".[vision]"

# 2. Check your environment
ai-video-analyzer doctor

# 3. Analyze a video
ai-video-analyzer analyze "/path/to/video.mp4" --preset balanced
Before running an analysis, verify that all required and optional dependencies are in place:
ai-video-analyzer doctor
Checks: Python version, OpenCV, VisionServeX, D-FINE model registry, ffmpeg, PyTorch/GPU,
Whisper, BLIP, PANNs, Ollama, moviepy, Tesseract. A missing required dependency is
marked with ✗. A missing optional dependency shows an install hint. The command
exits 0 if all required dependencies are present.
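For readers who want a similar pre-flight check in their own scripts, here is a minimal sketch of the same idea using only the standard library. It is not the code behind `ai-video-analyzer doctor`. The helper names are hypothetical, and returning 1 when a required dependency is missing is an assumption (the text above only states the exit code for the success case).

```python
import importlib.util
import shutil
from typing import Dict


def has_module(name: str) -> bool:
    """True if a top-level Python module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def has_tool(name: str) -> bool:
    """True if an executable with this name is on PATH."""
    return shutil.which(name) is not None


def report(required: Dict[str, bool], optional: Dict[str, bool]) -> int:
    """Print one line per dependency and return a process exit code.

    Returns 0 only when every required dependency is present. The
    non-zero value for the failure case is an assumption.
    """
    for name, ok in required.items():
        print(f"[{'ok' if ok else 'MISSING'}] {name} (required)")
    for name, ok in optional.items():
        print(f"[{'ok' if ok else 'missing'}] {name} (optional)")
    return 0 if all(required.values()) else 1


def run_checks() -> int:
    # Which items count as required here is illustrative only.
    required = {"opencv (cv2)": has_module("cv2")}
    optional = {
        "torch": has_module("torch"),
        "whisper": has_module("whisper"),
        "ffmpeg": has_tool("ffmpeg"),
        "ollama": has_tool("ollama"),
    }
    return report(required, optional)
```

Calling `run_checks()` prints the status lines and returns the exit code, which a script could pass to `sys.exit`.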
Detection uses VisionServeX D-FINE models covering the COCO-80 classes. The timings below were benchmarked on an RTX 5080:
| Preset | Model | ~ms/frame | Notes |
|---|---|---|---|
| fast | dfine-n | 17–21 | Speed-first; good accuracy |
| balanced | dfine-s | 17–27 | Default; best all-round |
| quality | dfine-m | 21–32 | Highest COCO accuracy |
| quality+ | dfine-l | higher | Maximum accuracy, slowest |
ai-video-analyzer analyze video.mp4 --preset fast
ai-video-analyzer analyze video.mp4 --preset quality

# Override with a specific model ID
ai-video-analyzer analyze video.mp4 --model dfine-s

# List all available models
ai-video-analyzer list-models
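To get a rough feel for what a preset costs, the table's per-frame ranges can be multiplied by the number of sampled frames. The sketch below does that arithmetic. The `PRESETS` dictionary and `estimate_detection_seconds` helper are illustrative names, not part of the tool, and the figures are the RTX 5080 numbers from the table, so they will not transfer directly to other hardware.

```python
from typing import Dict, Optional, Tuple

# preset -> (model ID, (low ms/frame, high ms/frame)) copied from the table above.
# "quality+" has no numeric range in the table, so it is left as None.
PRESETS: Dict[str, Tuple[str, Optional[Tuple[float, float]]]] = {
    "fast": ("dfine-n", (17.0, 21.0)),
    "balanced": ("dfine-s", (17.0, 27.0)),
    "quality": ("dfine-m", (21.0, 32.0)),
    "quality+": ("dfine-l", None),
}


def estimate_detection_seconds(
    preset: str, n_frames: int
) -> Optional[Tuple[float, float]]:
    """Return (low, high) estimated detection time in seconds, or None if unknown.

    Covers detection only: model load, warmup, and optional stages are excluded.
    """
    _model, ms_range = PRESETS[preset]
    if ms_range is None:
        return None
    low_ms, high_ms = ms_range
    return (n_frames * low_ms / 1000.0, n_frames * high_ms / 1000.0)
```

For example, 100 sampled frames on the balanced preset come out at roughly 1.7 to 2.7 seconds of detection time on the benchmarked GPU.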
Install optional dependencies to enable captioning, transcription, and audio events:
# Full pipeline (all optional stages)
python -m pip install -r pip_requirements.txt

# ffmpeg — required for audio stages (Ubuntu/Debian)
sudo apt install ffmpeg

# Ollama — for local LLM summarization
curl -fsSL https://ollama.com/install.sh | sh
ollama serve
ollama pull phi4:latest
# Full analysis with Ollama summarization
ai-video-analyzer analyze video.mp4 \
  --preset balanced \
  --whisper-model base \
  --ollama-model phi4:latest \
  --output-dir ./results
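When driving the analyzer from a Python script (for example to process a folder of videos), one option is to build the same command line and hand it to `subprocess`. The sketch below uses only the flags shown above. The helper names are hypothetical, and it assumes `ai-video-analyzer` is installed and on PATH.

```python
import subprocess
from typing import List, Optional


def build_analyze_command(
    video: str,
    preset: str = "balanced",
    whisper_model: Optional[str] = None,
    ollama_model: Optional[str] = None,
    output_dir: Optional[str] = None,
) -> List[str]:
    """Build the argv list for `ai-video-analyzer analyze`.

    Optional flags are added only when a value is given.
    """
    cmd = ["ai-video-analyzer", "analyze", video, "--preset", preset]
    if whisper_model is not None:
        cmd += ["--whisper-model", whisper_model]
    if ollama_model is not None:
        cmd += ["--ollama-model", ollama_model]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    return cmd


def run_analysis(video: str, **options: str) -> int:
    """Run the CLI and return its exit code (requires the tool to be installed)."""
    completed = subprocess.run(build_analyze_command(video, **options), check=False)
    return completed.returncode
```

Passing the arguments as a list rather than a shell string avoids quoting problems with video paths that contain spaces.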
| File | Contents |
|---|---|
| <video>_analysis.json | Full structured data: detections, captions, transcript, timings, preset, limitations |
| <video>_analysis.md | Human-readable Markdown report with tables |
| report.txt | Plain-text legacy report |
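The JSON report is the natural input for downstream scripts. Because the exact schema is not documented here, the sketch below does not assume any field names. It loads the file and lists whatever top-level keys it finds, together with their types. `load_report` and `describe` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Any, Dict, Union


def load_report(path: Union[str, Path]) -> Dict[str, Any]:
    """Load an analysis JSON file and require a JSON object at the top level."""
    with Path(path).open(encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


def describe(report: Dict[str, Any]) -> Dict[str, str]:
    """Map each top-level key to the Python type name of its value."""
    return {key: type(value).__name__ for key, value in report.items()}
```

Running `describe(load_report("results/<video>_analysis.json"))` on a real report is a quick way to discover the actual structure before writing code that depends on specific fields.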
# Single preset benchmark
ai-video-analyzer benchmark "/path/to/video.mp4" --preset balanced

# Compare fast / balanced / quality side-by-side
ai-video-analyzer benchmark "/path/to/video.mp4" --compare
Example benchmark output (RTX 5080, 8.2 s clip, 8 frames selected):
Video duration : 8.2s
Frames selected : 8 (sampling: 0.08s)
Model load+warmup : 0.52s
Detection runtime : 0.15s (18.6ms/frame, 53.8fps)
Total detections : 48
Top labels : dog(9), person(7)
Model : dfine-s
Real-world benchmark results are in reports/benchmarks/ in the repository.
These numbers reflect a specific GPU and clip; actual results vary by hardware and video content.
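The per-frame latency and throughput in the sample output follow directly from the detection runtime and the number of frames. The short sketch below shows the arithmetic. `per_frame_stats` is an illustrative helper, not part of the tool.

```python
from typing import Tuple


def per_frame_stats(detection_seconds: float, n_frames: int) -> Tuple[float, float]:
    """Return (milliseconds per frame, frames per second) for a detection run."""
    if detection_seconds <= 0 or n_frames <= 0:
        raise ValueError("runtime and frame count must be positive")
    ms_per_frame = detection_seconds * 1000.0 / n_frames
    fps = n_frames / detection_seconds
    return ms_per_frame, fps
```

Note that the runtime printed in the example (0.15 s) is rounded: recomputing from it gives 18.75 ms/frame, while the reported 18.6 ms/frame corresponds to an unrounded runtime of about 0.1488 s for 8 frames.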
| Dependency | Purpose | How to install |
|---|---|---|
| VisionServeX (required) | D-FINE object detection | pip install -e ".[vision]" |
| Whisper | Speech transcription | pip install -r pip_requirements.txt |
| BLIP | Scene captioning | pip install -r pip_requirements.txt |
| PANNs CNN14 | Audio event detection | pip install -r pip_requirements.txt |
| ffmpeg | Audio extraction (required for audio stages) | sudo apt install ffmpeg |
| Ollama | Local LLM summarization | ollama.com/install.sh |
| Claim | Status |
|---|---|
| Fire / smoke / weather detection | Not supported as object classes (COCO-80 limitation) |
| Human-level video understanding | Not claimed — output is structured detection + captioning data |
| Real-time performance | Not benchmarked for streaming; designed for offline file analysis |
| Medical / security / compliance certification | No certification — research/development tool |
| Guaranteed zero false positives | Not claimed — detection accuracy depends on model and scene |
| Cloud deployment | Not supported — offline-only design by intent |
AI-Powered Video Analyzer uses VisionServeX as its primary detection backend. VisionServeX is a separate open-source project that provides a local-first Python framework for serving modern computer vision models — including D-FINE, SAM, DINOv2, RF-DETR, and others — through a unified Python API and HTTP gateway.
VisionServeX is licensed under Apache-2.0. AI-Powered Video Analyzer is MIT. Both are independent packages; VisionServeX has many uses beyond video analysis, and AI-Powered Video Analyzer depends on VisionServeX only for its detection stage.
A legacy YOLO-based detection backend (--backend legacy_yolo) is available
via the [legacy-yolo] extra, but it is not recommended for new use. VisionServeX
D-FINE is the default and supported backend.
Developed by Arash Sajjadi, University of Saskatchewan. D-FINE reference: Peng et al. (2024), arXiv:2410.13842.