
Project · MIT

AI-Powered Video Analyzer

Offline, privacy-first AI video analysis with local detection, captioning, transcription, audio events, and evidence-grounded summaries. Runs entirely on your machine — no cloud, no data upload, no telemetry.

MIT · Python 3.10+ · Local / offline · VisionServeX D-FINE · Optional Ollama · Structured reports

[Figure: chalkboard-style diagram of the local video analysis pipeline: frame sampling, object detection, captioning, transcription, audio events, and LLM summary]

What is AI-Powered Video Analyzer?

AI-Powered Video Analyzer is an open-source Python tool that analyzes a video file through a local AI pipeline and produces structured outputs — without sending any data to a server. Detection, captioning, transcription, audio analysis, and summarization all run on your own hardware.

The CLI entry point is ai-video-analyzer. The primary detection backend is VisionServeX with D-FINE models. Captioning, transcription, and audio event detection are optional and enabled by the [full] install; LLM summarization is also optional and requires a running Ollama instance.

Why local video analysis matters

Sending video to a cloud API means that frames, audio, and potentially sensitive content leave your machine. For research, journalism, legal review, or any privacy-sensitive workflow, local processing keeps data under your control. This tool was built to make local video AI accessible without giving up accuracy or structured outputs.

Pipeline overview

| Stage | Technology | Status |
| --- | --- | --- |
| Adaptive frame sampling | Built-in (scene + motion aware) | Required; always runs |
| Object detection | VisionServeX D-FINE (COCO-80) | Required; primary backend |
| Scene captioning | BLIP (Salesforce) | Optional; needs [full] |
| Speech transcription | Whisper (OpenAI) | Optional; needs [full] |
| Audio event detection | PANNs CNN14 | Optional; needs [full] |
| LLM summarization | Ollama (any local model) | Optional; needs Ollama running |

ASCII pipeline diagram:

text
Video file
    │
    ├─ Adaptive frame sampling  (scene + motion aware)
    │
    ├─ Object detection         → VisionServeX D-FINE (per frame)
    ├─ Scene captioning         → BLIP (optional)
    │
    ├─ Audio extraction
    │       ├─ Transcription    → Whisper (optional)
    │       └─ Audio events     → PANNs CNN14 (optional)
    │
    └─ LLM summarization        → Ollama (optional)
            │
            └─ AnalysisReport
                    ├─ <video>_analysis.json
                    ├─ <video>_analysis.md
                    └─ report.txt
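
The sampling stage is described here only as scene and motion aware; its actual algorithm is not documented on this page. As a rough illustration of the general idea, the sketch below keeps a frame when its mean absolute pixel difference from the last kept frame exceeds a threshold. The function name `select_frames`, the threshold, and the difference metric are assumptions for illustration, not the tool's implementation. Frames are flat sequences of grayscale values so the sketch has no dependencies.

```python
from statistics import fmean


def select_frames(frames, threshold=10.0):
    """Return indices of frames that differ enough from the last kept frame.

    Hypothetical helper: shows motion-aware sampling in general,
    not the sampler used by ai-video-analyzer.
    """
    kept = []
    last = None
    for index, frame in enumerate(frames):
        if last is None:
            changed = True  # always keep the first frame
        else:
            # Mean absolute per-pixel difference against the last kept frame.
            changed = fmean(abs(a - b) for a, b in zip(frame, last)) > threshold
        if changed:
            kept.append(index)
            last = frame
    return kept


# Synthetic clip: 5 dark frames, then 5 bright frames (one abrupt change).
clip = [[0] * 16] * 5 + [[200] * 16] * 5
print(select_frames(clip))  # [0, 5]
```

Comparing against the last kept frame, rather than the previous frame, means slow drift eventually triggers a new sample as well as abrupt cuts.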

Quick start

bash
# 1. Clone and install (detection only)
git clone https://github.com/arashsajjadi/ai-powered-video-analyzer.git
cd ai-powered-video-analyzer
python -m pip install -U pip
python -m pip install -e ".[vision]"

# 2. Check your environment
ai-video-analyzer doctor

# 3. Analyze a video
ai-video-analyzer analyze "/path/to/video.mp4" --preset balanced

Doctor command

Before running an analysis, verify that all required and optional dependencies are in place:

bash
ai-video-analyzer doctor

Checks: Python version, OpenCV, VisionServeX, D-FINE model registry, ffmpeg, PyTorch/GPU, Whisper, BLIP, PANNs, Ollama, moviepy, Tesseract. Missing required dependencies are reported as failures; missing optional dependencies show an install hint. The command exits 0 if all required dependencies are present.
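
To make the required-versus-optional distinction concrete, here is a minimal sketch of a doctor-style check using only the standard library. It is not the tool's own code: the function names, the printed messages, the module names in the example call, and the non-zero exit value of 1 are all assumptions (the page only states that the command exits 0 when everything required is present).

```python
import importlib.util
import shutil


def find_missing(modules=(), binaries=()):
    """Return the names from `modules` and `binaries` that cannot be found."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [b for b in binaries if shutil.which(b) is None]
    return missing


def doctor(required_modules, optional_modules, required_binaries=()):
    """Return a process-style exit code: 0 if nothing required is missing."""
    missing_required = find_missing(required_modules, required_binaries)
    missing_optional = find_missing(optional_modules)
    for name in missing_required:
        print(f"[missing, required] {name}")
    for name in missing_optional:
        print(f"[missing, optional] {name} (install it to enable the matching stage)")
    # Exit value 1 is an assumed convention for "something required is missing".
    return 0 if not missing_required else 1


# Illustrative top-level import names only; the real checks may differ.
exit_code = doctor(required_modules=["cv2"], optional_modules=["whisper", "torch"])
print("exit code:", exit_code)
```

`find_spec` locates a module without importing it, so the check stays fast even for heavy packages.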

Detection presets

Each preset selects a VisionServeX D-FINE model that detects the COCO-80 classes. The timings below were benchmarked on an RTX 5080:

| Preset | Model | Approx. ms/frame | Notes |
| --- | --- | --- | --- |
| fast | dfine-n | 17–21 | Speed-first; good accuracy |
| balanced | dfine-s | 17–27 | Default; best all-round |
| quality | dfine-m | 21–32 | Highest COCO accuracy |
| quality+ | dfine-l | higher | Maximum accuracy, slowest |

bash
ai-video-analyzer analyze video.mp4 --preset fast
ai-video-analyzer analyze video.mp4 --preset quality

# Override with a specific model ID
ai-video-analyzer analyze video.mp4 --model dfine-s

# List all available models
ai-video-analyzer list-models
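
The preset table and the --model override can be read as a simple lookup with an explicit override. The sketch below encodes that reading; `PRESET_MODELS` and `resolve_model` are hypothetical names, and the assumption that an explicit model ID takes precedence over the preset follows from the "Override" comment above rather than from the tool's source.

```python
# Preset-to-model mapping copied from the table above.
PRESET_MODELS = {
    "fast": "dfine-n",
    "balanced": "dfine-s",
    "quality": "dfine-m",
    "quality+": "dfine-l",
}


def resolve_model(preset="balanced", model=None):
    """Pick a model ID: an explicit `model` wins, otherwise use the preset."""
    if model is not None:
        return model
    try:
        return PRESET_MODELS[preset]
    except KeyError:
        choices = ", ".join(sorted(PRESET_MODELS))
        raise ValueError(f"unknown preset {preset!r}; choose from: {choices}") from None


print(resolve_model("fast"))                      # dfine-n
print(resolve_model("quality", model="dfine-s"))  # dfine-s
```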

Optional full pipeline

Install optional dependencies to enable captioning, transcription, and audio events:

bash
# Full pipeline (all optional stages)
python -m pip install -r pip_requirements.txt

# ffmpeg — required for audio stages (Ubuntu/Debian)
sudo apt install ffmpeg

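# Ollama — for local LLM summarization
curl -fsSL https://ollama.com/install.sh | sh
# `ollama serve` runs in the foreground: use a separate terminal,
# or skip it if the Ollama service is already running
ollama serve
ollama pull phi4:latest

bash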
# Full analysis with Ollama summarization
ai-video-analyzer analyze video.mp4 \
    --preset balanced \
    --whisper-model base \
    --ollama-model phi4:latest \
    --output-dir ./results
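
When analyzing many files, it can be convenient to assemble the same command from Python. The sketch below builds the argument list using only the flags shown above; `build_analyze_command` is a hypothetical helper, and it assumes the ai-video-analyzer executable is on PATH when the command is eventually run.

```python
import shlex


def build_analyze_command(video, preset="balanced", whisper_model=None,
                          ollama_model=None, output_dir=None):
    """Build the argument list for one `ai-video-analyzer analyze` call."""
    cmd = ["ai-video-analyzer", "analyze", str(video), "--preset", preset]
    # Optional stages are only requested when a value is given.
    if whisper_model:
        cmd += ["--whisper-model", whisper_model]
    if ollama_model:
        cmd += ["--ollama-model", ollama_model]
    if output_dir:
        cmd += ["--output-dir", str(output_dir)]
    return cmd


cmd = build_analyze_command(
    "video.mp4",
    whisper_model="base",
    ollama_model="phi4:latest",
    output_dir="./results",
)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.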

Output files

| File | Contents |
| --- | --- |
| <video>_analysis.json | Full structured data: detections, captions, transcript, timings, preset, limitations |
| <video>_analysis.md | Human-readable Markdown report with tables |
| report.txt | Plain-text legacy report |
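
A downstream script typically needs to locate these files and inspect the JSON. The sketch below is a minimal example under two assumptions that this page does not confirm: that `<video>` is the input file name without its extension, and that all three files are written to the directory given by --output-dir. Because the JSON schema is not documented here, the reader only lists top-level keys instead of assuming field names. Both function names are hypothetical.

```python
import json
from pathlib import Path


def expected_outputs(video, output_dir):
    """Return the report paths expected for `video` (naming is an assumption)."""
    stem = Path(video).stem  # "video.mp4" -> "video"
    out = Path(output_dir)
    return {
        "json": out / f"{stem}_analysis.json",
        "markdown": out / f"{stem}_analysis.md",
        "text": out / "report.txt",
    }


def top_level_keys(json_path):
    """List the top-level keys of a JSON report without assuming its schema."""
    with open(json_path, encoding="utf-8") as fh:
        data = json.load(fh)
    return sorted(data) if isinstance(data, dict) else []


paths = expected_outputs("/path/to/video.mp4", "./results")
print(paths["json"].name)      # video_analysis.json
print(paths["markdown"].name)  # video_analysis.md
```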

Benchmarking and diagnostics

bash
# Single preset benchmark
ai-video-analyzer benchmark "/path/to/video.mp4" --preset balanced

# Compare fast / balanced / quality side-by-side
ai-video-analyzer benchmark "/path/to/video.mp4" --compare

Example benchmark output (RTX 5080, 8.2 s clip, 8 frames selected):

text
  Video duration     : 8.2s
  Frames selected    : 8  (sampling: 0.08s)
  Model load+warmup  : 0.52s
  Detection runtime  : 0.15s  (18.6ms/frame, 53.8fps)
  Total detections   : 48
  Top labels         : dog(9), person(7)
  Model              : dfine-s

Real-world benchmark results are in reports/benchmarks/ in the repository. These numbers reflect a specific GPU and clip; actual results vary by hardware and video content.
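
The per-frame latency and throughput figures in the example output are two views of the same number: fps is 1000 divided by ms/frame. The short calculation below checks this for the values shown. Note that dividing the displayed 0.15 s by 8 frames gives 18.75 ms rather than 18.6 ms, presumably because the displayed runtime is rounded; that explanation is an assumption. The helper names are hypothetical.

```python
def ms_per_frame(runtime_s, frames):
    """Average detection latency per frame, in milliseconds."""
    return runtime_s / frames * 1000.0


def fps_from_latency(latency_ms):
    """Throughput in frames per second for a given per-frame latency."""
    return 1000.0 / latency_ms


print(round(fps_from_latency(18.6), 1))  # 53.8
print(round(ms_per_frame(0.15, 8), 2))   # 18.75
```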

Optional dependencies

| Dependency | Purpose | How to install |
| --- | --- | --- |
| VisionServeX (required) | D-FINE object detection | pip install -e ".[vision]" |
| Whisper | Speech transcription | pip install -r pip_requirements.txt |
| BLIP | Scene captioning | pip install -r pip_requirements.txt |
| PANNs CNN14 | Audio event detection | pip install -r pip_requirements.txt |
| ffmpeg | Audio extraction (required for audio stages) | sudo apt install ffmpeg |
| Ollama | Local LLM summarization | ollama.com/install.sh |

Limitations and honest boundaries

COCO-80 class set: D-FINE detects only the 80 object classes of COCO. These do not include fire, smoke, or weather classes, so on such frames D-FINE returns approximate visual matches from the classes it does know (e.g. food textures for fire/smoke frames). The BLIP captioning stage can describe such scenes in natural language, but this is not equivalent to specialized fire or smoke detection.

| Claim | Status |
| --- | --- |
| Fire / smoke / weather detection | Not supported as object classes (COCO-80 limitation) |
| Human-level video understanding | Not claimed; output is structured detection + captioning data |
| Real-time performance | Not benchmarked for streaming; designed for offline file analysis |
| Medical / security / compliance certification | No certification; research/development tool |
| Guaranteed zero false positives | Not claimed; detection accuracy depends on model and scene |
| Cloud deployment | Not supported; offline-only by design |
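
Before relying on detection for a particular object, it helps to check whether that object is a COCO-80 class at all. The sketch below does this against the commonly used COCO label names; the exact label strings returned by VisionServeX are an assumption and may differ (for example in spelling or spacing). `unsupported_labels` is a hypothetical helper.

```python
# The 80 COCO detection classes, using the commonly published label names.
COCO_80 = frozenset("""
person, bicycle, car, motorcycle, airplane, bus, train, truck, boat,
traffic light, fire hydrant, stop sign, parking meter, bench, bird, cat, dog,
horse, sheep, cow, elephant, bear, zebra, giraffe, backpack, umbrella, handbag,
tie, suitcase, frisbee, skis, snowboard, sports ball, kite, baseball bat,
baseball glove, skateboard, surfboard, tennis racket, bottle, wine glass, cup,
fork, knife, spoon, bowl, banana, apple, sandwich, orange, broccoli, carrot,
hot dog, pizza, donut, cake, chair, couch, potted plant, bed, dining table,
toilet, tv, laptop, mouse, remote, keyboard, cell phone, microwave, oven,
toaster, sink, refrigerator, book, clock, vase, scissors, teddy bear,
hair drier, toothbrush
""".replace("\n", " ").split(","))
COCO_80 = frozenset(label.strip() for label in COCO_80)


def unsupported_labels(requested):
    """Return the requested labels that are not COCO-80 classes."""
    return [label for label in requested if label.lower() not in COCO_80]


print(unsupported_labels(["dog", "fire", "smoke", "fire hydrant"]))
# ['fire', 'smoke']
```

Note that "fire hydrant" is a COCO class while "fire" is not, which is why a detection containing the word "fire" does not indicate fire detection.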

Relationship with VisionServeX

AI-Powered Video Analyzer uses VisionServeX as its primary detection backend. VisionServeX is a separate open-source project that provides a local-first Python framework for serving modern computer vision models — including D-FINE, SAM, DINOv2, RF-DETR, and others — through a unified Python API and HTTP gateway.

VisionServeX is licensed under Apache-2.0. AI-Powered Video Analyzer is MIT. Both are independent packages; VisionServeX has many uses beyond video analysis, and AI-Powered Video Analyzer depends on VisionServeX only for its detection stage.

A legacy YOLO-based detection backend (--backend legacy_yolo) is available via the [legacy-yolo] extra, but it is not recommended for new use. VisionServeX D-FINE is the default and supported backend.

Links and credits

Developed by Arash Sajjadi, University of Saskatchewan. D-FINE reference: Peng et al. (2024), arXiv:2410.13842.
