VANTAGE-Bench

Video ANalysis Tasks Across Generalized Environments

Clemson University, School of Computing

vantage.bench.competition@gmail.com

A multi-task benchmark for evaluating Vision-Language Models on real-world fixed-camera footage across Warehouse, Transportation, and Smart Spaces, spanning Spatial, Spatio-Temporal, Temporal, and Semantic understanding.

News

[2026-06-01] VANTAGE-Bench featured in Jensen Huang's keynote at NVIDIA GTC Taipei / Computex 2026; Cosmos 3 named top open-weight model on VANTAGE-Bench for fixed-camera vision understanding (see the NVIDIA blog post)
[2026-05-27] Leaderboard live: zero-shot rankings published across all four reasoning pillars
[2026-05-27] Evaluation harness released: clone and run on GitHub
[2026-04-24] VANTAGE-Bench dataset released on Hugging Face

Introduction

We introduce VANTAGE-Bench, a benchmark for evaluating vision-language models on fixed-camera operational video. Instead of internet video, egocentric clips, or broadcast footage, it focuses on real-world scenes from warehouses, intersections, and public spaces where the camera remains fixed and the model must reason from a stable vantage point.

VANTAGE-Bench contains 35,027 expert-curated annotations across 3,346 media assets. It spans three operational domains and four reasoning pillars across both image and video. The benchmark stands out for its domain relevance, modality breadth, task diversity, and evaluation novelty, culminating in the first quantitative single-object tracking benchmark for VLMs in fixed-camera operational footage.

Figure: Task taxonomy and distribution of VANTAGE-Bench across three operational domains and four reasoning pillars.

The design of VANTAGE-Bench is driven by a disconnect between how VLMs are currently evaluated and how operational video systems actually work. Existing video benchmarks rely on a cinematic prior, with dynamic, human-centric framing, and reduce evaluation to a single format, typically multiple choice. VANTAGE-Bench addresses both limitations. It uses footage from fixed-infrastructure cameras, which removes internet-video priors, and it extends evaluation beyond multiple choice to include generative dense captioning, precise coordinate prediction, and continuous spatio-temporal tracking. These tasks cannot be solved by selecting from a predefined set of answers.

Leaderboard

Zero-shot evaluation of frontier and open-weight models across all four reasoning pillars. All models were evaluated under identical conditions: greedy decoding, no chain-of-thought.

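Task columns by pillar: Spatial (Obj Loc, Ref Exp, Pointing) · Spatio-Temporal (SOT) · Temporal (Temp Loc, DVC) · Semantic (Event Ver, VQA)

| # | Model | Provider · access | Status | Overall | Obj Loc | Ref Exp | Pointing | SOT | Temp Loc | DVC | Event Ver | VQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 | Cosmos3-Super (new) | NVIDIA · open-weight | ✓ verified | 63.01 | 86.97 | 76.23 | 72.94 | 62.25 | 51.90 | 29.54 | 71.28 | 69.46 |
| 🥈 | Gemini 3.1 Pro | Google · proprietary | ✓ verified | 61.92 | 76.15 | 67.17 | 71.84 | 64.35 | 45.69 | 35.78 | 69.83 | 71.88 |
| 🥉 | Cosmos3-Nano (new) | NVIDIA · open-weight | ✓ verified | 60.67 | 74.11 | 75.57 | 74.83 | 59.21 | 48.04 | 31.42 | 68.88 | 68.95 |
| 4 | Qwen3-VL-32B-Instruct (32B) | Alibaba · open-weight | ✓ verified | 55.05 | 64.33 | 72.39 | 75.62 | 44.19 | 49.71 | 29.40 | 60.04 | 71.30 |
| 5 | Cosmos-Reason2 (8B) | NVIDIA · open-weight | ✓ verified | 54.18 | 83.88 | 66.88 | 68.60 | 37.69 | 47.30 | 32.50 | 64.09 | 67.95 |
| 6 | Gemini 3.1 Flash Lite | Google · proprietary | ✓ verified | 51.34 | 69.88 | 52.62 | 63.68 | 46.68 | 37.72 | 32.62 | 54.83 | 68.03 |
| 7 | Qwen3-VL-8B-Instruct (8B) | Alibaba · open-weight | ✓ verified | 50.07 | 59.93 | 73.29 | 68.56 | 33.12 | 44.32 | 29.64 | 59.39 | 66.44 |
| 8 | Qwen3.5-27B (27B) | Alibaba · open-weight | ✓ verified | 49.26 | 85.46 | 76.35 | 75.32 | 24.40 | 36.88 | 26.97 | 55.34 | 67.95 |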

✓ verified = independently confirmed  ·  Updated 2026-05-29
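
The page does not state how the Overall column is aggregated. The published rows are, however, consistent (to within rounding) with an unweighted mean of four pillar scores, where each pillar score is the mean of its task scores. The sketch below checks that reading against two rows. The function name and dictionary layout are illustrative and are not taken from the official harness.

```python
from statistics import mean

# Task columns grouped by pillar, as laid out in the leaderboard header.
PILLARS = {
    "Spatial": ["Obj Loc", "Ref Exp", "Pointing"],
    "Spatio-Temporal": ["SOT"],
    "Temporal": ["Temp Loc", "DVC"],
    "Semantic": ["Event Ver", "VQA"],
}

def overall_score(task_scores):
    """Assumed aggregation: mean over pillars of the mean task score per pillar."""
    pillar_means = [mean(task_scores[t] for t in tasks) for tasks in PILLARS.values()]
    return mean(pillar_means)

# Task scores copied from the leaderboard rows above.
cosmos3_super = {
    "Obj Loc": 86.97, "Ref Exp": 76.23, "Pointing": 72.94, "SOT": 62.25,
    "Temp Loc": 51.90, "DVC": 29.54, "Event Ver": 71.28, "VQA": 69.46,
}
gemini_31_pro = {
    "Obj Loc": 76.15, "Ref Exp": 67.17, "Pointing": 71.84, "SOT": 64.35,
    "Temp Loc": 45.69, "DVC": 35.78, "Event Ver": 69.83, "VQA": 71.88,
}

print(round(overall_score(cosmos3_super), 2))  # → 63.01, matching the published Overall
print(f"{overall_score(gemini_31_pro):.3f}")   # published Overall is 61.92
```

Under this reading, a single-task pillar such as Spatio-Temporal carries the same weight as the three-task Spatial pillar, so a plain average of the eight task columns gives a different number.
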

Submit your predictions →

Infrastructure AI

Infrastructure AI is a sub-domain of Physical AI focused on fixed-infrastructure cameras such as CCTV networks, elevated sensors, and wide-angle lenses. These systems are deployed for safety monitoring, access control, traffic understanding, and operational logging. Unlike Embodied AI, which centers on moving agents navigating environments, Infrastructure AI operates from a persistent, stationary vantage point.

VANTAGE-Bench is purpose-built for this setting. Its footage, drawn from warehouses, roads, and public spaces, comes from cameras that remain fixed. Models must reason from a wide-area, static perspective rather than from edited, human-centered video.

Frontier models are largely trained on internet-crawled data and are therefore biased toward standard photographic perspectives. Under fixed-camera conditions, this prior breaks down, and models must instead rely on spatio-temporal reasoning. A model can achieve state-of-the-art performance on internet video while still exhibiting dangerous performance deficits in the environments that matter most.

No motion cues

Fixed cameras produce no camera-induced optical flow: the background stays static and only objects in the scene move. Models trained on dynamic video cannot rely on camera motion to localize objects or events. They must reason spatially from a largely static context.

Dense, multi-instance scenes

Distinguishing between dozens of identical pallets, vehicles, or pedestrians from an elevated oblique viewpoint requires precise spatial reasoning that internet-video training does not provide.

Sparse, safety-critical events

Events of interest occupy a tiny fraction of the timeline. Models must search through extended periods of inactivity to localize brief, high-stakes moments.

No egocentric framing

Without human-centric composition, standard photographic priors fail. Models must reason geometrically from wide-angle, elevated perspectives with no compositional guidance.

Operational Domains

VANTAGE-Bench evaluates models across three structurally distinct deployment environments. Each domain requires different reasoning capabilities and exposes different failure modes in current VLMs.

Warehouse

Dense logistics environments with structured layouts, repeated objects, and human-robot interaction. Footage from elevated fixed cameras captures forklift operations, pallet movements, worker activity, and access control.

Forklift tracking · Pallet localization · Worker detection · Access control · Robot profiles

Transportation

Roadside and intersection monitoring with multi-vehicle scenes. Fixed roadside cameras capture traffic flow, pedestrian crossings, and safety-critical events.

Vehicle detection · Collision verification · Traffic flow · Pedestrian crossing

Smart Spaces

Unstructured public and semi-public environments with ambiguous human behavior. Models must reason about access control, tailgating, and crowd dynamics without structured interaction cues.

Crowd safety · Access control · Activity detection · Tailgating

Dataset

Available on Hugging Face: nvidia/PhysicalAI-VANTAGE-Bench

VANTAGE-Bench was built around footage that is genuinely hard to source: recordings from fixed-infrastructure cameras in real operational environments. Annotations were produced by trained professionals using domain-specific guidelines rather than crowdsourcing, and each annotation was reviewed by a secondary expert before inclusion.

Supplemental sources

Three targeted external sources supplement the core footage:

  • RefDrone: aerial drone imagery used for Referring Expressions, chosen for its elevated oblique perspective, which mirrors fixed-camera conditions
  • PhysicalAI-SmartSpaces: multi-camera warehouse sequences used for Single Object Tracking
  • NVIDIA Omniverse DRIVE Sim: high-fidelity synthetic footage covering safety-critical collision scenarios absent from real-world data, used in approximately 20% of VQA and Temporal splits

| Task | Pillar | Annotations | Annotation type | Media | Modality |
|---|---|---|---|---|---|
| Event Verification | Semantic | 163 | Binary event labels | 163 videos | Video |
| Video QA | Semantic | 1,195 | 4-choice MCQ questions | 282 videos | Video |
| Referring Expressions | Spatial | 3,276 | Expression–box pairs | 1,503 images | Image |
| Spatial Pointing | Spatial | 1,005 | 4-choice coordinate MCQ | 361 images | Image |
| Object Localization | Spatial | 27,404 | Bounding boxes | 628 images | Image |
| Temporal Localization | Temporal | 1,067 | Temporal segment labels | 203 videos | Video |
| Dense Video Captioning | Temporal | 717 | Timestamped event captions | 104 videos | Video |
| Single Object Tracking | Spatio-Temporal | 200 | Object trajectories (8–32 frames) | 102 videos | Interleaved |
| Total | | 35,027 | | 3,346 | Image + Video |

Privacy & ethics

All footage was collected in compliance with applicable privacy regulations. 70% of recordings were obtained with explicit informed consent from individuals; the remainder was captured in spaces with posted notice of camera operation, where presence constitutes acknowledgment of monitoring. All assets underwent automated PII obfuscation followed by manual human-in-the-loop verification before release. The dataset is released under the NVIDIA Evaluation Data License, which restricts use to evaluation and benchmarking and strictly prohibits biometric identification and demographic profiling.

Task Taxonomy & Metrics

VANTAGE-Bench is organized around four reasoning pillars. Each pillar targets a capability that current VLMs handle well in internet-video settings, but struggle with when the camera stops moving.

Pillar I

Semantic Understanding

Can the model understand what happened and why, not just what is visible?

Standard VLM benchmarks test whether a model can describe a scene. VANTAGE-Bench asks something harder: whether a model can reason causally about operational events in raw, unorchestrated footage — without the narrative structure that edited video provides. This means verifying that a tailgating incident actually occurred, or answering multi-step logical questions about untrimmed surveillance footage where the event of interest may occupy only seconds of a long recording.

| Task | Metric | Modality |
|---|---|---|
| Event Verification | Macro F1 | Video |
| Video QA | Top-1 Accuracy | Video |

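
As a rough illustration of the two Semantic metrics, the sketch below computes Macro F1 over binary event labels and Top-1 Accuracy over multiple-choice answers. It uses the standard textbook definitions; the official harness may differ in details such as label parsing, and the sample labels are invented for the example.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

def top1_accuracy(answers, predictions):
    """Fraction of questions where the predicted option equals the answer key."""
    return sum(a == p for a, p in zip(answers, predictions)) / len(answers)

# Invented example: did the event (e.g. tailgating) occur in each clip?
event_truth = [True, True, False, False, False]
event_preds = [True, False, False, False, True]
event_macro_f1 = macro_f1(event_truth, event_preds)

# Invented example: 4-choice Video QA answers.
vqa_truth = ["A", "C", "B", "D"]
vqa_preds = ["A", "C", "D", "D"]
vqa_accuracy = top1_accuracy(vqa_truth, vqa_preds)

print(f"Macro F1: {event_macro_f1:.4f}, Top-1 accuracy: {vqa_accuracy:.2f}")
```

Macro averaging weights the "event occurred" and "event did not occur" classes equally, which matters when one class is much rarer than the other.
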
Pillar II

Spatial Understanding

Can the model localize the right object in a scene where everything looks the same?

Internet-video benchmarks evaluate spatial reasoning on well-lit, centered, distinct objects. Fixed-camera footage presents the opposite: dozens of identical pallets in a warehouse, a row of identical vehicles at an intersection, pedestrians in uniform from an overhead angle.

VANTAGE-Bench forces models to perform dense semantic disambiguation: grounding language to the correct object among many near-identical candidates, selecting precise coordinates, and detecting every instance of a class at once.

| Task | Metric | Modality |
|---|---|---|
| Referring Expressions | mIoU | Image |
| Spatial Pointing | Top-1 Accuracy | Image |
| Object Localization | F1@0.5 | Image |

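
The box-based Spatial metrics can be sketched as follows. This is a minimal illustration, assuming boxes are given as (x1, y1, x2, y2) corner coordinates and that F1@0.5 uses greedy one-to-one matching between predicted and ground-truth boxes at an IoU threshold of 0.5. Neither the box format nor the matching rule is specified on this page, so treat both as assumptions.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(pred_boxes, gt_boxes):
    """mIoU for tasks with exactly one predicted box per ground-truth box."""
    return sum(box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(gt_boxes)

def f1_at_iou(pred_boxes, gt_boxes, thresh=0.5):
    """F1 with greedy one-to-one matching (assumed rule), highest IoU first."""
    pairs = sorted(
        ((box_iou(p, g), i, j)
         for i, p in enumerate(pred_boxes)
         for j, g in enumerate(gt_boxes)),
        reverse=True,
    )
    used_pred, used_gt, tp = set(), set(), 0
    for iou, i, j in pairs:
        if iou < thresh:
            break  # remaining pairs are all below the threshold
        if i not in used_pred and j not in used_gt:
            used_pred.add(i)
            used_gt.add(j)
            tp += 1
    fp = len(pred_boxes) - tp  # predictions with no matching object
    fn = len(gt_boxes) - tp    # objects the model missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # convention: empty vs. empty counts as perfect

# Invented example: two pallets, three predictions (one spurious).
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(1, 1, 10, 10), (50, 50, 60, 60), (20, 20, 30, 28)]
localization_f1 = f1_at_iou(pred_boxes, gt_boxes)

# Invented example: one predicted box per referring expression.
ref_miou = mean_iou([(1, 1, 10, 10), (20, 20, 30, 28)], gt_boxes)
print(f"F1@0.5: {localization_f1:.2f}, mIoU: {ref_miou:.3f}")
```

In dense scenes with many near-identical objects, the one-to-one constraint is what penalizes a model for placing several boxes on the same pallet or vehicle.
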
Pillar III

Temporal Understanding

Can the model find when something happened, not just whether it happened?

Existing temporal benchmarks use scripted, continuous human actions where the event dominates the timeline. Operational video is characterized by long quiescent periods; a warehouse camera may run for hours before a safety violation occurs.

VANTAGE-Bench requires models to search through this inactivity, predict exact event boundaries, and autonomously caption multiple events in chronological order with correct timing.

| Task | Metric | Modality |
|---|---|---|
| Temporal Localization | mIoU | Video |
| Dense Video Captioning | SODA_c | Video |

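
For Temporal Localization, the mIoU metric is the one-dimensional analogue of box IoU. The sketch below assumes each prediction and ground truth is a single (start, end) segment in seconds; the segment values are invented, and SODA_c for Dense Video Captioning is not shown because it additionally requires caption similarity scoring.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_temporal_iou(pred_segments, gt_segments):
    """Average temporal IoU over all queries."""
    pairs = zip(pred_segments, gt_segments)
    return sum(temporal_iou(p, g) for p, g in pairs) / len(gt_segments)

# Invented example: two brief events in long recordings.
gt_segments = [(120.0, 128.0), (300.0, 305.0)]
pred_segments = [(122.0, 130.0), (10.0, 20.0)]  # second prediction misses entirely

temporal_miou = mean_temporal_iou(pred_segments, gt_segments)
print(f"mIoU: {100 * temporal_miou:.2f}")
```

Because events are short relative to the recording, a prediction that lands in the wrong part of the timeline scores zero, and a boundary error of a few seconds removes a large share of the overlap.
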
Pillar IV

Spatio-Temporal Understanding

Novel

Can the model follow an object through time while preserving precise spatial grounding?

This is the hardest pillar and the one that exposes the deepest gap in current VLMs. Spatial reasoning and temporal reasoning are evaluated separately in every existing benchmark, but operational AI requires both simultaneously. VANTAGE-Bench introduces the first quantitative VLM tracking benchmark, filling a gap that no prior evaluation suite has addressed.

A model must not only know where an object is in a single frame, but maintain that spatial identity as the object moves, is partially occluded, or merges with similar-looking objects across dozens of frames.

| Task | Metric | Modality |
|---|---|---|
| Single Object Tracking | Success AUC | Video (interleaved frames) |

SOT presents frames as interleaved image tokens in a single context window rather than as a continuous video stream. Sequences contain 8, 16, or 32 frames.
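
Success AUC is a standard metric from the visual tracking literature: for a range of IoU thresholds, compute the fraction of frames whose predicted box overlaps the ground truth by more than the threshold, then average those fractions. The sketch below follows that common definition with 21 evenly spaced thresholds and a strict comparison. The threshold grid, the comparison rule, and the (x1, y1, x2, y2) box format are assumptions, not details confirmed by this page.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred_track, gt_track, num_thresholds=21):
    """Area under the success curve, scaled to 0-100."""
    ious = [box_iou(p, g) for p, g in zip(pred_track, gt_track)]
    thresholds = [k / (num_thresholds - 1) for k in range(num_thresholds)]
    # Success rate at each threshold: share of frames with IoU strictly above it.
    rates = [sum(iou > t for iou in ious) / len(ious) for t in thresholds]
    return 100.0 * sum(rates) / len(rates)

# Invented 8-frame sequence: the target stays put while the prediction drifts right.
gt_track = [(0, 0, 10, 10)] * 8
drifting_track = [(k, 0, 10 + k, 10) for k in range(8)]
lost_track = [(100, 100, 110, 110)] * 8

drift_score = success_auc(drifting_track, gt_track)
perfect_score = success_auc(gt_track, gt_track)
lost_score = success_auc(lost_track, gt_track)
print(f"drifting: {drift_score:.2f}, perfect: {perfect_score:.2f}, lost: {lost_score:.2f}")
```

With a strict comparison, even a perfect track fails the final threshold of 1.0, so the maximum under these assumptions is slightly below 100.
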

Benchmark Comparison

How does VANTAGE-Bench fit into the existing evaluation landscape?

Most VLM benchmarks focus on a single reasoning dimension and typically cover only one modality. VANTAGE-Bench is the first benchmark to jointly evaluate all four reasoning dimensions across both image and video in a single suite.

Table 1 — Scope comparison

What this shows: how VANTAGE-Bench compares to the benchmarks it most directly relates to in terms of scale, modality coverage, and task diversity.

| Benchmark | Modality | # Media | # Annot. | Reasoning coverage | Annotation source |
|---|---|---|---|---|---|
| VideoMME | Video | 900 | 2,700 | Semantic only | Human |
| BLINK | Image | 3,683 | 1,906 | Spatial only | Human, Existing |
| RefCOCO avg. | Image | 3,982 | 30,969 | Spatial only | Human, Existing |
| ODinW-13 | Image | 4,608 | 10,966 | Spatial only | Human, Existing |
| Charades-STA | Video | 1,334 | 3,720 | Temporal only | Human, PL |
| ActivityNet Cap. | Video | 5,044 | 17,750 | Temporal only | Human |
| VANTAGE-Bench | Image + Video | 3,346 | 35,027 | All four pillars | Human + Synthetic + PL |

PL = programmatically generated labels from human-verified annotations.

Table 1: VANTAGE-Bench is the only benchmark spanning all four reasoning dimensions across both image and video modalities.

Does VANTAGE-Bench actually measure something different?

Yes, and by a significant margin. The table below compares published scores for a single model (Qwen3-VL-8B) on standard consumer-centric benchmarks against its scores on the equivalent VANTAGE-Bench tasks.

Table 2 — The performance gap

Qwen3-VL-8B zero-shot scores on the standard reference benchmark for each task vs. its score on the equivalent VANTAGE-Bench task. A negative gap means the model performs worse on VANTAGE-Bench. All scores are scaled 0–100.

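| Pillar | Task | Reference benchmark | Ref. score | VANTAGE score | Gap (∆) |
|---|---|---|---|---|---|
| Semantic Understanding | VQA | VideoMME | 71.40 | 65.47 | -5.93 |
| Semantic Understanding | Event Verification | MLVU | 78.10 | 48.14 | -29.96 |
| Spatial Understanding | 2D Pointing | BLINK | 69.10 | 45.54 | -23.56 |
| Spatial Understanding | Referring Expressions | RefCOCO | 89.10 | 71.55 | -17.55 |
| Spatial Understanding | Object Localization | ODinW-13 | 44.70 | 37.72 | -6.98 |
| Temporal Understanding | Temporal Localization | Charades-STA | 56.00 | 41.35 | -14.65 |
| Spatio-Temporal Understanding | Single Object Tracking | No prior benchmark exists | n/a | 31.44 | n/a |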

Table 2: Performance gap for Qwen3-VL-8B across tasks. Negative values indicate degradation on VANTAGE-Bench relative to the standard reference benchmark for that task. The largest drops occur in Event Verification (−29.96) and 2D Pointing (−23.56), consistent with the claim that fixed-camera footage breaks priors that models rely on in consumer-centric settings. Single Object Tracking has no reference score because no prior VLM tracking benchmark exists.

Note: Qwen3-VL-8B scored 31.44 AUC on SOT; the best SOT score on the leaderboard is Gemini 3.1 Pro at 64.35 AUC.
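
The gap column is simply the VANTAGE-Bench score minus the reference score. A short sketch that reproduces it from the values in Table 2 (the dictionary layout is illustrative):

```python
# (reference score, VANTAGE-Bench score) per task, copied from Table 2.
scores = {
    "VQA": (71.40, 65.47),
    "Event Verification": (78.10, 48.14),
    "2D Pointing": (69.10, 45.54),
    "Referring Expressions": (89.10, 71.55),
    "Object Localization": (44.70, 37.72),
    "Temporal Localization": (56.00, 41.35),
}

# Gap = VANTAGE score - reference score; negative means worse on VANTAGE-Bench.
gaps = {task: round(vantage - ref, 2) for task, (ref, vantage) in scores.items()}

largest_drop = min(gaps, key=gaps.get)
print(largest_drop, gaps[largest_drop])  # → Event Verification -29.96
```

Single Object Tracking is left out because it has no reference score to subtract.
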

Citation

If you use VANTAGE-Bench, the leaderboard, or its evaluation resources in your research, please cite:

BibTeX
@misc{vantagebench2026,
  title        = {VANTAGE-Bench: A Benchmark for Vision-Language Models on Fixed-Camera Infrastructure AI},
  author       = {{VANTAGE-Bench Team}},
  year         = {2026},
  howpublished = {\url{https://github.com/Clemson-Capstone/VANTAGE-Bench}},
  note         = {Benchmark, dataset, evaluation framework, and public leaderboard. Leaderboard: https://huggingface.co/spaces/clemson-computing/VANTAGE-Bench-Leaderboard}
}

How to Evaluate

To evaluate your model on VANTAGE-Bench, use the official evaluation harness. The harness handles data loading, prompt formatting, and inference, and exports predictions in the required LLaVA submission format automatically.

Once inference is complete, archive your prediction files and submit through the submission portal. Our server scores predictions against held-out ground truth. Results are emailed to you after evaluation.

Full setup instructions, task-specific configurations, and format documentation are in the GitHub repository. Per-task prediction schemas are documented on the submission page.
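
The archiving step can be done with the Python standard library. The sketch below is an illustration only: it assumes the harness has written its prediction files into a single directory and that a zip archive is acceptable. The required archive format, the directory layout, and the file name used in the demo are all assumptions here; check the submission page for the actual requirements.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def archive_predictions(pred_dir, archive_stem):
    """Zip everything under pred_dir into <archive_stem>.zip and return its path."""
    pred_dir = Path(pred_dir)
    if not any(pred_dir.iterdir()):
        raise ValueError(f"no prediction files found in {pred_dir}")
    # Keep the archive outside pred_dir so it does not end up inside itself.
    return Path(shutil.make_archive(str(archive_stem), "zip", root_dir=pred_dir))

# Demo in a throwaway directory; "event_verification.json" is a hypothetical file name.
with tempfile.TemporaryDirectory() as tmp:
    pred_dir = Path(tmp) / "predictions"
    pred_dir.mkdir()
    (pred_dir / "event_verification.json").write_text("[]")

    archive_path = archive_predictions(pred_dir, Path(tmp) / "vantage_submission")
    with zipfile.ZipFile(archive_path) as zf:
        archived_names = zf.namelist()

print(archive_path.name, archived_names)
```

Listing the archive contents before uploading is a cheap way to confirm that every task's prediction file made it in.
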
