VANTAGE-Bench

VANTAGE-Bench is organized around four reasoning pillars. Each pillar targets a capability that current vision-language models (VLMs) handle well in internet-video settings but struggle with when the camera stops moving.

Task	Pillar	Annotations	Annotation type	Media	Modality
Event Verification	Semantic	163	Binary event labels	163 videos	Video
Video QA	Semantic	1,195	4-choice MCQ questions	282 videos	Video
Referring Expressions	Spatial	3,276	Expression–box pairs	1,503 images	Image
Spatial Pointing	Spatial	1,005	4-choice coordinate MCQ	361 images	Image
Object Localization	Spatial	27,404	Bounding boxes	628 images	Image
Temporal Localization	Temporal	1,067	Temporal segment labels	203 videos	Video
Dense Video Captioning	Temporal	717	Timestamped event captions	104 videos	Video
Single Object Tracking	Spatio-Temporal	200	Object trajectories (8–32 frames)	102 videos	Interleaved
Total	—	35,027	—	3,346	Image + Video

Pillar I

Semantic Understanding

Can the model understand what happened and why, not just what is visible?

Standard VLM benchmarks test whether a model can describe a scene. VANTAGE-Bench asks something harder: whether a model can reason causally about operational events in raw, unorchestrated footage — without the narrative structure that edited video provides. This means verifying that a tailgating incident actually occurred, or answering multi-step logical questions about untrimmed surveillance footage where the event of interest may occupy only seconds of a long recording.

Task	Metric	Modality
Event Verification	Macro F1	Video
Video QA	Top-1 Accuracy	Video
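The two semantic metrics can be sketched as follows. This is a minimal illustration, not the benchmark's official scorer: the function names, the 0/1 label encoding for Event Verification, and the letter encoding for Video QA answers are all assumptions made for the example.

```python
from typing import Hashable, Sequence


def class_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable], cls: Hashable) -> float:
    """F1 for a single class, treating `cls` as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable],
             classes: Sequence[Hashable] = (0, 1)) -> float:
    """Unweighted mean of per-class F1 (assumed 0/1 labels for binary events)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(class_f1(y_true, y_pred, c) for c in classes) / len(classes)


def top1_accuracy(answers: Sequence[Hashable], predictions: Sequence[Hashable]) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    if len(answers) != len(predictions):
        raise ValueError("answers and predictions must have the same length")
    return sum(a == p for a, p in zip(answers, predictions)) / len(answers)


# Hypothetical toy data: one missed event out of four clips.
event_score = macro_f1([1, 1, 0, 0], [1, 0, 0, 0])          # 11/15, about 0.733
qa_score = top1_accuracy(["A", "C", "B", "D"], ["A", "C", "D", "D"])  # 0.75
```

Macro averaging weights the "event occurred" and "event did not occur" classes equally, so a model that always answers one way is penalized even when the labels are imbalanced.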

Pillar II

Spatial Understanding

Can the model localize the right object in a scene where everything looks the same?

Internet-video benchmarks evaluate spatial reasoning on well-lit, centered, distinct objects. Fixed-camera footage presents the opposite: dozens of identical pallets in a warehouse, a row of identical vehicles at an intersection, pedestrians in uniform from an overhead angle.

VANTAGE-Bench forces models to perform dense semantic disambiguation: grounding language to the correct object among many near-identical candidates, selecting precise coordinates, and detecting every instance of a class at once.

Task	Metric	Modality
Referring Expressions	mIoU	Image
Spatial Pointing	Top-1 Accuracy	Image
Object Localization	F1@0.5	Image
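A rough sketch of the box-based metrics in the table above. Several details are assumptions rather than facts from the benchmark: boxes are taken to be `(x1, y1, x2, y2)` in pixels, each referring expression has exactly one predicted and one ground-truth box, and detections are matched to ground truth greedily by descending IoU. The official scorer may match differently.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) format


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """mIoU over paired (prediction, ground truth) boxes."""
    if len(preds) != len(gts) or not gts:
        raise ValueError("need equally sized, non-empty box lists")
    return sum(box_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


def f1_at_iou(preds: Sequence[Box], gts: Sequence[Box], thr: float = 0.5) -> float:
    """F1 where a prediction is a true positive if it matches an unused
    ground-truth box with IoU >= thr (greedy one-to-one matching)."""
    pairs = sorted(
        ((box_iou(p, g), i, j) for i, p in enumerate(preds) for j, g in enumerate(gts)),
        reverse=True,
    )
    used_preds, used_gts = set(), set()
    tp = 0
    for iou, i, j in pairs:
        if iou < thr:
            break  # remaining pairs are all below the threshold
        if i in used_preds or j in used_gts:
            continue
        used_preds.add(i)
        used_gts.add(j)
        tp += 1
    fp, fn = len(preds) - tp, len(gts) - tp
    denom = 2 * tp + fp + fn
    # Empty image with no predictions counts as perfect here (a convention choice).
    return 2 * tp / denom if denom else 1.0
```

With two ground-truth boxes, two close predictions, and one stray prediction, `f1_at_iou` yields 2 true positives, 1 false positive, and 0 false negatives, which gives an F1 of 0.8.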

Pillar III

Temporal Understanding

Can the model find when something happened, not just whether it happened?

Existing temporal benchmarks use scripted, continuous human actions where the event dominates the timeline. Operational video is characterized by long quiescent periods; a warehouse camera may run for hours before a safety violation occurs.

VANTAGE-Bench requires models to search through this inactivity, predict exact event boundaries, and autonomously caption multiple events in chronological order with correct timing.

Task	Metric	Modality
Temporal Localization	mIoU	Video
Dense Video Captioning	SODA_c	Video
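For Temporal Localization, mIoU is the one-dimensional analogue of box IoU. The sketch below assumes one predicted `(start, end)` segment in seconds per ground-truth segment; the pairing and the units are illustrative assumptions. SODA_c is not covered here because it also involves caption matching.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # assumed (start_sec, end_sec)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Overlap of two time intervals divided by the span they jointly cover."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_miou(preds: Sequence[Segment], gts: Sequence[Segment]) -> float:
    """Mean temporal IoU over paired predictions and ground-truth segments."""
    if len(preds) != len(gts) or not gts:
        raise ValueError("need equally sized, non-empty segment lists")
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


# Hypothetical example: a 10 s event predicted 5 s late scores 5 / 15 = 1/3.
late_prediction = temporal_iou((15.0, 25.0), (10.0, 20.0))
```

Because the union grows with any extra predicted time, a model that returns the whole recording for a short event scores close to zero, which is what makes long quiescent footage hard under this metric.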

Pillar IV

Spatio-Temporal Understanding

Can the model follow an object through time while preserving precise spatial grounding?

This is the hardest pillar and the one that exposes the deepest gap in current VLMs. Spatial reasoning and temporal reasoning are evaluated separately in every existing benchmark, but operational AI requires both simultaneously. VANTAGE-Bench introduces the first quantitative VLM tracking benchmark, filling a gap that no prior evaluation suite has addressed.

A model must not only know where an object is in a single frame, but maintain that spatial identity as the object moves, is partially occluded, or merges with similar-looking objects across dozens of frames.

Task	Metric	Modality
Single Object Tracking	Success AUC	Video (interleaved frames)

Single Object Tracking (SOT) presents frames as interleaved image tokens in a single context window rather than as a continuous video stream. Each sequence contains 8, 16, or 32 frames.
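Success AUC is commonly computed from per-frame IoU between the predicted and ground-truth boxes: for each overlap threshold, take the fraction of frames whose IoU exceeds it, then average over thresholds. The sketch below follows that common recipe. The 21 evenly spaced thresholds and the strict `>` comparison are assumptions borrowed from standard tracking evaluation, not confirmed details of VANTAGE-Bench.

```python
from typing import Sequence


def success_auc(frame_ious: Sequence[float], num_thresholds: int = 21) -> float:
    """Area under the success curve for one tracking sequence.

    frame_ious: IoU between predicted and ground-truth box for each frame
                (8, 16, or 32 values for the sequence lengths described above).
    """
    if not frame_ious:
        raise ValueError("need at least one frame")
    if num_thresholds < 2:
        raise ValueError("need at least two thresholds")
    # Thresholds evenly spaced over [0, 1], e.g. 0.00, 0.05, ..., 1.00.
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    success_rates = [
        sum(iou > t for iou in frame_ious) / len(frame_ious) for t in thresholds
    ]
    # Mean success rate over the threshold grid approximates the AUC.
    return sum(success_rates) / len(success_rates)


# Hypothetical 8-frame sequence where the tracker drifts after an occlusion.
ious = [0.9, 0.85, 0.8, 0.4, 0.1, 0.0, 0.0, 0.0]
score = success_auc(ious)
```

Unlike a single IoU cutoff, the averaged curve rewards trackers that stay roughly on target even when the boxes are not tight, and it penalizes identity switches that drive IoU to zero for the rest of the sequence.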

Benchmark Comparison

Benchmark	Modality	# Media	# Annotations	Reasoning coverage	Annotation source
VideoMME	Video	900	2,700	Semantic only	Human
BLINK	Image	3,683	1,906	Spatial only	Human, Existing
RefCOCO avg.	Image	3,982	30,969	Spatial only	Human, Existing
ODinW-13	Image	4,608	10,966	Spatial only	Human, Existing
Charades-STA	Video	1,334	3,720	Temporal only	Human, PL
ActivityNet Captions	Video	5,044	17,750	Temporal only	Human
VANTAGE-Bench	Image + Video	3,346	35,027	All four pillars	Human, Synthetic, PL

Does VANTAGE-Bench actually measure something different?

Pillar	Task	Reference benchmark	Ref. score	VANTAGE score	Gap (∆)
Semantic	Video QA	VideoMME	71.40	65.47	-5.93
Semantic	Event Verification	MLVU	78.10	48.14	-29.96
Spatial	Spatial Pointing	BLINK	69.10	45.54	-23.56
Spatial	Referring Expressions	RefCOCO	89.10	71.55	-17.55
Spatial	Object Localization	ODinW-13	44.70	37.72	-6.98
Temporal	Temporal Localization	Charades-STA	56.00	41.35	-14.65
Spatio-Temporal	Single Object Tracking	No prior benchmark exists	n/a	31.44	n/a
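The Gap (∆) column is the VANTAGE score minus the reference score, so a negative value means the task is harder on VANTAGE-Bench. A short check of that arithmetic against the table rows is sketched below. The variable and function names are illustrative, and the table does not say which model produced these scores.

```python
# (reference benchmark, reference score, VANTAGE score, listed gap) from the table above.
ROWS = [
    ("VideoMME", 71.40, 65.47, -5.93),
    ("MLVU", 78.10, 48.14, -29.96),
    ("BLINK", 69.10, 45.54, -23.56),
    ("RefCOCO", 89.10, 71.55, -17.55),
    ("ODinW-13", 44.70, 37.72, -6.98),
    ("Charades-STA", 56.00, 41.35, -14.65),
]


def gap(ref_score: float, vantage_score: float) -> float:
    """Signed difference in points, rounded to the table's two decimals."""
    return round(vantage_score - ref_score, 2)


# Recompute each gap and compare it with the listed value.
recomputed = {name: gap(ref, vantage) for name, ref, vantage, _ in ROWS}
mismatches = [
    name for name, _, _, listed in ROWS if abs(recomputed[name] - listed) > 1e-9
]
```

Every recomputed gap agrees with the listed one, so `mismatches` is empty. Single Object Tracking is left out because it has no reference score to subtract.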

Leaderboard

#	Model	Overall	Spatial: Obj Loc	Spatial: Ref Exp	Spatial: Pointing	Spatio-Temporal: SOT	Temporal: Temp Loc	Temporal: DVC	Semantic: Event Ver	Semantic: VQA
🥇	Cosmos3-Super (new) · NVIDIA · open-weight · ✓ verified	63.01	86.97	76.23	72.94	62.25	51.90	29.54	71.28	69.46
🥈	Gemini 3.1 Pro · Google · proprietary · ✓ verified	61.91	76.15	67.17	71.84	64.35	45.69	35.78	69.83	71.88
🥉	Cosmos3-Nano (new) · NVIDIA · open-weight · ✓ verified	60.67	74.11	75.57	74.83	59.21	48.04	31.42	68.88	68.95
4	Qwen3-VL-32B-Instruct · 32B · Alibaba · open-weight · ✓ verified	55.05	64.33	72.39	75.62	44.19	49.71	29.40	60.04	71.30
5	Cosmos-Reason2-8B · 8B · NVIDIA · open-weight · ✓ verified	54.18	83.88	66.88	68.60	37.69	47.30	32.50	64.09	67.95
6	Gemini 3.1 Flash Lite · Google · proprietary · ✓ verified	51.34	69.88	52.62	63.68	46.68	37.72	32.62	54.83	68.03
7	Qwen3-VL-8B-Instruct · 8B · Alibaba · open-weight · ✓ verified	50.07	59.93	73.29	68.56	33.12	44.32	29.64	59.39	66.44
8	Qwen3.5-27B · 27B · Alibaba · open-weight · ✓ verified	49.25	85.46	76.35	75.32	24.40	36.88	26.97	55.34	67.95
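The page does not state how the Overall column is aggregated. For the rows checked here, the listed values are consistent with averaging task scores within each pillar and then averaging the four pillar means, so that each pillar counts equally regardless of how many tasks it has. The sketch below reproduces that calculation. The aggregation rule is an inference from the numbers, and the dictionary keys are illustrative.

```python
from typing import Dict

# Task columns grouped by pillar, following the leaderboard header.
PILLARS = {
    "Spatial": ("Obj Loc", "Ref Exp", "Pointing"),
    "Spatio-Temporal": ("SOT",),
    "Temporal": ("Temp Loc", "DVC"),
    "Semantic": ("Event Ver", "VQA"),
}


def overall_score(task_scores: Dict[str, float]) -> float:
    """Mean of per-pillar means (assumed aggregation, inferred from the table)."""
    pillar_means = [
        sum(task_scores[task] for task in tasks) / len(tasks)
        for tasks in PILLARS.values()
    ]
    return sum(pillar_means) / len(pillar_means)


# Scores copied from the leaderboard rows.
cosmos3_super = {
    "Obj Loc": 86.97, "Ref Exp": 76.23, "Pointing": 72.94, "SOT": 62.25,
    "Temp Loc": 51.90, "DVC": 29.54, "Event Ver": 71.28, "VQA": 69.46,
}
qwen35_27b = {
    "Obj Loc": 85.46, "Ref Exp": 76.35, "Pointing": 75.32, "SOT": 24.40,
    "Temp Loc": 36.88, "DVC": 26.97, "Event Ver": 55.34, "VQA": 67.95,
}

print(round(overall_score(cosmos3_super), 2))  # 63.01, matching the table
print(round(overall_score(qwen35_27b), 2))     # 49.25, matching the table
```

A plain mean over all eight tasks would give Cosmos3-Super about 65.07 rather than 63.01, which is why the pillar-balanced reading fits better. Under this rule the single SOT task carries a quarter of the Overall score, which explains why Qwen3.5-27B ranks last despite strong spatial results.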