OVO-S-Bench

A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Yifei Li1,2†, Pengyiang Liu3†, Yuhang Zang2*, Zhongyue Shi3, Qi Fu3, Hongye Hao3, Jiwen Lu1
1Tsinghua University, 2Shanghai AI Laboratory, 3Beihang University
†Equal Contribution   *Project Leader
OVO-S-Bench overview (paper Fig. 1)

Overview of OVO-S-Bench. The benchmark evaluates streaming spatial understanding across four levels, from instantaneous egocentric perception and spatiotemporal context tracking to generative spatial reasoning and global topological mapping. The right panel summarizes representative model behavior across task families.

Abstract

Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous egocentric streams, often using evidence outside the current view. Existing benchmarks either evaluate offline over full videos or target events rather than spatial structure. We introduce OVO-S-Bench, a fully human-annotated benchmark for streaming spatial intelligence, comprising 1,680 questions over 348 source videos. Annotation involves 12 trained annotators (each also serving as a blind cross-reviewer) across roughly 804 person-hours of multi-round quality assurance. Each question carries a query timestamp and an evidence interval, and at evaluation, the model sees only the prefix preceding the query. Questions span four levels of increasing abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. Across 38 proprietary and open-source MLLMs, Gemini-3.1-Pro trails human experts by 27 points (59.2 vs. 86.6), with allocentric mapping as the dominant bottleneck. Notably, streaming and spatially fine-tuned MLLMs underperform their own backbones. We further find that chain-of-thought reasoning amplifies spatial errors when ungrounded in the stream. By exposing these limitations, OVO-S-Bench establishes a demanding testbed for next-generation streaming spatial MLLMs.

Representative OVO-S-Bench examples (paper Fig. 2)

Representative OVO-S-Bench examples. Each card pairs a spatial question with visual evidence, illustrating the progression from current-view perception to allocentric mapping.

Four-Level Streaming Spatial Taxonomy

OVO-S-Bench organizes questions into four levels by the spatial state a model must access at query time. The levels progress from evidence directly available in the current view to allocentric map queries that require cross-viewpoint integration, reflecting a gradient of persistence and abstraction.

Taxonomy and benchmark statistics (paper Fig. 3)

Taxonomy and benchmark statistics. The left panel gives the four-level spatial taxonomy; the right panels report task-family counts, source distribution, and evidence-interval lengths by level.


  • L1 – Instantaneous Egocentric Perception. Questions answerable from frames near the query timestamp alone. Task families: egocentric metric perception (distance, scale, clearance, viewpoint height), local spatial relations (containment, occlusion, support, visible layout), and dynamic spatial perception (camera motion, object motion, relative speed).
  • L2 – Spatiotemporal Context Tracking. Evidence appeared in the video prefix but is no longer visible at query time. Task families: scene revisit recognition, spatial memory beyond the view, and chronological spatial memory.
  • L3 – Spatial Simulation and Reasoning. The model must operate on spatial structure rather than merely retrieve an observation. Task families: spatial simulation (reorientation, removal consequences, physical feasibility), spatiotemporal consistency verification, and spatial route planning.
  • L4 – Allocentric Spatial Mapping. The model must integrate the egocentric stream into an allocentric representation and answer queries about its global structure. Task families: allocentric direction reasoning, topological structure reasoning, and trajectory-map alignment.

The released benchmark comprises 1,680 questions over 348 source videos from 9 datasets, organized into 30 canonical task types across four levels. The mean prefix length at query time is 8.8 minutes. Median evidence spans are 2.0 s (L1), 36.8 s (L2), 2.0 s (L3), and 278.7 s (L4), reflecting the spatial persistence each level demands.
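
The per-level evidence-span medians could be recomputed from the annotated evidence intervals along the following lines. This is a minimal sketch: the `(level, start_s, end_s)` tuple layout and the toy values are assumptions for illustration, not the format or contents of the release.

```python
from collections import defaultdict
from statistics import median


def median_evidence_span(items):
    """Return {level: median evidence-interval length in seconds}.

    `items` is an iterable of (level, start_s, end_s) tuples; this layout
    is a hypothetical stand-in for the released annotation format.
    """
    spans = defaultdict(list)
    for level, start_s, end_s in items:
        if end_s < start_s:
            raise ValueError(f"interval ends before it starts: {(start_s, end_s)}")
        spans[level].append(end_s - start_s)
    return {level: median(lengths) for level, lengths in sorted(spans.items())}


# Toy items (hypothetical values, not benchmark data).
toy_items = [
    ("L1", 100.0, 102.0),
    ("L1", 40.0, 43.0),
    ("L1", 10.0, 11.0),
    ("L4", 0.0, 300.0),
    ("L4", 20.0, 220.0),
]
toy_medians = median_evidence_span(toy_items)
print(toy_medians)
```

A median is a natural summary here because evidence spans are heavily skewed: a few long L4 intervals would dominate a mean.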

Benchmark Construction

Video sources. OVO-S-Bench draws from 9 publicly available sources covering five regimes: indoor walkthroughs (RoomTour3D), egocentric activities (Ego4D), outdoor/world scenes (Sekai, OmniWorld, YouTube walking tours), driving videos (CODa, Honda HDD), and spatially annotated 3D environments (ARKitScenes, VSI-Bench).

Human annotators write every item. Annotators with 3D-vision backgrounds choose clips with stable motion, clear viewpoints, and enough spatial variation for the target level. For each item, they record the video, task label, question, options, answer, query timestamp, and evidence interval. Some task types employ specialized construction techniques such as image editing to generate spatial-change contrasts.
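
The fields recorded per item suggest a simple record type. The sketch below is one possible representation; the class name, field names, and units are assumptions for illustration and may not match the released files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchItem:
    """One annotated question (hypothetical schema, not the official one)."""
    video_id: str
    task_label: str
    question: str
    options: tuple[str, ...]
    answer_index: int                           # index into `options`
    query_time_s: float                         # query timestamp, in seconds
    evidence_interval_s: tuple[float, float]    # (start, end), in seconds

    def validate(self) -> None:
        """Raise ValueError if the record is internally inconsistent."""
        start, end = self.evidence_interval_s
        if not 0 <= self.answer_index < len(self.options):
            raise ValueError("answer_index does not point at an option")
        if not 0 <= start <= end:
            raise ValueError("evidence interval must satisfy 0 <= start <= end")
        if end > self.query_time_s:
            raise ValueError("evidence must lie in the prefix before the query")


# Toy record (invented content, for illustration only).
example = BenchItem(
    video_id="vid_0001",
    task_label="L2/spatial_memory",
    question="Where was the red chair relative to the door?",
    options=("left", "right", "behind", "in front"),
    answer_index=2,
    query_time_s=120.0,
    evidence_interval_s=(30.0, 45.0),
)
example.validate()
```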

Streaming setting. The answer must be derivable from the video prefix before the query timestamp. Annotators mark the shortest interval that contains the needed evidence and write distractors that are plausible under the visual context but wrong under the annotated evidence.
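
The prefix constraint can be illustrated with a small frame-selection helper. This is a sketch under stated assumptions: frames are taken at a fixed rate from t = 0, only timestamps strictly before the query are eligible, and the eligible frames are subsampled uniformly to a fixed budget. The function name, the 1 fps rate, and the default budget are illustrative choices, not the benchmark's evaluation code.

```python
import math


def prefix_frame_times(query_time_s, fps=1.0, max_frames=128):
    """Timestamps (seconds) of frames a model may see for one query.

    Only frames strictly before `query_time_s` are eligible; if there are
    more than `max_frames` of them, they are subsampled uniformly.
    """
    if query_time_s <= 0 or fps <= 0 or max_frames <= 0:
        raise ValueError("query_time_s, fps and max_frames must be positive")
    # Frames sit at k / fps for k = 0, 1, ...; keep those with k / fps < query.
    n_available = math.ceil(query_time_s * fps)
    if n_available <= max_frames:
        indices = range(n_available)
    else:
        # Integer striding keeps indices distinct because n_available > max_frames.
        indices = [i * n_available // max_frames for i in range(max_frames)]
    return [k / fps for k in indices]


# A query 8.8 minutes into the video, matching the reported mean prefix length.
times = prefix_frame_times(8.8 * 60)
print(len(times), times[0], times[-1])
```

Nothing at or after the query timestamp is ever returned, so any evidence the model uses has to come from the prefix.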

Quality control removes shortcuts. A text-only LLM probe flags items that leak the answer through wording, common sense, or option asymmetry. A second annotator then cross-reviews each item without seeing the original answer, checking that the answer and evidence interval are sufficient. Recurring problems are folded back into the annotation guideline.
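
One way such a text-only probe could be run is sketched below. Everything here is an assumption for illustration: `probe` is a hypothetical callable standing in for a text-only LLM, and the shuffle count and hit threshold are arbitrary. Options are reshuffled on each trial so that a position bias in the probe is not mistaken for a leak.

```python
import random
from typing import Callable, Sequence

# Stand-in for a text-only LLM call: it sees the question and options,
# never the video, and returns the index of its chosen option.
Probe = Callable[[str, Sequence[str]], int]


def leaks_answer(question, options, answer_index, probe,
                 trials=5, min_hits=4, seed=0):
    """Flag an item if the blind probe finds the answer in most shuffles."""
    if len(set(options)) != len(options):
        raise ValueError("options must be distinct")
    rng = random.Random(seed)
    answer_text = options[answer_index]
    hits = 0
    for _ in range(trials):
        shuffled = list(options)
        rng.shuffle(shuffled)
        if shuffled[probe(question, shuffled)] == answer_text:
            hits += 1
    return hits >= min_hits


def longest_option_probe(question, options):
    """Toy probe exploiting one kind of option asymmetry: pick the longest."""
    return max(range(len(options)), key=lambda i: len(options[i]))


asymmetric = ["left", "right", "directly behind the sofa near the door", "front"]
balanced = ["left", "right", "behind", "front"]
flag_asymmetric = leaks_answer("Where is the lamp?", asymmetric, 2, longest_option_probe)
flag_balanced = leaks_answer("Where is the lamp?", balanced, 0, longest_option_probe)
print(flag_asymmetric, flag_balanced)
```

Flagged items would then go back to the annotator for rewording or new distractors rather than being auto-rejected.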

Key Findings

Six observations about the current state of streaming spatial intelligence, from 38 evaluated systems on OVO-S-Bench.

  • 27 pts – Significant gap with human performance. The strongest system, Gemini-3.1-Pro, reaches 59.2 overall, far below human experts under the same streaming protocol (86.6; 92.2 offline). The best open-source model, Qwen3-VL-235B-A22B, attains 53.6, trailing human-streaming by 33 points. The Random (31.3) and Text-Only (37.1) baselines fall below nearly every general backbone (the one exception in the leaderboard is Gemma-4 E2B at 36.1), indicating that the gap reflects genuine visual-streaming difficulty rather than language priors.
  • 28 / 34 – Allocentric mapping is the dominant bottleneck. L4 is the lowest-scoring level for 28 of 34 systems, with an average gap of 9.3 points between the L1–L3 mean and L4. Even the largest open-source backbones drop more than 10 points (Qwen3-VL-235B-A22B: 10.6; InternVL-3.5-241B-A28B: 13.8). The six exceptions all have L1 below 41, so their flipped ordering reflects degraded current-view perception rather than competent allocentric mapping.
  • +5.6 – Closed-source advantage is narrow and uneven. The closed-source lead is only 5.6 points overall (Gemini-3.1-Pro 59.2 vs. Qwen3-VL-235B-A22B 53.6), narrower than the 10+ point gap reported on recent video and multimodal benchmarks. The gap is uneven across levels: it widens on memory-heavy L2 (+5.9) and narrows on L4 (+4.1); on L3, the best open-source backbone exceeds Gemini-3.1-Pro by 5.3 points (61.2 vs. 55.9).
  • 13 / 15 – Specialization hurts the backbone. No streaming-architecture or spatially fine-tuned variant outperforms its comparable general backbone by more than 0.5 points, and 13 of 15 lag behind their own base on overall accuracy (median Δ = −2.0, range −18.4 to +0.5). L4 is the most uniformly damaged level: 13 of 15 methods regress on allocentric mapping (mean Δ = −6.1; Flash-VStream-7B −16.7, Cosmos-Reason1-7B −12.8).
  • +3.9 / −1.0 – Chain-of-thought is double-edged. Across paired thinking-mode comparisons, explicit reasoning consistently helps L2 (mean Δ = +3.9, 8/9 pairs positive) but shows a small mean drop on L1 (mean Δ = −1.0, 6/9 pairs negative). A GPT-5.4 judge over wrong traces finds that 60–80% of CoT failures are mis-grounded visual evidence (non-visual + visual-content errors) in GLM-4.6V-Flash, Qwen3-VL, and InternVL-3.5.
  • r ≈ 0 – Retention is not the bottleneck. For HERMES, StreamingTOM, and FluxMem, the per-query Pearson correlation between Evidence Recall and correctness is essentially zero (r ∈ [−0.07, 0.00]). Neither an oracle-evidence sampler nor doubling the frame budget improves over uniform 128 frames by more than +0.3 points. The 27-point gap to human performance therefore does not reduce to a retrieval problem solvable by better frame selection or larger memory.
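
The retention analysis rests on a per-query Pearson correlation between a continuous recall score and a binary correctness flag. The sketch below shows that computation on invented toy values; how Evidence Recall is actually defined and measured is not reproduced here, so the inputs are placeholders.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy per-query values (hypothetical): recall in [0, 1], correctness in {0, 1}.
recall = [0.0, 1.0, 0.0, 1.0]
correct_unrelated = [0, 0, 1, 1]   # correctness independent of recall
correct_aligned = [0, 1, 0, 1]     # correctness tracks recall exactly

r_unrelated = pearson_r(recall, correct_unrelated)
r_aligned = pearson_r(recall, correct_aligned)
print(r_unrelated, r_aligned)
```

With a binary outcome this is the point-biserial correlation. A value near zero, as reported, means that queries whose evidence was retained are answered no more accurately than queries whose evidence was dropped.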

OVO-S-Bench Leaderboard

Main results under the streaming protocol (multiple-choice accuracy). Top three are shaded; bold marks the best non-baseline per column. Baselines and human anchors are unranked.
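| Model | Params | L1 | L2 | L3 | L4 | Overall | Rank |
|---|---|---|---|---|---|---|---|
| Baselines & Controls | | | | | | | |
| Random Baseline | - | 29.8 | 35.1 | 33.3 | 27.1 | 31.3 | - |
| Text-Only (GPT-5.4) | - | 38.4 | 35.6 | 38.9 | 35.5 | 37.1 | - |
| Human (streaming) | - | 93.2 | 81.0 | 86.4 | 79.2 | 86.6 | - |
| Human (offline) | - | 97.0 | 86.2 | 94.2 | 89.2 | 92.2 | - |
| Closed-source proprietary MLLMs | | | | | | | |
| Gemini-3.1-Pro | - | 61.9 | 64.0 | 55.9 | 54.9 | 59.2 | 🥇 1 |
| GPT-5.4 | - | 54.6 | 57.6 | 50.8 | 40.5 | 50.9 | 5 |
| Gemini-3.1-Flash-Lite | - | 54.1 | 52.2 | 54.1 | 42.8 | 50.8 | 7 |
| Grok-4.1-Fast | - | 44.8 | 46.6 | 48.5 | 35.0 | 43.7 | 19 |
| Open-source general video MLLMs | | | | | | | |
| Qwen3-VL | 235B-A22B | 52.5 | 55.2 | 61.2 | 45.7 | 53.6 | 🥈 2 |
| Qwen3.5 | 397B-A17B | 49.6 | 55.4 | 58.1 | 45.4 | 52.1 | 🥉 3 |
| Qwen3.5 | 27B | 51.5 | 55.2 | 52.4 | 47.7 | 51.7 | 4 |
| InternVL-3.5 | 241B-A28B | 55.6 | 55.7 | 51.6 | 40.5 | 50.9 | 6 |
| InternVL-3.5 | 38B | 54.7 | 54.4 | 45.5 | 41.9 | 49.1 | 8 |
| Qwen3-VL | 32B | 50.1 | 51.9 | 51.9 | 41.2 | 48.8 | 9 |
| Qwen3-VL | 4B | 43.3 | 48.2 | 54.5 | 41.4 | 46.8 | 10 |
| Qwen3.5 | 9B | 47.5 | 49.4 | 50.2 | 36.6 | 45.9 | 12 |
| Qwen3.5 | 4B | 45.4 | 48.2 | 49.3 | 38.7 | 45.4 | 13 |
| InternVL-3.5 | 8B | 45.9 | 45.8 | 47.2 | 39.3 | 44.6 | 16 |
| Qwen2.5-VL | 7B | 40.7 | 45.5 | 45.9 | 44.7 | 44.2 | 17 |
| GLM-4.6V-Flash | 9B | 44.6 | 48.0 | 46.6 | 33.8 | 43.2 | 22 |
| Gemma-4 | 26B-A4B | 49.3 | 46.6 | 45.0 | 29.3 | 42.6 | 24 |
| Gemma-4 | E4B | 40.9 | 42.8 | 42.8 | 32.3 | 39.7 | 30 |
| Gemma-4 | E2B | 38.8 | 36.5 | 39.3 | 29.6 | 36.1 | 36 |
| Streaming video MLLMs | | | | | | | |
| StreamForest | 7B | 46.6 | 45.2 | 49.7 | 34.9 | 44.1 | 18 |
| StreamingVLM | 7B | 38.7 | 50.5 | 41.8 | 41.2 | 43.0 | 23 |
| Flash-VStream | 7B | 18.7 | 29.9 | 22.5 | 28.7 | 24.9 | 38 |
| Token-compression and memory-based methods | | | | | | | |
| FluxMem | 7B | 43.0 | 47.6 | 45.5 | 42.6 | 44.7 | 14 |
| HERMES | 7B | 40.9 | 45.4 | 49.4 | 42.9 | 44.6 | 15 |
| StreamingTOM | 7B | 37.2 | 48.2 | 38.7 | 33.5 | 39.4 | 31 |
| InfiniPot-V | 7B | 39.1 | 35.7 | 41.9 | 40.6 | 39.3 | 32 |
| Spatially fine-tuned MLLMs | | | | | | | |
| VST-7B-SFT | 7B | 43.3 | 44.0 | 43.6 | 37.9 | 42.2 | 26 |
| VST-7B-RL | 7B | 45.7 | 44.2 | 40.9 | 38.0 | 42.2 | 25 |
| SenseNova-SI-1.5 | 8B | 42.1 | 42.4 | 42.7 | 32.8 | 40.0 | 29 |
| Spatial-TTT | 2B | 38.7 | 35.4 | 41.0 | 32.7 | 37.0 | 33 |
| Cambrian-S | 7B | 40.2 | 40.0 | 36.9 | 29.9 | 36.8 | 34 |
| Spatial-MLLM | 7B | 35.7 | 39.2 | 34.4 | 36.3 | 36.4 | 35 |
| Cambrian-S-LFP | 7B | 38.8 | 38.0 | 34.2 | 28.7 | 34.9 | 37 |
| Embodied foundation models | | | | | | | |
| RynnBrain | 8B | 45.3 | 50.3 | 47.3 | 42.7 | 46.4 | 11 |
| VeBrain | 7B | 42.7 | 44.2 | 46.2 | 40.8 | 43.5 | 21 |
| RoboBrain2.5-NV | 8B | 42.9 | 46.6 | 50.6 | 34.1 | 43.6 | 20 |
| RoboBrain2.5 | 4B | 40.1 | 43.4 | 48.1 | 35.7 | 41.8 | 27 |
| Cosmos-Reason1 | 7B | 44.8 | 43.7 | 45.5 | 31.9 | 41.5 | 28 |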

Numbers replicated from the paper's main results table. The public dataset release is linked above; a submission portal and live leaderboard will follow.
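
The allocentric-bottleneck figures can be checked directly against the per-level scores above. The sketch below copies the L1–L4 columns from the leaderboard and assumes two things that are not stated explicitly: that the "gap" is the mean of L1–L3 minus L4, and that the 34 systems in the 28 / 34 count are the rows outside the closed-source group. Under those assumptions it reproduces the reported per-model gaps and the count.

```python
# (model, group, L1, L2, L3, L4), copied from the leaderboard above.
ROWS = [
    ("Gemini-3.1-Pro", "closed", 61.9, 64.0, 55.9, 54.9),
    ("GPT-5.4", "closed", 54.6, 57.6, 50.8, 40.5),
    ("Gemini-3.1-Flash-Lite", "closed", 54.1, 52.2, 54.1, 42.8),
    ("Grok-4.1-Fast", "closed", 44.8, 46.6, 48.5, 35.0),
    ("Qwen3-VL-235B-A22B", "open", 52.5, 55.2, 61.2, 45.7),
    ("Qwen3.5-397B-A17B", "open", 49.6, 55.4, 58.1, 45.4),
    ("Qwen3.5-27B", "open", 51.5, 55.2, 52.4, 47.7),
    ("InternVL-3.5-241B-A28B", "open", 55.6, 55.7, 51.6, 40.5),
    ("InternVL-3.5-38B", "open", 54.7, 54.4, 45.5, 41.9),
    ("Qwen3-VL-32B", "open", 50.1, 51.9, 51.9, 41.2),
    ("Qwen3-VL-4B", "open", 43.3, 48.2, 54.5, 41.4),
    ("Qwen3.5-9B", "open", 47.5, 49.4, 50.2, 36.6),
    ("Qwen3.5-4B", "open", 45.4, 48.2, 49.3, 38.7),
    ("InternVL-3.5-8B", "open", 45.9, 45.8, 47.2, 39.3),
    ("Qwen2.5-VL-7B", "open", 40.7, 45.5, 45.9, 44.7),
    ("GLM-4.6V-Flash", "open", 44.6, 48.0, 46.6, 33.8),
    ("Gemma-4-26B-A4B", "open", 49.3, 46.6, 45.0, 29.3),
    ("Gemma-4-E4B", "open", 40.9, 42.8, 42.8, 32.3),
    ("Gemma-4-E2B", "open", 38.8, 36.5, 39.3, 29.6),
    ("StreamForest", "open", 46.6, 45.2, 49.7, 34.9),
    ("StreamingVLM", "open", 38.7, 50.5, 41.8, 41.2),
    ("Flash-VStream", "open", 18.7, 29.9, 22.5, 28.7),
    ("FluxMem", "open", 43.0, 47.6, 45.5, 42.6),
    ("HERMES", "open", 40.9, 45.4, 49.4, 42.9),
    ("StreamingTOM", "open", 37.2, 48.2, 38.7, 33.5),
    ("InfiniPot-V", "open", 39.1, 35.7, 41.9, 40.6),
    ("VST-7B-SFT", "open", 43.3, 44.0, 43.6, 37.9),
    ("VST-7B-RL", "open", 45.7, 44.2, 40.9, 38.0),
    ("SenseNova-SI-1.5", "open", 42.1, 42.4, 42.7, 32.8),
    ("Spatial-TTT", "open", 38.7, 35.4, 41.0, 32.7),
    ("Cambrian-S", "open", 40.2, 40.0, 36.9, 29.9),
    ("Spatial-MLLM", "open", 35.7, 39.2, 34.4, 36.3),
    ("Cambrian-S-LFP", "open", 38.8, 38.0, 34.2, 28.7),
    ("RynnBrain", "open", 45.3, 50.3, 47.3, 42.7),
    ("VeBrain", "open", 42.7, 44.2, 46.2, 40.8),
    ("RoboBrain2.5-NV", "open", 42.9, 46.6, 50.6, 34.1),
    ("RoboBrain2.5-4B", "open", 40.1, 43.4, 48.1, 35.7),
    ("Cosmos-Reason1", "open", 44.8, 43.7, 45.5, 31.9),
]


def l4_gap(l1, l2, l3, l4):
    """Mean of L1-L3 minus L4 (assumed definition of the gap)."""
    return (l1 + l2 + l3) / 3 - l4


# Assumption: the 34 systems are the rows outside the closed-source group.
open_rows = [r for r in ROWS if r[1] == "open"]
exceptions = [r for r in open_rows if r[5] >= min(r[2:5])]  # L4 not the lowest
n_l4_lowest = len(open_rows) - len(exceptions)
gaps = {r[0]: round(l4_gap(*r[2:]), 1) for r in ROWS}

print(n_l4_lowest, len(open_rows))
print(sorted(r[0] for r in exceptions))
print(gaps["Qwen3-VL-235B-A22B"], gaps["InternVL-3.5-241B-A28B"])
```

All four closed-source rows also have L4 as their lowest level, so the pattern holds for 32 of the 38 ranked systems overall.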

BibTeX

@misc{li2026ovosbench,
  title         = {OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs},
  author        = {Li, Yifei and Liu, Pengyiang and Zang, Yuhang and Shi, Zhongyue and Fu, Qi and Hao, Hongye and Lu, Jiwen},
  year          = {2026},
  eprint        = {2606.03890},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2606.03890}
}