STAR-Bench

Probing Deep Spatio-Temporal Reasoning
as Audio 4D Intelligence

Zihan Liu* · Zhikang Niu* · Qiuyang Xiao · Zhisheng Zheng · Ruoqi Yuan · Yuhang Zang
Yuhang Cao · Xiaoyi Dong · Jianze Liang · Xie Chen · Leilei Sun · Dahua Lin · Jiaqi Wang

* Equal Contribution. Corresponding authors.

Introduction

We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Unlike prior benchmarks, where caption-only answering reduces accuracy only slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on cues that are hard to describe linguistically. Evaluating 19 models reveals substantial gaps to humans and a capability hierarchy. STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
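To make the segment-reordering idea concrete, here is a minimal sketch of how such an item could be built and scored. It is an illustration only and not the STAR-Bench pipeline: the helper names `make_reordering_item` and `is_correct`, the equal-length cuts, the number of segments, and the exact-match scoring rule are all assumptions introduced here.

```python
import random
from typing import List, Sequence, Tuple


def make_reordering_item(
    samples: Sequence[float], num_segments: int, seed: int
) -> Tuple[List[List[float]], List[int]]:
    """Cut a clip into equal-length segments and shuffle them (hypothetical helper).

    Returns the shuffled segments and the answer: the order in which the
    shuffled segments must be played to restore the original clip.
    Trailing samples that do not fill a whole segment are dropped.
    """
    seg_len = len(samples) // num_segments
    segments = [
        list(samples[i * seg_len:(i + 1) * seg_len]) for i in range(num_segments)
    ]

    # perm[j] is the original index of the segment placed at shuffled position j.
    perm = list(range(num_segments))
    random.Random(seed).shuffle(perm)
    shuffled = [segments[i] for i in perm]

    # Playing shuffled positions in ascending order of their original index
    # reconstructs the clip.
    answer = sorted(range(num_segments), key=lambda j: perm[j])
    return shuffled, answer


def is_correct(predicted: Sequence[int], answer: Sequence[int]) -> bool:
    """Exact-match scoring (an assumed rule): the full order must be right."""
    return list(predicted) == list(answer)


# Usage: a toy "clip" of 12 samples cut into 3 segments.
clip = list(range(12))
shuffled, answer = make_reordering_item(clip, num_segments=3, seed=0)
restored = [x for j in answer for x in shuffled[j]]
```

Exact match is the strictest choice; a partial-credit metric over pairwise orderings would be an equally plausible alternative under these assumptions.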

A comparative overview of our benchmark against other representative audio benchmarks is shown below.

Comparison among Benchmarks

Data Examples

🎧 Please wear headphones and listen carefully to the audio examples below in a quiet environment. 🔊✨

Foundational Acoustic Perception

Note: In the Options column, the bolded choice indicates the correct answer corresponding to each question example.

Holistic Spatio-Temporal Reasoning

Temporal Reasoning


Spatial Reasoning

Evaluation Results

Leaderboard

Evaluation results of various models on STAR-Bench v0.5 are shown below.
(The leaderboard for v1.0 will be released soon.)

Results

Error Analysis

Error distribution across temporal and spatial tasks

The sensitivity analysis in fine-grained perception

The ablation study on temporal reasoning

Error Examples

Insights

Citation

@article{liu2025starbench,
  title={STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence},
  author={Liu, Zihan and Niu, Zhikang and Xiao, Qiuyang and Zheng, Zhisheng and Yuan, Ruoqi and Zang, Yuhang and Cao, Yuhang and Dong, Xiaoyi and Liang, Jianze and Chen, Xie and Sun, Leilei and Lin, Dahua and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2510.24693},
  year={2025}
}