STAR-Bench

Probing Deep Spatio-Temporal Reasoning
as Audio 4D Intelligence

Zihan Liu* · Zhikang Niu* · Qiuyang Xiao · Zhisheng Zheng · Ruoqi Yuan · Yuhang Zang
Yuhang Cao · Xiaoyi Dong · Jianze Liang · Xie Chen · Leilei Sun · Dahua Lin · Jiaqi Wang

* Equal Contribution. Corresponding authors.

Introduction


We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Unlike prior benchmarks, where caption-only answering reduces accuracy only slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on linguistically hard-to-describe cues. Evaluating 19 models reveals substantial gaps to human performance and a capability hierarchy among models. STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
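The caption-only comparison above can be read as the difference in accuracy between two input conditions on the same questions. The following is a minimal sketch of that calculation, not the benchmark's actual evaluation code: the function names, the toy predictions, and the reading of the drop as a difference in percentage points are all assumptions made for illustration.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return multiple-choice accuracy in percent (hypothetical helper)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def caption_only_drop(
    audio_predictions: Sequence[str],
    caption_predictions: Sequence[str],
    answers: Sequence[str],
) -> float:
    """Accuracy change, in percentage points, when the audio input is
    replaced by a text caption (assumed definition; negative means a drop)."""
    return accuracy(caption_predictions, answers) - accuracy(audio_predictions, answers)


# Toy data for illustration only; these are not STAR-Bench results.
answers = ["A", "C", "B", "D"]
audio_predictions = ["A", "C", "B", "A"]    # 3 of 4 correct
caption_predictions = ["A", "B", "D", "A"]  # 1 of 4 correct

drop = caption_only_drop(audio_predictions, caption_predictions, answers)
print(f"caption-only change: {drop:+.1f} points")
```

Under this reading, a strongly negative value suggests that the questions depend on cues that a caption does not capture.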

A comparative overview of our benchmark against other representative audio benchmarks is shown below.

Comparison among Benchmarks

Data Examples

🎧 Please wear headphones and listen carefully to the audio examples below in a quiet environment. 🔊✨

Foundational Acoustic Perception

Note: In the Options column, the bolded choice is the correct answer for each example question.

Holistic Spatio-Temporal Reasoning

Temporal Reasoning


Spatial Reasoning

Evaluation Results

Leaderboard

Evaluation results of various models on STAR-Bench v0.5 are shown below.
(The leaderboard for v1.0 will be released soon.)

Results

Error Analysis

Error distribution across temporal and spatial tasks


Sensitivity analysis in fine-grained perception


Ablation study on temporal reasoning


Error Examples

Insights

Citation