STAR-Bench

Probing Deep Spatio-Temporal Reasoning
as Audio 4D Intelligence

Zihan Liu^* · Zhikang Niu^* · Qiuyang Xiao · Zhisheng Zheng · Ruoqi Yuan · Yuhang Zang^†
Yuhang Cao · Xiaoyi Dong · Jianze Liang · Xie Chen · Leilei Sun · Dahua Lin · Jiaqi Wang^†

^* Equal Contribution. ^†Corresponding authors.

Introduction

We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Unlike prior benchmarks, where caption-only answering reduces accuracy only slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on cues that are hard to describe linguistically. Evaluating 19 models reveals substantial gaps to human performance and a capability hierarchy. STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.

A comparative overview of our benchmark against other representative audio benchmarks is shown below.

[Figure: Comparison among Benchmarks]

Data Examples

🎧 Please wear headphones and listen carefully to the audio examples below in a quiet environment. 🔊✨

Foundational Acoustic Perception

Note: In the Options column, the bolded choice indicates the correct answer corresponding to each question example.

Absolute Perception Range

Attribute	Range	Question	Options
Pitch, Loudness	125 Hz - 8000 Hz, -10 dB - 110 dB	The audio you just heard is divided into two halves. Does a sound appear in the first half, the second half, or is it not present at all?	(A) The first half (B) The second half (C) It is not present at all (D) Unable to determine
Azimuth	0° - 360°	Given that 0° is directly in front and the angle increases clockwise, which azimuth range is the sound most likely coming from?	(A) Front-Right (0°-90°) (B) Back-Right (90°-180°) (C) Back-Left (180°-270°) (D) Front-Left (270°-360°) (E) Unable to determine
Elevation	-90° - 90°	Where does the sound seem to be coming from in terms of elevation, relative to ear level?	(A) Above ear level (B) Below ear level (C) At ear level (D) Unable to determine
Distance	0 - 10 meters	How far away does the sound seem to be?	(A) Near (within about 0-3 meters) (B) Medium (around 3-8 meters) (C) Far (more than 8 meters) (D) Unable to determine
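
The azimuth and distance options in the table amount to binning a ground-truth value into a labeled range. The sketch below illustrates that mapping. The function names are hypothetical, and how boundary values (exactly 3 m or 8 m) are assigned is an assumption, since the table does not specify it; this is not taken from the STAR-Bench code.

```python
def azimuth_option(azimuth_deg: float) -> str:
    """Map an azimuth (0 deg = directly in front, increasing clockwise)
    to the quadrant option letter used in the table."""
    angle = azimuth_deg % 360.0           # wrap so that 360 deg is treated as 0 deg
    labels = ["A", "B", "C", "D"]         # Front-Right, Back-Right, Back-Left, Front-Left
    # min() guards against float wrap-around returning exactly 360.0
    return labels[min(int(angle // 90), 3)]


def distance_option(distance_m: float) -> str:
    """Map a distance in meters to Near (A), Medium (B), or Far (C).
    Boundary handling is an assumption: 3 m counts as Medium, 8 m as Medium."""
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    if distance_m < 3:
        return "A"
    if distance_m <= 8:
        return "B"
    return "C"
```

For example, a source at 135° falls in the Back-Right range (option B), and a source 9 meters away is labeled Far (option C).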

Relative Discrimination Sensitivity

Attribute	Level	Question	Options
Pitch	0, 50, 100, 200, 400, 1200 (cents)	Which sound has a higher pitch: the first sound, the second sound, or are they the same?	(A) The first sound has a higher pitch (B) The second sound has a higher pitch (C) Both sounds are the same (D) Unable to determine
Loudness	0, 4, 8, 12, 24, 48 (dB)	Which sound is louder: the first sound, the second sound, or are they the same?	(A) The first sound is louder (B) The second sound is louder (C) Both sounds are the same (D) Unable to determine
Duration	0, 20, 50, 100, 150, 200 (%)	Which sound is longer: the first sound, the second sound, or are they the same?	(A) The first sound is longer (B) The second sound is longer (C) Both sounds are the same (D) Unable to determine
Azimuth	30, 60, 90, 120, 150, 180 (°)	Are Audio 1 and Audio 2 at the same azimuth? (Consider differences of less than 45° as the same.)	(A) Same (B) Different (C) Unable to determine
Elevation	15, 90, 120, 150 (°)	Which audio has the higher elevation angle? (Consider differences of less than 45° as the same.)	(A) Audio 1 is higher (B) Audio 2 is higher (C) Both are at the same elevation (D) Unable to determine
Distance	1-2, 4-5, 6-7, 8-9 (meters)	Which audio is farther away? (Consider differences of less than 3 meters as the same.)	(A) Audio 1 is farther away (B) Audio 2 is farther away (C) Both audios are the same (D) Unable to determine
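
The levels in this table are expressed in cents, decibels, and percent, which correspond to standard ratios: a shift of c cents multiplies frequency by 2^(c/1200), and a level change of d dB multiplies amplitude by 10^(d/20). The sketch below shows those conversions together with a tolerance-based comparison matching the "consider differences of less than ... as the same" rule. All function names are hypothetical, and reading the duration level as a percentage increase over the reference is an assumption.

```python
def shift_pitch_cents(freq_hz: float, cents: float) -> float:
    """Frequency after shifting by `cents` (1200 cents = one octave)."""
    return freq_hz * 2.0 ** (cents / 1200.0)


def gain_from_db(delta_db: float) -> float:
    """Amplitude ratio corresponding to a level difference in dB."""
    return 10.0 ** (delta_db / 20.0)


def scale_duration(duration_s: float, percent: float) -> float:
    """Duration lengthened by `percent` (assumed to mean a relative increase)."""
    return duration_s * (1.0 + percent / 100.0)


def compare_with_tolerance(first: float, second: float, tolerance: float) -> str:
    """Return 'A' if the first value is larger, 'B' if the second is larger,
    and 'C' if they are equal or differ by less than `tolerance`."""
    if first == second or abs(first - second) < tolerance:
        return "C"
    return "A" if first > second else "B"
```

With these helpers, a 1200-cent level applied to 440 Hz gives 880 Hz, and two sources at 1.5 m and 8.5 m differ by more than the 3-meter tolerance, so the second is labeled farther away (option B).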

Holistic Spatio-Temporal Reasoning

Temporal Reasoning

Continuous Processes — Object Spatial Motion

Question: You are a specialized sound event ordering expert. Please listen to the following three audio clips labeled clip 1, clip 2, and clip 3, and determine the most natural chronological order in which these sounds would typically occur in the real world.

[Audio: uncut audio, followed by clip 1, clip 2, and clip 3]
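
A segment reordering item of this kind can be sketched as follows: cut a recording into three contiguous segments, present them shuffled as clip 1 to clip 3, and score a prediction by exact match against the chronological order. This is an illustrative sketch only. The equal-length cuts, the seed handling, and the function names are assumptions, not the benchmark's actual construction procedure.

```python
import random


def make_reordering_item(samples, seed=0):
    """Cut `samples` into three contiguous segments and shuffle them.

    Returns (clips, answer): `clips` are the shuffled segments, and `answer`
    lists the 1-based clip labels in their true chronological order.
    """
    n = len(samples)
    cuts = [0, n // 3, 2 * n // 3, n]      # equal-length cuts (an assumption)
    segments = [samples[cuts[i]:cuts[i + 1]] for i in range(3)]
    order = [0, 1, 2]
    random.Random(seed).shuffle(order)     # order[k] = segment shown as clip k+1
    clips = [segments[i] for i in order]
    # For each original segment, find which clip label now holds it.
    answer = [order.index(i) + 1 for i in range(3)]
    return clips, answer


def is_correct(predicted, answer):
    """Exact-match scoring of a predicted clip order."""
    return list(predicted) == list(answer)
```

Concatenating the clips in the order given by `answer` reconstructs the original recording, which is a quick sanity check on the item.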

Continuous Processes — In-Situ State Evolution

[Audio: uncut audio, followed by clip 1, clip 2, and clip 3]

Discrete Event Sequences — Tool & Appliance Operation

[Audio: uncut audio, followed by clip 1, clip 2, and clip 3]

Discrete Event Sequences — Daily Scene Scripts

[Audio: uncut audio, followed by clip 1, clip 2, and clip 3]

Discrete Event Sequences — Event-Triggered Consequences

[Audio: uncut audio, followed by clip 1, clip 2, and clip 3]

Spatial Reasoning

[Audio: native audio input, and channel-wise audio input as Audio 1 and Audio 2]

You are given a binaural recording: Audio 1 is the left-ear channel and Audio 2 is the right-ear channel.
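
The channel-wise format can be produced from a two-channel recording by de-interleaving its samples. A minimal sketch, assuming the samples are stored interleaved as left, right, left, right (the usual layout for stereo PCM); the function name is hypothetical:

```python
def split_binaural(interleaved):
    """Split interleaved stereo samples [L0, R0, L1, R1, ...] into two lists."""
    if len(interleaved) % 2:
        raise ValueError("expected an even number of samples (L/R pairs)")
    left = list(interleaved[0::2])    # Audio 1: left-ear channel
    right = list(interleaved[1::2])   # Audio 2: right-ear channel
    return left, right
```

Each returned list can then be saved as its own mono clip and passed to a model as Audio 1 and Audio 2.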

Evaluation Results

Leaderboard

Evaluation results of various models on STAR-Bench v0.5 are shown below.
(The leaderboard for v1.0 will be released soon.)

[Figure: Evaluation results on STAR-Bench v0.5]

Error Analysis

Error distribution across temporal and spatial tasks

[Figure: Error distribution]

Sensitivity analysis of fine-grained perception

[Figure: Audiogram curve]

Ablation study on temporal reasoning

[Figure: Temporal ablation]

Error Examples

[Audio: six temporal error examples, each with clip 1, clip 2, clip 3, and the uncut audio]

[Audio: one spatial error example, with native audio input and channel-wise audio input as Audio 1 and Audio 2]

You are given a binaural recording: Audio 1 is the left-ear channel and Audio 2 is the right-ear channel.

Insights

🔥 A clear capability hierarchy between closed-source and open-source models. Closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning.
🔥 Enhancing dense audio captioning. Open-source models struggle to produce dense, fine-grained captions, which limits their perceptual sensitivity and ability to extract embedded knowledge. Bridging this gap is a crucial first step.
🔥 Improving multi-audio reasoning. Open-source models lag significantly in comparing, integrating, and grounding information across multiple audio clips.
🔥 Moving beyond channel-averaged audio preprocessing. The common practice of averaging multi-channel audio into a mono signal is a major bottleneck for spatial reasoning. Developing architectures that natively process multi-channel cues is essential for unlocking genuine spatial awareness.
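
The cost of channel averaging can be seen in a toy example. The sketch below builds a two-channel signal whose right channel is an attenuated copy of the left, measures the interaural level difference (ILD), and shows that the mono downmix is identical for the scene and its left-right mirror, so the side of the source can no longer be recovered. The signal parameters and function names are illustrative assumptions; real binaural recordings also carry time and spectral cues that this sketch does not model.

```python
import math


def rms(x):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def interaural_level_difference_db(left, right):
    """ILD in dB; a positive value means the left ear is louder."""
    return 20.0 * math.log10(rms(left) / rms(right))


def downmix_to_mono(left, right):
    """Average the two channels sample by sample."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]


# Toy source on the listener's left: the right ear hears it at half amplitude.
left = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
right = [0.5 * v for v in left]

ild = interaural_level_difference_db(left, right)   # about 6.02 dB
mono = downmix_to_mono(left, right)
mono_mirrored = downmix_to_mono(right, left)        # source moved to the right
```

Here `mono` and `mono_mirrored` are the same signal even though the two scenes have opposite ILDs, which is why a model that only sees the averaged signal cannot tell left from right.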

Citation

@article{liu2025starbench,
  title={STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence},
  author={Liu, Zihan and Niu, Zhikang and Xiao, Qiuyang and Zheng, Zhisheng and Yuan, Ruoqi and Zang, Yuhang and Cao, Yuhang and Dong, Xiaoyi and Liang, Jianze and Chen, Xie and Sun, Leilei and Lin, Dahua and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2510.24693},
  year={2025}
}
