We formalize audio 4D intelligence that is defined as reasoning over sound dynamics in time and 3D space, and introduce a STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perceptionsetting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories.Unlike prior benchmarks where caption-only answering reduces accuracy slightly, STAR-Bench induces far larger drops (-31.5\% temporal, -35.2\% spatial), evidencing its focus on linguistically hard-to-describe cues. Evaluating 19 models reveals substantial gaps to humans and a capability hierarchy. Our STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
A comparative overview of our benchmark against other representative audio benchmarks is shown below.
STAR-Bench