RNG-Bench · Reconstructive Non-Markov Games

Beyond the Current Observation:
Evaluating Multimodal LLMs in Non-Markov Games

Remember · Reconstruct · Act

Shengyuan Ding1,2,3,*, Xilin Wei1,*, Xinyu Fang4,*, Haodong Duan5,†,
Dahua Lin3,5, Jiaqi Wang2,†, Yuhang Zang3,†

1Fudan University   2Shanghai Innovation Institute   3Shanghai AI Laboratory
4Zhejiang University   5The Chinese University of Hong Kong
*Equal contribution   Corresponding authors

4Controllable axes
~128KMax context tokens
~350Images / episode
5Frontier MLLMs
RNG-Bench: Markov vs. Non-Markov, and the two-game test suite

(a) In Markov games (Go, chess) the visible state determines the move; in Non-Markov games two identical-looking states need different actions once their histories differ. (b) RNG-Bench instantiates this as Matching Pairs and 3D Maze, controllable along scale, modality, pattern, and external memory.

Abstract

Hidden state, closed loop, controllable scale

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. Existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended.

We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games — Matching Pairs, where briefly revealed card identities must later be recalled by location, and 3D Maze, where egocentric views must be integrated into a spatial map — both run under a unified harness with three controlled difficulty axes (grid size, visual pattern, and observation modality), a head-to-head duel protocol that removes instance variance, and a Memory Gap metric that disentangles forgetting from poor action selection.

The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

OpenAIGPT-5.4
GoogleGemini-3.1-Pro
KimiKimi-K2.5
QwenQwen3.5-397B
ByteDanceSeed-2.0-Lite
Results

Frontier MLLMs are far from saturation

Single-player Matching Pairs at 10×10 (image, noise theme) and 3D Maze at 13×13 (no minimap, mean optimal path 60 steps).

Model Matching Pairs 10×10 3D Maze 13×13
PF%↓IA%↓Resp./Score↓Score%↑ SR%↑Explore%↑Walls↓Eff.%↑GS%↑
GPT-5.4 0.04.38.0162.3 20.032.33.275.730.5
Gemini-3.1-Pro 0.42.510.0050.0 50.036.40.162.549.7
Seed-2.0-Lite 1.24.311.5743.2 20.019.416.638.921.7
Kimi-K2.5 1.82.813.1638.0 10.017.97.161.116.1
Qwen3.5-397B 0.03.019.7425.3 0.021.09.90.010.5

Single-player. Score% = fraction of matched pairs; GS% = aggregate maze score (success rate, efficiency, exploration). PF/IA = parse-failure / invalid-action rates; Resp./Score = responses per matched pair; Eff. averaged over successful episodes only. Best per column in bold.

Duel — Matching Pairs (image, poker)

Each model plays 16 games against the other four (both player orders, two seeds). The ranking flips from single-player: Gemini-3.1-Pro wins every matchup by exploiting cards revealed by its opponent.

ModelWin%↑WTLScore%↑ELO↑
Gemini-3.1-Pro100.0160036.51803
GPT-5.450.072725.31492
Qwen3.5-397B46.771818.01476
Kimi-K2.537.552918.01423
Seed-2.0-Lite15.6211312.31306
Scale sweep across grid sizes

Score degrades smoothly as the hidden state grows.

Gemini-vs-GPT duel trajectories

Gemini-3.1-Pro consistently out-paces GPT-5.4 on duel boards.

Controllable difficulty

Four axes, held-out everything else

Both games share one harness and one strict parser, so a change along any axis maps to belief-state tracking rather than a parallel confound. Scaling the grid raises the hidden-state load and the episode length together — up to ~128K tokens and 350 images per episode.

AxisMatching Pairs3D Maze
Scaleboard size 4×4 → 12×14maze size 5×5 → 15×15
Modalitytext · ASCII image · pattern imagetext-symbolic · 2D patch · 3D scene
Patternpoker · noise · textures · perlin · …wall-style variants
External memoryre-show prior snapshots (oracle)minimap on / off

Plus per-game interventions — action feedback, CoT, response budget (Matching Pairs); ask-output, history window (3D Maze) — a head-to-head duel protocol that removes instance variance, and a Memory Gap metric (oracle vs. normal) that separates forgetting from action choice.

Environments

Two complementary hidden-state regimes

Both games are simple but diagnostic POMDPs: rule misunderstanding is held fixed by in-prompt rules and a strict parser, so drops along any axis reflect belief-state tracking, not parallel confounds.

Matching Pairs

A rectangular grid of face-down card pairs. Each turn the model flips two cards: matched pairs disappear, unmatched pairs flip back. The hidden state is the set of identity-location bindings briefly seen and now hidden.

static categorical 4×4 → 12×14

Axes: board size · visual pattern (poker, noise, textures, …) · modality (text / image) · action feedback · CoT · response budget.

3D Maze

The agent navigates from start to goal in a procedurally generated maze, seeing only an egocentric first-person rendering and the dialogue history. The hidden state is the topology, visited cells, position, and orientation — built incrementally from local views.

dynamic spatial 5×5 → 15×15

Axes: maze size · minimap availability · ask-output prompting (externalize belief) · history window length.

The two environments: observation, action, hidden state, and optimal play

For each game: what the model observes, the action it takes, the hidden state it must reconstruct, and what optimal play looks like — plus the pattern / size / modality / memory variants.

Positioning

Why another game benchmark?

Prior LLM/VLM game and memory benchmarks rarely make non-Markov, remember-to-act the central, controllable axis: many expose the full state, others bundle hidden information with exploration / planning / social skills, and memory suites probe recall only with a post-hoc question. RNG-Bench is closed-loop, multimodal, non-Markov-focused, and scalable at once.

Benchmark Eval Multimodal Closed-loop NM-focus Scalable Max Ctx (K) Max #Img
GameBench · fully-visible games Agent~61
AgentBench · agent suite Agent120
BALROG · RL POMDPs Base~161
AvalonBench · hidden roles Agent3.50
EMemBench · episodic memory QA Both204
RNG-Bench · ours Base128350

yes · ~ partial · no. Eval: raw model (Base) vs. wrapped harness (Agent). NM-focus: non-Markov recall as the central axis. A representative row per family is shown; see the paper for the full table.

Gameplay

A single Matching Pairs round

The model never sees the ground-truth board (right) — at every step it observes only the current flips, then must use history to choose the next two coordinates.

Round start 1Round start
First flip revealed 2Flip A1
Both flips revealed 3Flip B1
Ground-truth board (oracle) Oracle only
Metric

Memory Gap: forgetting vs. decision-making

For each model we run two conditions on the same instance:

  • Normal — the model sees only the current observation plus the in-context history.
  • Oracle — the true hidden state is injected into the prompt at every step.
MemoryGap(m) = (1 − S(m) / S*(m)) × 100 %

A large gap localizes the bottleneck to belief-state reconstruction. A small gap points to perception, decision-making, or rule understanding.

Memory Gap chart

Injecting external memory (a MemMap for Matching Pairs, a minimap for 3D Maze) recovers a large Memory Gap — 46–51 points on Matching Pairs and 31–41 on 3D Maze — confirming forgetting as the dominant bottleneck.

Training

Closing the gap with simulator rollouts

Because both environments are simulators, we can roll out fresh trajectories with known optimal actions and use them as supervision — training and evaluation use disjoint board / maze sizes and seeds, so no training instance recurs at test time.

Qwen3.5-9B (held-out scale) Match Score%↑Match Resp./Score↓Maze SR%↑Maze GS%↑
base0.00.01.5
+ opt32k (optimal-policy)14.614.70.05.0
+ rmix32k (optimal + model rollouts)29.56.810.016.3

Evaluation sizes are strictly larger than the training pool. Optimal rollouts teach the rules; adding 6K filtered model rollouts (rmix32k) nearly halves response cost and gives the first non-zero maze success. The same checkpoint also transfers outward — EMemBench +5.2, a memory / spatial suite +3.4 (group mean) — with general multimodal capability preserved (+0.5 group mean).

Cite

BibTeX

@article{rngbench2026,
  title   = {Beyond the Current Observation: Evaluating Multimodal
             Language Models in Non-Markov Games},
  author  = {Ding, Shengyuan and Wei, Xilin and Fang, Xinyu and Duan, Haodong
             and Lin, Dahua and Wang, Jiaqi and Zang, Yuhang},
  journal = {arXiv preprint arXiv:2606.19338},
  year    = {2026},
}