RNG-Bench: Reconstructive Non-Markov Games

Abstract

Hidden state, closed loop, controllable scale

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. Existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended.

We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games — Matching Pairs, where briefly revealed card identities must later be recalled by location, and 3D Maze, where egocentric views must be integrated into a spatial map — both run under a unified harness with three controlled difficulty axes (grid size, visual pattern, and observation modality), a head-to-head duel protocol that removes instance variance, and a Memory Gap metric that disentangles forgetting from poor action selection.

The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

Evaluated models: GPT-5.4 · Gemini-3.1-Pro · Kimi-K2.5 · Qwen3.5-397B · Seed-2.0-Lite · Seed-2.1-Pro

Results

Frontier MLLMs are far from saturation

Single-player Matching Pairs at 10×10 (image, noise theme) and 3D Maze at 13×13 (no minimap, mean optimal path 60 steps).

Model	Matching Pairs 10×10				3D Maze 13×13
Model	PF%↓	IA%↓	Resp./Score↓	Score%↑	SR%↑	Explore%↑	Walls↓	Eff.%↑	GS%↑
Seed-2.1-Pro	6.3	4.1	7.9	64.6	30.0	35.4	12.9	29.1	32.0
GPT-5.4	0.0	4.3	8.0	62.3	20.0	32.3	3.2	75.7	30.5
Gemini-3.1-Pro	0.4	2.5	10.0	50.0	50.0	36.4	0.1	62.5	49.7
Seed-2.0-Lite	1.2	4.3	11.6	43.2	20.0	19.4	16.6	38.9	21.7
Kimi-K2.5	1.8	2.8	13.2	38.0	10.0	17.9	7.1	61.1	16.1
Qwen3.5-397B	0.0	3.0	19.7	25.3	0.0	21.0	9.9	0.0	10.5

Single-player results. Score% = fraction of pairs matched; GS% = aggregate maze score combining success rate (SR), efficiency (Eff.), and exploration (Explore). PF / IA = parse-failure / invalid-action rates; Resp./Score = responses per matched pair; Eff. is averaged over successful episodes only.

Duel — Matching Pairs (image, poker)

Each model plays 16 games against the other four (both player orders, two seeds). The ranking flips from single-player: Gemini-3.1-Pro wins every matchup by exploiting cards revealed by its opponent.

Model	Win%↑	W	T	L	Score%↑	ELO↑
Gemini-3.1-Pro	100.0	16	0	0	36.5	1803
GPT-5.4	50.0	7	2	7	25.3	1492
Qwen3.5-397B	46.9	7	1	8	18.0	1476
Kimi-K2.5	37.5	5	2	9	18.0	1423
Seed-2.0-Lite	15.6	2	1	13	12.3	1306
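
The duel schedule described above (each model against the other four, in both player orders, on two seeds) can be enumerated with a short sketch. The seed values and the function name `duel_schedule` are assumptions made for illustration; the source only states the counts.

```python
from itertools import permutations

MODELS = ["Gemini-3.1-Pro", "GPT-5.4", "Qwen3.5-397B", "Kimi-K2.5", "Seed-2.0-Lite"]
SEEDS = [0, 1]  # hypothetical values; the page only says "two seeds"


def duel_schedule(models, seeds):
    """Return every (first player, second player, seed) game.

    permutations(models, 2) yields ordered pairs, so both player orders
    are covered for each pairing.
    """
    return [
        (first, second, seed)
        for first, second in permutations(models, 2)
        for seed in seeds
    ]


schedule = duel_schedule(MODELS, SEEDS)

# Number of games each model takes part in, as either player.
games_per_model = {
    m: sum(m in (first, second) for first, second, _ in schedule)
    for m in MODELS
}

print(len(schedule))    # 40 games in total
print(games_per_model)  # 16 games for every model
```

Four opponents times two orders times two seeds gives the 16 games per model reported in the table, and the W/T/L columns of each row sum to that number.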

Single-player and duel rankings diverge. In 16 head-to-head matches on the same 8×10 boards, Gemini-3.1-Pro wins every matchup (averaging 14.6 matched pairs/game vs. GPT-5.4's 10.1), while single-player is topped by Seed-2.1-Pro and GPT-5.4 — duels reward exploiting opponent-revealed cards, a channel single-player cannot probe.
Performance drops sharply as the hidden state grows. Qwen3.5-397B falls from 90.6 % on 4×4 to 0.7 % on 12×12 Matching Pairs; its 3D-Maze Game Score peaks at 7×7 (66.7) and declines to 19.7 by 15×15.
Visual recognition, not history length, is the bottleneck. Qwen3.5-397B and Kimi-K2.5 solve Matching Pairs perfectly under text, but fall to 38.3 % and 43.3 % under noise-pattern images at the same scale.
The textual action trace is load-bearing. Removing the model's own action history collapses GPT-5.4 from 62.3 % to 15.3 % at 10×10 — even though every flip is visible in the board image — so the action trace is not redundant decoration.
Large headroom remains. An optimal policy needs only 3.24 responses per matched pair versus 7.9 for the strongest model — roughly 59 % fewer moves.
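
The arithmetic behind two of the figures above can be checked in a few lines. This is only a consistency check on numbers quoted on this page; the helper name `relative_saving` is an illustrative choice.

```python
def relative_saving(optimal: float, observed: float) -> float:
    """Percentage of responses the optimal policy saves relative to a model."""
    return (1.0 - optimal / observed) * 100.0


# Headroom: 3.24 responses per matched pair (optimal) vs. 7.9 (strongest model).
saving = relative_saving(3.24, 7.9)
print(round(saving, 1))  # 59.0

# Duel boards are 8x10, i.e. 80 cards or 40 pairs, so Score% converts to pairs.
pairs_on_board = 8 * 10 // 2
gemini_pairs = 0.365 * pairs_on_board  # Score% 36.5
gpt_pairs = 0.253 * pairs_on_board     # Score% 25.3
print(round(gemini_pairs, 1), round(gpt_pairs, 1))  # 14.6 10.1
```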

Score degrades smoothly as the hidden state grows.

Gemini-3.1-Pro consistently out-paces GPT-5.4 on duel boards.

Controllable difficulty

Four axes, everything else held fixed

Both games share one harness and one strict parser, so a performance change along any axis can be attributed to belief-state tracking rather than to a parallel confound. Scaling the grid raises the hidden-state load and the episode length together, up to ~128K tokens and 350 images per episode.

Axis	Matching Pairs	3D Maze
Scale	board size 4×4 → 12×14	maze size 5×5 → 15×15
Modality	text · ASCII image · pattern image	text-symbolic · 2D patch · 3D scene
Pattern	poker · noise · textures · perlin · …	wall-style variants
External memory	re-show prior snapshots (oracle)	minimap on / off

Each game also has its own interventions: action feedback, CoT, and response budget for Matching Pairs; ask-output and history window for 3D Maze. Two further tools apply to both games: a head-to-head duel protocol that removes instance variance, and a Memory Gap metric (oracle vs. normal) that separates forgetting from action choice.
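
One way to picture a single-axis sweep is a frozen configuration object in which exactly one field changes at a time. The sketch below is hypothetical: the class `MatchingPairsConfig`, its field names, and the string values are illustrative stand-ins for the axes in the table, not the benchmark's actual configuration schema.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MatchingPairsConfig:
    """Hypothetical config; one field per difficulty axis in the table."""
    rows: int = 4
    cols: int = 4
    modality: str = "pattern_image"  # assumed labels: text, ascii_image, pattern_image
    pattern: str = "poker"           # e.g. poker, noise, textures, perlin
    external_memory: bool = False    # oracle re-showing of prior snapshots


def one_axis_sweep(base, axis, values):
    """Vary a single axis and copy every other field from the base config."""
    return [replace(base, **{axis: value}) for value in values]


base = MatchingPairsConfig()
pattern_sweep = one_axis_sweep(base, "pattern", ["poker", "noise", "textures", "perlin"])

for cfg in pattern_sweep:
    print(cfg)
```

Because the dataclass is frozen, each sweep entry is an independent copy, and any difference between two runs can be traced to the one field that changed.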

Environments

Two complementary hidden-state regimes

Both games are simple but diagnostic POMDPs: rule understanding is controlled by in-prompt rules and a strict parser, so drops along any axis reflect belief-state tracking rather than parallel confounds.

Matching Pairs

A rectangular grid of face-down card pairs. Each turn the model flips two cards: matched pairs disappear, unmatched pairs flip back. The hidden state is the set of identity-location bindings briefly seen and now hidden.

Hidden state: static, categorical · Scale: 4×4 → 12×14

Axes: board size · visual pattern (poker, noise, textures, …) · modality (text / image) · action feedback · CoT · response budget.
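
The rules above fit in a small simulator. The following is a minimal sketch, not the benchmark's implementation: the class `MatchingPairs`, the `step` signature, and the choice to pick both coordinates in one action are assumptions. The accompanying `perfect_memory_policy` is a simple never-forgets baseline and should not be read as the optimal policy referenced in the results.

```python
import random


class MatchingPairs:
    """Minimal board: identities are visible only while a card is flipped."""

    def __init__(self, rows, cols, seed=0):
        assert (rows * cols) % 2 == 0, "board needs an even number of cards"
        ids = list(range(rows * cols // 2)) * 2
        random.Random(seed).shuffle(ids)
        cells = [(r, c) for r in range(rows) for c in range(cols)]
        self.hidden = dict(zip(cells, ids))  # ground truth, never shown in full
        self.remaining = set(cells)          # cards still on the board
        self.matched = 0

    def step(self, a, b):
        """Flip cards a and b and return the identities revealed this turn."""
        assert a != b and a in self.remaining and b in self.remaining
        seen = {a: self.hidden[a], b: self.hidden[b]}
        if seen[a] == seen[b]:               # matched pairs disappear
            self.remaining -= {a, b}
            self.matched += 1
        return seen                          # unmatched cards flip back


def perfect_memory_policy(env, memory):
    """Flip a known pair if memory holds one, otherwise explore unseen cards."""
    first_seen = {}
    for pos in sorted(env.remaining):
        if pos in memory:
            ident = memory[pos]
            if ident in first_seen:
                return first_seen[ident], pos
            first_seen[ident] = pos
    unseen = [p for p in sorted(env.remaining) if p not in memory]
    pool = unseen + [p for p in sorted(env.remaining) if p in memory]
    return pool[0], pool[1]


env = MatchingPairs(4, 4, seed=0)
memory, turns = {}, 0
while env.remaining:
    a, b = perfect_memory_policy(env, memory)
    memory.update(env.step(a, b))  # the identity-location bindings to remember
    turns += 1

print(env.matched, turns)
```

The `memory` dictionary is exactly the hidden state a model has to reconstruct from its context: every identity-location binding that was shown once and is now face-down again.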

3D Maze

The agent navigates from start to goal in a procedurally generated maze, seeing only an egocentric first-person rendering and the dialogue history. The hidden state is the topology, visited cells, position, and orientation — built incrementally from local views.

Hidden state: dynamic, spatial · Scale: 5×5 → 15×15

Axes: maze size · minimap availability · ask-output prompting (externalize belief) · history window length.
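
An optimal path length of the kind quoted for the 13×13 mazes can be obtained with breadth-first search over the maze grid. The sketch below assumes a plain character grid (`#` for walls) and counts only cell-to-cell moves; the benchmark's own maze representation, and whether turning in place counts as a step, are not specified here.

```python
from collections import deque


def optimal_path_length(grid, start, goal):
    """Breadth-first search over open cells.

    Returns the number of moves on a shortest path, or None if the goal
    cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in visited:
                visited.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


# Toy 5x5 maze for illustration only; S = start at (1, 1), G = goal at (3, 3).
MAZE = [
    "#####",
    "#S..#",
    "#.#.#",
    "#..G#",
    "#####",
]

steps = optimal_path_length(MAZE, start=(1, 1), goal=(3, 3))
print(steps)  # 4
```

The search has access to the full grid, which is precisely what the model lacks: it has to assemble the same topology from a sequence of first-person views.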

The two environments: observation, action, hidden state, and optimal play

For each game: what the model observes, the action it takes, the hidden state it must reconstruct, and what optimal play looks like — plus the pattern / size / modality / memory variants.

Positioning

Why another game benchmark?

Prior LLM/VLM game and memory benchmarks rarely make non-Markov, remember-to-act behavior the central, controllable axis: many expose the full state, others bundle hidden information with exploration, planning, or social skills, and memory suites probe recall only through post-hoc questions. RNG-Bench is closed-loop, multimodal, non-Markov-focused, and scalable at once.

Benchmark	Eval	Multimodal	Closed-loop	NM-focus	Scalable	Max Ctx (K)	Max #Img
GameBench · fully-visible games	Agent	~	✓	✗	✗	6	1
AgentBench · agent suite	Agent	✗	✓	✗	✗	12	0
BALROG · RL POMDPs	Base	✓	✓	✗	~	16	1
AvalonBench · hidden roles	Agent	✗	✓	✗	✗	3.5	0
EMemBench · episodic memory QA	Both	✓	✗	✗	✗	20	4
RNG-Bench · ours	Base	✓	✓	✓	✓	128	350

✓ yes · ~ partial · ✗ no. Eval: raw model (Base) vs. wrapped harness (Agent). NM-focus: non-Markov recall as the central axis. A representative row per family is shown; see the paper for the full table.

Gameplay

A single Matching Pairs round

The model never sees the ground-truth board (right) — at every step it observes only the current flips, then must use history to choose the next two coordinates.

Figure panels: 1 Round start · 2 Flip A1 · 3 Flip B1 · ★ Oracle only

Metric

Memory Gap: forgetting vs. decision-making

For each model we run two conditions on the same instance:

Normal — the model sees only the current observation plus the in-context history.
Oracle — the true hidden state is injected into the prompt at every step.

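MemoryGap(m) = (1 − S(m) / S*(m)) × 100 %

Here S(m) is the score of model m in the Normal condition and S*(m) is its score in the Oracle condition.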

A large gap localizes the bottleneck to belief-state reconstruction. A small gap points to perception, decision-making, or rule understanding.

Injecting external memory (a MemMap for Matching Pairs, a minimap for 3D Maze) exposes a large Memory Gap of 46–51 points on Matching Pairs and 31–41 on 3D Maze, which points to forgetting as the dominant bottleneck.
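
The metric is a one-line computation once both conditions have been run on the same instance. The sketch below follows the formula above; the function name and the example scores are illustrative and are not results from the benchmark.

```python
def memory_gap(score_normal: float, score_oracle: float) -> float:
    """Memory Gap in percent: the share of the oracle score lost without the oracle."""
    if score_oracle <= 0:
        raise ValueError("oracle score must be positive")
    return (1.0 - score_normal / score_oracle) * 100.0


# Hypothetical scores, chosen only to show the two regimes.
print(memory_gap(score_normal=40.0, score_oracle=80.0))  # 50.0, large gap
print(memory_gap(score_normal=80.0, score_oracle=80.0))  # 0.0, no gap
```

A gap near zero means the model already acts as well as it would with the hidden state in view, so any remaining errors lie in perception, decision-making, or rule understanding.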

Training

Closing the gap with simulator rollouts

Because both environments are simulators, we can roll out fresh trajectories with known optimal actions and use them as supervision. Training and evaluation use disjoint board / maze sizes and seeds, so no training instance recurs at test time.

Qwen3.5-9B (held-out scale)	Match Score%↑	Match Resp./Score↓	Maze SR%↑	Maze GS%↑
base	0.0	—	0.0	1.5
+ `opt32k` (optimal-policy)	14.6	14.7	0.0	5.0
+ `rmix32k` (optimal + model rollouts)	29.5	6.8	10.0	16.3

Evaluation sizes are strictly larger than the training pool. Optimal rollouts teach the rules; adding 6K filtered model rollouts (rmix32k) more than halves response cost (14.7 → 6.8) and gives the first non-zero maze success. The same checkpoint also transfers outward, with EMemBench +5.2 and a memory / spatial suite +3.4 (group mean), while general multimodal capability is preserved (+0.5 group mean).
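
The held-out protocol (disjoint sizes and seeds, evaluation boards strictly larger than any training board) can be expressed as a simple check over instance descriptors. Everything in this sketch is hypothetical: the `Instance` record, the example sizes, and the seeds are placeholders, not the splits used for `opt32k` or `rmix32k`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical descriptor of one generated board or maze."""
    size: tuple[int, int]
    seed: int


def is_held_out(train, evaluation):
    """True if sizes and seeds are disjoint and every eval size exceeds all training sizes."""
    train_sizes = {inst.size for inst in train}
    train_seeds = {inst.seed for inst in train}
    largest_train = max(rows * cols for rows, cols in train_sizes)
    return all(
        inst.size not in train_sizes
        and inst.seed not in train_seeds
        and inst.size[0] * inst.size[1] > largest_train
        for inst in evaluation
    )


# Placeholder splits for illustration only.
train_pool = [Instance((4, 4), 0), Instance((6, 6), 1)]
eval_pool = [Instance((8, 8), 100), Instance((10, 10), 101)]

print(is_held_out(train_pool, eval_pool))  # True
```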

Cite

BibTeX

@article{rngbench2026,
  title   = {Beyond the Current Observation: Evaluating Multimodal
             Language Models in Non-Markov Games},
  author  = {Ding, Shengyuan and Wei, Xilin and Fang, Xinyu and Duan, Haodong
             and Lin, Dahua and Wang, Jiaqi and Zang, Yuhang},
  journal = {arXiv preprint arXiv:2606.19338},
  year    = {2026},
}
