
EndoCoT
Scaling Endogenous Chain-of-Thought
Reasoning in Diffusion Models

Steering the generation trajectory with implicit reasoning steps — unlocking controllable, high-fidelity image synthesis through latent thought.

Xuanlang Dai · Yujie Zhou · Long Xing · Jiazi Bu · Xilin Wei · Yuhong Liu · Beichen Zhang · Kai Chen · Yuhang Zang  |  Shanghai AI Lab

01 — Method
How reasoning steers diffusion
Task-Specific Accuracy — EndoCoT vs. DiffThinker
  Maze-8:        100%
  Maze-32:       90% vs. 65%
  Sudoku-45:     100%
  Sudoku-35:     95% vs. 55%
  VSP-Super-32:  85%
  Average (task-specific, all benchmarks): 92.1% (+8.3 pp over DiffThinker)
Generalization to Novel Domains
  DiffThinker: Std. Size ✓ · Novel Size ✗ · Novel Font ✗
  EndoCoT:     Std. Size ✓ · Novel Size ✓ · Novel Font ✓
Architecture Overview

Figure 1: EndoCoT architecture. The CoT module generates reasoning tokens that dynamically condition ...

Key Innovations

🧠 Genuine Endogenous CoT

The first diffusion framework to enable genuine chain-of-thought reasoning through iterative latent state refinement, bypassing standard single-pass solutions.

💭 Demystifying Diffusion Reasoning

Localizes the source of reasoning via layer-wise sensitivity and attention entropy analysis, identifying the key bottlenecks limiting prior methods.

🔥 Empirical Gains on Reasoning and Editing Tasks

Achieves 25-40% improvements on complex visual reasoning benchmarks, enabling controllable inference-time scaling and clearer image editing trajectories.
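The layer-wise attention-entropy analysis mentioned above can be sketched as follows. This is a minimal illustration under assumed tensor shapes, not the paper's exact analysis code; the diagnostic idea is simply that sharply focused attention has low Shannon entropy, so sweeping this statistic across layers localizes where reasoning concentrates.

```python
import numpy as np

def attention_entropy(attn: np.ndarray) -> float:
    """Mean Shannon entropy (nats) of attention distributions.

    attn: assumed shape (heads, queries, keys); each slice along the
    last axis is a probability distribution (softmax output).
    """
    eps = 1e-12  # numerical guard for log(0)
    row_entropy = -np.sum(attn * np.log(attn + eps), axis=-1)
    return float(row_entropy.mean())

# Toy check: a sharply peaked attention map has lower entropy than a
# uniform one — the signal a layer-wise sweep would use.
keys = 16
uniform = np.full((4, 8, keys), 1.0 / keys)
peaked = np.zeros((4, 8, keys))
peaked[..., 0] = 1.0
assert attention_entropy(peaked) < attention_entropy(uniform)
```

A uniform distribution over 16 keys gives entropy log(16) ≈ 2.77 nats, the maximum for that key count.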

Training Pipeline

1. Multimodal Encoding: the prompt and image are encoded into latent embeddings.
2. Supervised CoT Alignment: an MSE loss supervises the correctness of intermediate CoT representations.
3. Image Generation Training: the MMDiT is trained with a denoising / reconstruction loss.
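The two supervision signals above can be sketched as one combined objective. This is a minimal NumPy illustration, not the released implementation; the weighting term `lam` and all tensor shapes are assumptions.

```python
import numpy as np

def endocot_training_loss(cot_pred, cot_target, eps_pred, eps_true, lam=1.0):
    """Combined objective sketch: CoT alignment + denoising.

    cot_pred / cot_target: intermediate CoT latent representations.
    eps_pred / eps_true:   predicted vs. sampled diffusion noise.
    lam is a hypothetical weighting, not taken from the paper.
    """
    cot_loss = np.mean((cot_pred - cot_target) ** 2)    # supervised CoT alignment (MSE)
    denoise_loss = np.mean((eps_pred - eps_true) ** 2)  # standard denoising loss
    return denoise_loss + lam * cot_loss

# Usage: perfectly aligned CoT latents and exact noise prediction → zero loss.
z0 = np.zeros((2, 3))
assert endocot_training_loss(z0, z0, z0, z0) == 0.0
```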

Inference Pipeline

1. Iterative Reasoning: run VL reasoning for N steps in the latent space.
2. Final Answer / Image Generation: decode the final result from the last latent state.
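The two-stage inference above can be sketched as a loop over latent reasoning steps followed by a single decode. `reason_step` and `decode` are stand-ins for the model components, not the actual API; the toy fixed-point dynamics only illustrate why more reasoning steps yield a more converged latent (inference-time scaling).

```python
import numpy as np

def run_inference(z0, reason_step, decode, n_steps=5):
    """Iteratively refine the latent state, then decode once at the end."""
    z = z0
    for _ in range(n_steps):  # stage 1: N latent reasoning steps
        z = reason_step(z)
    return decode(z)          # stage 2: decode final answer / image

# Toy stand-ins: each reasoning step halves the distance to a fixed point,
# so additional steps monotonically improve convergence.
target = np.ones(8)
reason = lambda z: z + 0.5 * (target - z)
decoded = run_inference(np.zeros(8), reason, lambda z: z, n_steps=10)
assert np.allclose(decoded, target, atol=1e-2)
```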
Reasoning evolution

The timeline below traces how endogenous chain-of-thought unfolds, from the original question to the final image.

Step 0 (Original Question): 🔮 Initialization: starting from a Gaussian noise latent z_T ~ N(0, I). No structure yet; pure randomness across all channels.
Step 2: 🧠 CoT Reasoning: "Prompt mentions 'landscape' → detect horizon line as primary compositional anchor. Allocate attention to lower 60% for ground plane."
Step 5: ☀️ CoT Reasoning: "Prompt specifies 'dawn' → position sun in upper-right quadrant, warm color bias (orange/yellow) for directional light source."
Step 6: 🌿 CoT Reasoning: "Foreground needs detail anchors → add vegetation clusters. Asymmetric placement for natural composition. Darker values = closer."
Step 20 (Final Answer): Final Refinement: converged to high-frequency details. All CoT constraints satisfied: compositional balance ✓, prompt adherence ✓, photorealistic lighting ✓.
Full Benchmark Comparison

Three settings: zero-shot (no fine-tuning), task-specific (separate model per task), and unified training (single model across all tasks). EndoCoT achieves state-of-the-art in both supervised settings.

Method                  Maze           TSP            Sudoku          VSP-Super    Avg
                        8    16   32   12   15   18   45   40   35    16    32

Zero-Shot Baselines
ThinkGen                0    0    0    0    0    0    44   4    1     10    0     5.1
ChronoEdit              11   2    6    0    0    0    60   2    0     12    11    8.8
Qwen3-VL-8B             1    0    0    0    0    0    64   46   3     3     10    11.1
Qwen-Image-Edit-2511    0    0    0    0    0    0    50   55   44    0     0     11.7

Task-Specific Training
Qwen3-VL-8B (SFT)       53   37   0    59   60   43   99   96   98    61    8     58.6
DiffThinker             100  100  65   76   72   59   100  100  55    99    80    83.8
EndoCoT (Ours)          100  100  90   77   77   73   100  100  95    99    85    92.1

Unified Training (single model, all tasks simultaneously)
DiffThinker             98   99   66   64   49   34   100  99   99    97    84    77.1
EndoCoT (Ours)          97   98   52   64   55   46   100  88   80    100   80    84.2