
EndoCoT
Scaling Endogenous Chain-of-Thought
Reasoning in Diffusion Models

Steering the generation trajectory with implicit reasoning steps — unlocking controllable, high-fidelity image synthesis through latent thought.

Xuanlang Dai · Yujie Zhou · Long Xing · Jiazi Bu · Xilin Wei · Yuhong Liu · Beichen Zhang · Kai Chen · Yuhang Zang  |  Shanghai AI Lab

01 — Method
How reasoning steers diffusion
Task-Specific Accuracy — EndoCoT vs. DiffThinker
  Maze-8:        100%
  Maze-32:       90% vs. 65%
  Sudoku-45:     100%
  Sudoku-35:     95% vs. 55%
  VSP-Super-32:  85%
  Average (task-specific, all benchmarks): 92.1% (+8.3 pp over DiffThinker)
Generalization to Novel Domains
  DiffThinker: Std. Size ✓ · Novel Size ✗ · Novel Font ✗
  EndoCoT:     Std. Size ✓ · Novel Size ✓ · Novel Font ✓
Architecture Overview

Figure 1: EndoCoT architecture. The CoT module generates reasoning tokens that dynamically condition ...

Key Innovations

🧠 Genuine Endogenous CoT

The first diffusion framework to enable genuine chain-of-thought reasoning through iterative latent state refinement, bypassing standard single-pass solutions.

💭 Demystifying Diffusion Reasoning

Localizes the source of reasoning via layer-wise sensitivity and attention entropy analysis, identifying the key bottlenecks limiting prior methods.

🔥 Empirical Gains on Reasoning and Editing Tasks

Achieves 25-40% improvements on complex visual reasoning benchmarks, enabling controllable inference-time scaling and clearer image editing trajectories.
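The layer-wise attention-entropy analysis mentioned above can be sketched as follows. This is a minimal illustration under assumed tensor shapes, not the paper's exact analysis code; the diagnostic idea is simply that sharply focused attention has low Shannon entropy, so sweeping this statistic across layers localizes where reasoning concentrates.

```python
import numpy as np

def attention_entropy(attn: np.ndarray) -> float:
    """Mean Shannon entropy (nats) of attention distributions.

    attn: assumed shape (heads, queries, keys); each slice along the
    last axis is a probability distribution (softmax output).
    """
    eps = 1e-12  # numerical guard for log(0)
    row_entropy = -np.sum(attn * np.log(attn + eps), axis=-1)
    return float(row_entropy.mean())

# Toy check: a sharply peaked attention map has lower entropy than a
# uniform one — the signal a layer-wise sweep would use.
keys = 16
uniform = np.full((4, 8, keys), 1.0 / keys)
peaked = np.zeros((4, 8, keys))
peaked[..., 0] = 1.0
assert attention_entropy(peaked) < attention_entropy(uniform)
```

A uniform distribution over 16 keys gives entropy log(16) ≈ 2.77 nats, the maximum for that key count.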

Training Pipeline

1. Multimodal Encoding: the prompt and image are encoded into latent embeddings.
2. Supervised CoT Alignment: an MSE loss supervises the correctness of intermediate CoT representations.
3. Image Generation Training: the MMDiT is trained with a denoising / reconstruction loss.
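The two supervision signals above can be sketched as one combined objective. This is a minimal NumPy illustration, not the released implementation; the weighting term `lam` and all tensor shapes are assumptions.

```python
import numpy as np

def endocot_training_loss(cot_pred, cot_target, eps_pred, eps_true, lam=1.0):
    """Combined objective sketch: CoT alignment + denoising.

    cot_pred / cot_target: intermediate CoT latent representations.
    eps_pred / eps_true:   predicted vs. sampled diffusion noise.
    lam is a hypothetical weighting, not taken from the paper.
    """
    cot_loss = np.mean((cot_pred - cot_target) ** 2)    # supervised CoT alignment (MSE)
    denoise_loss = np.mean((eps_pred - eps_true) ** 2)  # standard denoising loss
    return denoise_loss + lam * cot_loss

# Usage: perfectly aligned CoT latents and exact noise prediction → zero loss.
z0 = np.zeros((2, 3))
assert endocot_training_loss(z0, z0, z0, z0) == 0.0
```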

Inference Pipeline

1. Iterative Reasoning: run VL reasoning for N steps in the latent space.
2. Final Answer / Image Generation: decode the final result from the last latent state.
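The two-stage inference above can be sketched as a loop over latent reasoning steps followed by a single decode. `reason_step` and `decode` are stand-ins for the model components, not the actual API; the toy fixed-point dynamics only illustrate why more reasoning steps yield a more converged latent (inference-time scaling).

```python
import numpy as np

def run_inference(z0, reason_step, decode, n_steps=5):
    """Iteratively refine the latent state, then decode once at the end."""
    z = z0
    for _ in range(n_steps):  # stage 1: N latent reasoning steps
        z = reason_step(z)
    return decode(z)          # stage 2: decode final answer / image

# Toy stand-ins: each reasoning step halves the distance to a fixed point,
# so additional steps monotonically improve convergence.
target = np.ones(8)
reason = lambda z: z + 0.5 * (target - z)
decoded = run_inference(np.zeros(8), reason, lambda z: z, n_steps=10)
assert np.allclose(decoded, target, atol=1e-2)
```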
Reasoning evolution

The timeline below traces how endogenous chain-of-thought unfolds, from the original question to the final image.

Step 0 (Original Question): 🔮 Initialization: starting from a Gaussian noise latent z_T ~ N(0, I). No structure yet; pure randomness across all channels.
Step 2: 🧠 CoT Reasoning: "Prompt mentions 'landscape' → detect horizon line as primary compositional anchor. Allocate attention to lower 60% for ground plane."
Step 5: ☀️ CoT Reasoning: "Prompt specifies 'dawn' → position sun in upper-right quadrant, warm color bias (orange/yellow) for directional light source."
Step 6: 🌿 CoT Reasoning: "Foreground needs detail anchors → add vegetation clusters. Asymmetric placement for natural composition. Darker values = closer."
Step 20 (Final Answer): Final Refinement: converged to high-frequency details. All CoT constraints satisfied: compositional balance ✓, prompt adherence ✓, photorealistic lighting ✓.
Full Benchmark Comparison

Three settings: zero-shot (no fine-tuning), task-specific (separate model per task), and unified training (single model across all tasks). EndoCoT achieves state-of-the-art in both supervised settings.

Method                  Maze           TSP            Sudoku          VSP-Super    Avg
                        8    16   32   12   15   18   45   40   35    16    32

Zero-Shot Baselines
ThinkGen                0    0    0    0    0    0    44   4    1     10    0     5.1
ChronoEdit              11   2    6    0    0    0    60   2    0     12    11    8.8
Qwen3-VL-8B             1    0    0    0    0    0    64   46   3     3     10    11.1
Qwen-Image-Edit-2511    0    0    0    0    0    0    50   55   44    0     0     11.7

Task-Specific Training
Qwen3-VL-8B (SFT)       53   37   0    59   60   43   99   96   98    61    8     58.6
DiffThinker             100  100  65   76   72   59   100  100  55    99    80    83.8
EndoCoT (Ours)          100  100  90   77   77   73   100  100  95    99    85    92.1

Unified Training (single model, all tasks simultaneously)
DiffThinker             98   99   66   64   49   34   100  99   99    97    84    77.1
EndoCoT (Ours)          97   98   52   64   55   46   100  88   80    100   80    84.2