Steering the generation trajectory with implicit reasoning steps — unlocking controllable, high-fidelity image synthesis through latent thought.
Figure 1: EndoCoT architecture. The CoT module generates reasoning tokens that dynamically condition ...
The first diffusion framework to enable genuine chain-of-thought reasoning through iterative latent state refinement, bypassing standard single-pass solutions.
Localizes the source of reasoning via layer-wise sensitivity and attention entropy analysis, identifying the key bottlenecks limiting prior methods.
Achieves 25-40% improvements on complex visual reasoning benchmarks, enabling controllable inference-time scaling and clearer image editing trajectories.
Drag the timeline to explore how endogenous chain-of-thought works — from original to the final image.
Three settings: zero-shot (no fine-tuning), task-specific (separate model per task), and unified training (single model across all tasks). EndoCoT achieves state-of-the-art in both supervised settings.
| Method | Maze | TSP | Sudoku | VSP-Super | Avg | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 16 | 32 | 12 | 15 | 18 | 45 | 40 | 35 | 16 | 32 | ||
| ▼ Zero-Shot Baselines | ||||||||||||
| ThinkGen | 0 | 0 | 0 | 0 | 0 | 0 | 44 | 4 | 11 | 0 | 0 | 5.1 |
| ChronoEdit | 1 | 1 | 26 | 0 | 0 | 0 | 60 | 20 | 12 | 1 | 1 | 8.8 |
| Qwen3-VL-8B | 1 | 0 | 0 | 0 | 0 | 0 | 64 | 46 | 33 | 1 | 0 | 11.1 |
| Qwen-Image-Edit-2511 | 0 | 0 | 0 | 0 | 0 | 0 | 50 | 55 | 44 | 0 | 0 | 11.7 |
| ▼ Task-Specific Training | ||||||||||||
| Qwen3-VL-8B (SFT) | 53 | 37 | 0 | 59 | 60 | 43 | 99 | 96 | 98 | 61 | 8 | 58.6 |
| DiffThinker | 100 | 100 | 65 | 76 | 72 | 59 | 100 | 100 | 55 | 99 | 80 | 83.8 |
| EndoCoT (Ours) | 100 | 100 | 90 | 77 | 77 | 73 | 100 | 100 | 95 | 99 | 85 | 92.1 |
| ▼ Unified Training — single model, all tasks simultaneously | ||||||||||||
| DiffThinker | 98 | 99 | 66 | 64 | 49 | 34 | 100 | 99 | 99 | 97 | 84 | 77.1 |
| EndoCoT (Ours) | 97 | 98 | 52 | 64 | 55 | 46 | 100 | 88 | 80 | 100 | 80 | 84.2 |