SketchVLM Enables ChatGPT and Gemini to Draw Visual Explanations on Complex Interfaces
Key Takeaways
- SketchVLM enables VLMs to produce editable SVG overlays on images instead of text-only explanations
- Framework improves visual reasoning accuracy by up to 28.5 points across benchmark tasks
- Training-free approach works with existing VLMs without requiring expensive retraining
Summary
Researchers have introduced SketchVLM, a training-free, model-agnostic framework that enables vision-language models (VLMs) such as ChatGPT and Google's Gemini to produce editable SVG overlays on images to visually explain their reasoning. Instead of responding with text alone, models using SketchVLM can draw annotations such as labels, lines, and shapes directly on input images, making their explanations more intuitive and easier for users to verify.
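To make the mechanism concrete, here is a minimal Python sketch of the overlay idea: the original image and the model's annotations live together in one SVG document, so the shapes stay separate from the pixels and remain editable. The template, helper name, and coordinates below are illustrative assumptions, not SketchVLM's actual output format or API.

```python
# Minimal sketch: a VLM's drawing expressed as SVG elements layered over the
# original image. The pixels are never modified; each shape stays editable.
# All names and coordinates here are hypothetical, for illustration only.

SVG_TEMPLATE = """<svg xmlns="http://www.w3.org/2000/svg" width="{w}" height="{h}">
  <image href="{image_uri}" width="{w}" height="{h}"/>
  {annotations}
</svg>"""

def wrap_overlay(image_uri: str, w: int, h: int, annotations: str) -> str:
    """Embed the source image and model-emitted shapes in one editable SVG."""
    return SVG_TEMPLATE.format(image_uri=image_uri, w=w, h=h, annotations=annotations)

# Annotations a model might emit for a "point at the Save button" request.
shapes = (
    '<circle cx="412" cy="88" r="24" fill="none" stroke="red" stroke-width="3"/>'
    '<text x="445" y="94" fill="red" font-size="18">Save button</text>'
)
print(wrap_overlay("screenshot.png", 800, 600, shapes))
```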
The framework was evaluated across six benchmarks covering visual reasoning tasks (maze navigation, trajectory prediction, object counting) and drawing tasks (part labeling, connect-the-dots, shape drawing). SketchVLM delivered significant gains, improving visual reasoning accuracy by up to 28.5 points and sketch quality by up to 48.3% over existing baselines. The overlays are non-destructive and editable, enabling iterative refinement through multi-turn human-AI interaction.
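Because the annotations are SVG elements rather than pixels burned into the image, a follow-up turn can adjust them programmatically. A short illustrative sketch (the coordinates and the imagined "nudge it right" correction are assumptions):

```python
# Edit an existing overlay non-destructively: shift the model's circle 15px
# right in response to a hypothetical user correction, leaving the rest intact.
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # serialize without an "ns0:" prefix

overlay = f'''<svg xmlns="{SVG_NS}" width="800" height="600">
  <circle cx="412" cy="88" r="24" fill="none" stroke="red" stroke-width="3"/>
</svg>'''

root = ET.fromstring(overlay)
circle = root.find(f".//{{{SVG_NS}}}circle")
circle.set("cx", str(int(circle.get("cx")) + 15))
print(ET.tostring(root, encoding="unicode"))
```

In a multi-turn session, the edited SVG would simply become the new overlay state; how SketchVLM itself manages that loop is not detailed here.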
The framework addresses a fundamental limitation of current VLMs: their inability to show visually where they are focusing or how they are reasoning when analyzing an image. By letting models explain themselves through drawing, SketchVLM enhances user trust and understanding, particularly for spatial reasoning, software navigation, and verification tasks.
Editorial Opinion
SketchVLM represents a meaningful step toward more transparent and verifiable AI reasoning. By enabling VLMs to visually demonstrate their analysis rather than relying solely on opaque text descriptions, it addresses a critical usability gap in AI adoption. The training-free, model-agnostic approach is particularly elegant, working with existing models without expensive retraining. Real-world impact will ultimately depend on seamless integration into ChatGPT and Gemini, and whether users find visual annotations genuinely more trustworthy than text alone.