Cloning Bench: A New Benchmark for Evaluating AI Agents on Visual Website Cloning
Key Takeaways
- Cloning Bench establishes a standardized evaluation framework for measuring AI agent capability in visual website replication, with performance metrics based on structural similarity scoring
- The benchmark supports multiple leading AI models (Claude, Codex, Gemini, GLM) in reproducible containerized environments, enabling fair comparative analysis across different AI systems
- Agents must understand complex web design through DOM structure, accessibility trees, and CSS styling to rebuild interfaces as proper React components rather than simply copying reference materials
Summary
Researchers have introduced Cloning Bench, a novel benchmark designed to evaluate how well autonomous AI agents can replicate the visual design of real websites. The benchmark tasks agents with analyzing a reference recording of a website (starting with Slack) and building a React front-end that matches it visually. Agents operate in isolated Docker containers with access to browser automation, visual testing tools, and reference materials, with performance measured using SSIM (Structural Similarity Index) scores against original screenshots over a 6-hour evaluation period.
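To make the scoring metric concrete: SSIM compares luminance, contrast, and structure between two images. The benchmark's exact implementation is not specified here; the sketch below is a simplified single-window SSIM in plain NumPy (production scorers such as scikit-image's `structural_similarity` use a sliding window).

```python
import numpy as np

def ssim(a, b, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Global (single-window) SSIM over two grayscale images.

    Simplified illustration only -- real scorers compute SSIM over
    local windows and average the resulting map.
    """
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )

img = np.random.default_rng(0).integers(0, 256, (64, 64))
print(ssim(img, img))        # identical images score ~1.0
print(ssim(img, 255 - img))  # inverted image scores much lower
```

An agent maximizing this score is rewarded for matching layout and contrast structure, not just average color, which is why pixel-perfect copying of screenshots is unnecessary but structural fidelity is.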
The benchmark framework supports multiple AI models including Claude (Anthropic), Codex (OpenAI), Gemini (Google), and GLM, running in containerized environments with Node.js, Python, Chromium, and specialized browser automation tools. Each agent enters an iterative test-fix loop where it studies reference DOM snapshots and accessibility trees, builds React components, captures screenshots, analyzes visual diffs, and iteratively improves the clone to maximize SSIM scores.
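The test-fix loop described above can be sketched as a simple control flow. All names below are hypothetical stand-ins: `render_and_score` simulates what would really be a rebuild of the React app, a headless-Chromium screenshot, and an SSIM comparison, so the skeleton can run end-to-end.

```python
def render_and_score(iteration):
    # Stub for: rebuild the React app, capture a screenshot via headless
    # Chromium, and compute SSIM against the reference. Here we simply
    # simulate a score that improves with each iteration.
    return min(1.0, 0.5 + 0.06 * iteration)

def test_fix_loop(target=0.95, max_iters=20):
    """Iterate until the visual similarity score reaches the target."""
    best = 0.0
    for i in range(max_iters):
        score = render_and_score(i)
        best = max(best, score)
        if best >= target:
            break
        # In the real loop the agent would inspect the visual diff here
        # and edit its React components before the next iteration.
    return best

print(test_fix_loop())
```

The key design point is that scoring and fixing are decoupled: the agent treats the SSIM score as an opaque objective and uses diff analysis only to decide what to change next.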
Cloning Bench provides comprehensive reference data including video recordings, screenshots, full HTML snapshots, accessibility trees, computed CSS values, and deduplicated assets. The benchmark includes two key testing tools: site-test for visual compliance testing and lookatdiff for LLM-powered diff analysis, enabling standardized evaluation of how well different AI agents can understand and replicate complex web UI designs.
The open benchmark infrastructure, with its visual testing tools and detailed reference datasets, gives the research community a reproducible evaluation methodology for autonomous web development agents.
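The internals of `lookatdiff` are not documented here, but the pixel-comparison stage that would feed an LLM-powered analysis can be approximated generically: flag the image regions where the candidate screenshot diverges from the reference. The tile-based diff below is an illustrative assumption, not the tool's actual algorithm.

```python
import numpy as np

def diff_regions(a, b, threshold=16, cell=8):
    """Return (row, col) indices of cell-by-cell tiles whose mean absolute
    pixel difference exceeds the threshold -- a crude visual-diff report."""
    d = np.abs(a.astype(np.int16) - b.astype(np.int16))
    h, w = d.shape[0] // cell, d.shape[1] // cell
    tiles = d[: h * cell, : w * cell].reshape(h, cell, w, cell).mean(axis=(1, 3))
    ys, xs = np.nonzero(tiles > threshold)
    return list(zip(ys.tolist(), xs.tolist()))

a = np.zeros((32, 32), dtype=np.uint8)
b = a.copy()
b[0:8, 0:8] = 255          # change exactly one 8x8 tile
print(diff_regions(a, b))  # → [(0, 0)]
```

Localizing differences to regions rather than raw pixels is what makes the diff actionable: the agent can map a flagged tile back to the component that renders that area of the page.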
Editorial Opinion
Cloning Bench represents an important step toward standardized evaluation of autonomous AI agents on real-world web development tasks. By focusing on visual fidelity and requiring agents to understand and recreate proper React components rather than copying reference materials, the benchmark addresses practical challenges in AI-assisted development. This work could help identify which AI models excel at understanding visual design systems and translating them into maintainable code—a critical capability as AI coding assistants become more sophisticated.

