Researcher Tests AI Models' Ability to Interact With Touchscreens—And Finds Significant Gaps
Key Takeaways
- AI models struggle to reliably interact with smartphone interfaces using only human-available actions, despite initially promising success rates
- An older fine-tuned model (GPT-5.3-Codex) outperformed a newer general-purpose model (GPT-5.4) on touchscreen tasks, suggesting potential tradeoffs in model design
- Rigorous testing methodology, with deterministic verification and replay tools, is critical to avoid publishing misleading performance metrics
Summary
A researcher conducted extensive testing on AI models' ability to interact with smartphone touchscreens using only human-like actions—tapping, swiping, long-pressing, and reading the screen. Across 1,700+ runs on four basic Android tasks, the evaluation revealed that current large language models struggle with reliable touchscreen interaction, despite what initial success rates might suggest. The study tested models from OpenAI (GPT-5.4, GPT-5.3-Codex, GPT-5.1-Codex), Google (Gemini variants), and Anthropic (Opus and Sonnet), constraining them to only the actions available to a human user without special system privileges or accessibility APIs.
A key finding emerged when comparing models: OpenAI's GPT-5.3-Codex, which is fine-tuned for agentic tool use, outperformed the more general-purpose GPT-5.4 on three out of four tasks—suggesting either a regression or a deliberate performance tradeoff. The researcher emphasizes that the most valuable aspect of this work is the methodology, not just the results, highlighting three critical principles: tools for visibility are essential, deterministic verification prevents flawed data, and flaky tests undermine confidence in results. This research underscores fundamental limitations in spatial reasoning, multi-step planning, and robustness to visual variation in current AI agents.
The evaluation also showed that current AI agents lack robust spatial reasoning and error recovery: they confidently complete incorrect actions without backtracking.
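The three methodology principles can be made concrete with a minimal sketch. This is not the researcher's actual harness; `FakeDevice`, `run_task`, and `pass_rate` are hypothetical names, and the device is simulated rather than a real Android phone. The key ideas it illustrates are: every action is recorded for replay (visibility), success is checked against concrete UI state rather than the model's own claims (deterministic verification), and the task is repeated many times so flakiness shows up as a pass rate instead of a lucky single run.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDevice:
    """Stand-in for a real device: tracks a simple UI state and a replay log."""
    screen: str = "home"
    log: list = field(default_factory=list)

    def tap(self, target: str) -> None:
        # Visibility: record every action so any run can be replayed later.
        self.log.append(("tap", target))
        if self.screen == "home" and target == "settings_icon":
            self.screen = "settings"

def run_task(device: FakeDevice, actions, expected_screen: str) -> bool:
    """Execute the actions, then verify the final UI state deterministically.

    The check inspects concrete state, not the agent's self-report, so a
    confidently wrong agent cannot mark its own run as a success.
    """
    for kind, target in actions:
        getattr(device, kind)(target)
    return device.screen == expected_screen

def pass_rate(n_runs: int, actions, expected_screen: str) -> float:
    """Repeat the task on a fresh device each time to surface flakiness."""
    wins = sum(
        run_task(FakeDevice(), actions, expected_screen)
        for _ in range(n_runs)
    )
    return wins / n_runs
```

A single run answers "did it work once?"; `pass_rate(100, ...)` answers the question the researcher cares about, which is whether the agent works reliably.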
Editorial Opinion
This research highlights a crucial blind spot in AI agent development: models can sound confident while executing completely wrong actions. The finding that older, specialized models outperformed newer general-purpose ones is particularly significant, suggesting that the current approach to building general AI agents may sacrifice practical reliability for breadth. More importantly, the methodology—emphasizing deterministic testing, tool visibility, and stability over single-run success—should become the standard for evaluating AI systems in real-world applications where hallucinations and silent failures are unacceptable.



