Benchmark Reveals Claude Opus and Gemini 3.1 Pro Excel at Advanced Language Proofreading

Key Takeaways

▸Claude Opus 4.6 achieved 100% accuracy on lexical consistency detection across multiple instruction-mode and temperature settings
▸Smaller models like Claude Fable 5 still perform competitively (93-100%), indicating efficiency gains don't require sacrificing language quality
▸Model configuration (temperature, instruction mode) significantly impacts performance on subtle language tasks

Source: Hacker News, https://revise.io/errata-bench

Summary

A benchmark evaluation found that frontier language models, particularly Anthropic's Claude Opus 4.6 and Google's Gemini 3.1 Pro Preview, perform exceptionally well on advanced proofreading and language consistency tasks. The study tested multiple model variants on 'lexical choice and confusability idiomaticity' detection, which assesses a model's ability to catch subtle language inconsistencies, commonly confused words, and unidiomatic expressions. Claude Opus 4.6 achieved 100% accuracy across multiple configuration variants, while Claude Fable 5 (Anthropic's more efficient model) scored 93-100%. Google's Gemini 3.1 Pro Preview also achieved perfect scores on this task. The results suggest that modern LLMs are approaching professional editorial standards in linguistic sophistication.

Multiple frontier models now exceed human-level performance on specialized language proofreading benchmarks

Editorial Opinion

These benchmark results underscore a critical inflection point: frontier LLMs are now reliable enough to serve as primary proofreading tools for many use cases. The fact that Claude Opus 4.6 achieved perfect accuracy on lexical and idiomatic consistency suggests AI-assisted editing workflows could handle demanding editorial work. However, real-world proofreading requires broader context—domain knowledge, style guides, and authorial intent—dimensions this benchmark doesn't capture. The results are impressive but shouldn't obscure the reality that human editors remain essential for nuanced, context-aware content refinement.

RESEARCH | Anthropic | 2026-06-11
