Law Professors Find AI Tutors Dramatically Outperform Peer Answers in Legal Education
Key Takeaways
- LLM answers won 75.33% of blinded comparisons against peer answers in a legal education evaluation
- AI responses were flagged as harmful far less often than peer answers (3.53% vs. 12.06%)
- LLM tutors performed comparably to the best human instructors in the study
Summary
A landmark study conducted by 16 U.S. law professors from Stanford Law School and partner institutions has found that large language models significantly outperform human peer tutoring in legal education. In a blinded evaluation focused on contracts courses, the professors wrote 40 representative questions, provided model answers, and judged 2,918 anonymized comparisons between LLM responses and peer answers. The results decisively favored AI: LLMs achieved an average win rate of 75.33% against peer answers, performing at a level comparable to the best human instructors in the study.
Beyond raw performance, the research found that LLM responses were flagged as harmful or problematic in only 3.53% of cases, compared with 12.06% for peer-provided answers, suggesting that AI tutors produce more consistent, appropriate guidance. The professors' preferences were consistent across evaluators, indicating that the advantage reflected shared professional standards rather than individual bias.
Crucially, the researchers demonstrated that expert preferences could be scaled using AI-as-judge approaches, making it practical to evaluate new models without repeated expert review. This methodology could extend to other judgment-heavy domains beyond law, where a single ground truth doesn't exist but professional expertise can reliably assess quality.
- AI-as-judge methodology enables scalable evaluation across multiple models without repeated expert review
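To make the two headline quantities concrete, the following Python sketch shows how a win rate and a judge-versus-expert agreement rate could be computed from pairwise judgments. It is an illustration under stated assumptions, not the researchers' method: the `Comparison` record, its field names, and the toy data are all hypothetical and do not come from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    """One blinded pairwise judgment (hypothetical record layout)."""
    question_id: int
    expert_prefers_llm: bool  # label assigned by a human expert
    judge_prefers_llm: bool   # label assigned by an automated LLM judge


def win_rate(labels: list[bool]) -> float:
    """Fraction of comparisons in which the LLM answer was preferred."""
    if not labels:
        raise ValueError("need at least one comparison")
    return sum(labels) / len(labels)


def agreement(comparisons: list[Comparison]) -> float:
    """Fraction of comparisons where the automated judge matches the expert."""
    if not comparisons:
        raise ValueError("need at least one comparison")
    matches = sum(
        c.expert_prefers_llm == c.judge_prefers_llm for c in comparisons
    )
    return matches / len(comparisons)


# Toy data invented purely for illustration; these are NOT study results.
toy = [
    Comparison(1, True, True),
    Comparison(2, True, True),
    Comparison(3, False, True),
    Comparison(4, True, True),
]

expert_rate = win_rate([c.expert_prefers_llm for c in toy])
judge_rate = win_rate([c.judge_prefers_llm for c in toy])
print(expert_rate, judge_rate, agreement(toy))  # prints: 0.75 1.0 0.75
```

In a setup like this, a high agreement rate on comparisons the experts have already labeled is what would justify letting the automated judge score new models without a fresh round of expert review.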
Editorial Opinion
This research challenges the assumption that AI excels only in narrow, fact-based domains. The finding that LLMs outperform human peers at legal reasoning, a domain requiring nuance, judgment, and argumentation, suggests AI tutoring could address real gaps in access to and consistency of professional education. However, the study raises equally important questions: Should human expertise be supplemented or supplanted by AI? And what happens to the educational value of peer-to-peer learning if AI becomes the default tutor?



