Chicago Booth Researchers Develop Framework for Evaluating AI Detection Tools—Most Commercial Detectors Show Promise
Key Takeaways
- Commercial AI detectors (GPTZero, Originality.ai, and Pangram) perform reasonably well on medium-length and long text but lose accuracy below 50 words
- Pangram achieved near-perfect accuracy in distinguishing AI from human writing, while the open-source RoBERTa model performed substantially worse
- Performance varies based on the LLM used to generate the AI text, meaning detectors must be tested across multiple AI models to be considered reliable
Summary
Researchers from the University of Chicago Booth School of Business have developed a policy framework for evaluating AI detection tools, testing three commercial detectors (GPTZero, Originality.ai, and Pangram) and one open-source model (RoBERTa) on their ability to distinguish human-written from AI-generated text. The study analyzed approximately 2,000 human-written passages across six media: blogs, reviews, news articles, novels, restaurant reviews, and résumés. The researchers then tested the detectors' ability to identify AI-generated versions of the same content.
The researchers found significant variation in detector performance depending on text length, the underlying language model used to generate the AI text, and the decision threshold applied. All three commercial tools demonstrated reasonably strong accuracy on medium-length and long passages (200+ words), but performance dropped sharply on very short texts under 50 words. Pangram achieved the best overall performance, while RoBERTa performed poorly, often no better than random guessing, leading the researchers to conclude it is unsuitable for high-stakes applications.
By adjusting certainty thresholds, the researchers created a data-driven methodology for institutions to implement these tools according to their own risk tolerance. Organizations more concerned with false positives (incorrectly flagging human work as AI) can set higher thresholds, while those prioritizing detection can lower them. This framework provides practical guidance for schools, employers, and other institutions considering AI detection as part of their verification protocols.
- Chicago Booth's framework allows institutions to adjust decision thresholds based on their tolerance for false positives vs. false negatives
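The threshold tradeoff described above can be sketched in a few lines of code. This is an illustrative example only, not the study's methodology: the detector scores, function names, and cutoff values below are all hypothetical, chosen to show how raising or lowering a decision threshold trades false positives against false negatives.

```python
# Hypothetical sketch: how a decision threshold on a detector's
# AI-probability score trades false positives for false negatives.
# All scores and thresholds here are illustrative, not from the study.

def error_rates(scores, labels, threshold):
    """Compute (false-positive rate, false-negative rate).

    scores:  detector's AI-probability for each passage (0.0 to 1.0)
    labels:  True if the passage is actually AI-generated, False if human
    A passage is flagged as AI when its score meets the threshold.
    """
    flagged = [s >= threshold for s in scores]
    fp = sum(1 for f, ai in zip(flagged, labels) if f and not ai)
    fn = sum(1 for f, ai in zip(flagged, labels) if not f and ai)
    humans = labels.count(False)
    ais = labels.count(True)
    return fp / humans, fn / ais

# Toy data: four human-written and four AI-generated passages.
scores = [0.10, 0.35, 0.55, 0.70, 0.60, 0.80, 0.90, 0.97]
labels = [False, False, False, False, True, True, True, True]

# A strict threshold avoids flagging human work but misses some AI text...
fpr_strict, fnr_strict = error_rates(scores, labels, threshold=0.75)
# ...while a looser threshold catches more AI text at the cost of
# incorrectly flagging some human-written passages.
fpr_loose, fnr_loose = error_rates(scores, labels, threshold=0.50)
```

On this toy data, the strict threshold flags no human passages but misses one AI passage, while the loose threshold catches every AI passage but flags two human ones: exactly the risk-tolerance dial the framework lets institutions turn.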
Editorial Opinion
This research provides valuable guidance for institutions considering AI detection tools, but the sharp drop in accuracy for short texts remains a significant limitation. Real-world applications like academic integrity checking and professional communication often involve brief responses where these tools currently underperform. The strong showing by commercial tools suggests the AI detection industry has matured beyond early hype, though institutions should carefully match tool capabilities to their specific use cases rather than treating any detector as a silver bullet.