Research Shows LLMs Grade Essays Differently Than Humans, With Significant Alignment Gaps
Key Takeaways
- LLMs show only weak agreement with human essay grades, with scoring patterns that differ significantly from those of human raters
- LLMs exhibit systematic biases: they overrate short essays and underrate longer essays with minor errors, the opposite of human grading behavior
- LLM grades are internally consistent with the feedback the models provide, but rely on different evaluation signals than human raters use
Summary
A new research paper submitted to arXiv reveals that large language models do not grade essays the way humans do, despite being increasingly proposed as tools for automated essay scoring. The study, which evaluated models from the GPT and Llama families out of the box, without task-specific training, found that LLM-generated scores show only weak agreement with human grades and vary systematically with essay characteristics.
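Rater alignment in automated essay scoring is commonly quantified with quadratic weighted kappa (QWK), a chance-corrected agreement statistic for ordinal scales; this summary does not state the paper's exact metric, so what follows is a minimal, self-contained sketch with illustrative scores rather than data from the study.

```python
import numpy as np

def quadratic_weighted_kappa(human, llm, min_score=1, max_score=6):
    """Chance-corrected agreement between two raters on an ordinal scale.

    1.0 means perfect agreement; 0.0 means chance-level agreement.
    """
    human = np.asarray(human)
    llm = np.asarray(llm)
    n = max_score - min_score + 1

    # Observed joint distribution of (human, llm) score pairs.
    observed = np.zeros((n, n))
    for h, l in zip(human, llm):
        observed[h - min_score, l - min_score] += 1
    observed /= observed.sum()

    # Expected joint distribution under rater independence
    # (outer product of the two marginal score distributions).
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))

    # Quadratic penalty: disagreement cost grows with squared score distance.
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Illustrative scores on a 1-6 rubric (not data from the paper).
human_scores = [4, 5, 3, 6, 2, 4, 5, 3]
llm_scores   = [5, 4, 4, 4, 3, 5, 4, 4]
print(f"QWK: {quadratic_weighted_kappa(human_scores, llm_scores):.3f}")
```

The quadratic weighting penalizes large disagreements far more than off-by-one errors, which is why QWK suits rubric-based essay scores better than raw percent agreement.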
The research identified specific biases in LLM grading behavior: the models tend to assign higher scores to short or underdeveloped essays while giving lower scores to longer essays with minor grammatical or spelling errors, the opposite of typical human grading preferences. Notably, the scores LLMs generate are consistent with the feedback they provide, suggesting their grading follows a coherent internal logic. That logic, however, relies on different signals than those used by human raters, which limits alignment with human grading practices.
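One simple way such a bias can surface in an evaluation is by correlating the score residual (LLM score minus human score) with essay length: a negative correlation would match the pattern described above, with longer essays scored lower relative to human grades. A minimal sketch of that check, assuming per-essay word counts are available (the numbers below are illustrative, not from the paper):

```python
import numpy as np
from scipy import stats

def length_bias(human_scores, llm_scores, word_counts):
    """Correlate the LLM-vs-human score gap with essay length.

    A negative correlation means the LLM scores longer essays lower
    relative to humans, consistent with the bias reported above.
    """
    residual = np.asarray(llm_scores) - np.asarray(human_scores)
    r, p = stats.pearsonr(word_counts, residual)
    return r, p

# Illustrative data (not from the paper): short essays get a positive
# residual (LLM overrates), long ones a negative residual (LLM underrates).
words = [120, 150, 300, 450, 600, 700]
human = [2, 3, 4, 5, 5, 6]
llm   = [3, 4, 4, 4, 4, 5]
r, p = length_bias(human, llm, words)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```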
Despite these limitations, the authors suggest that LLMs can still play a reliable supporting role in essay scoring systems, particularly given the consistency between their grades and explanatory feedback. While not replacements for human graders, LLMs can assist essay scoring when their limitations are understood; one possible form of that support is sketched below. The findings highlight both the promise and the pitfalls of using large language models as automated grading tools in educational settings.
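In practice, such a supporting role could take the form of triage: accept the model's score only where its known biases are unlikely to apply, and route everything else to a human rater. The sketch below illustrates one possible rule; the thresholds, score band, and function names are hypothetical, not drawn from the paper, and in a real deployment they would be calibrated against human-graded essays.

```python
from dataclasses import dataclass

@dataclass
class GradingDecision:
    score: int | None   # accepted LLM score, or None if deferred
    needs_human: bool
    reason: str

def triage(llm_score: int, word_count: int,
           min_words: int = 200, band: tuple[int, int] = (2, 5)) -> GradingDecision:
    """Route an essay: trust the LLM score only outside its known blind spots."""
    # Known bias: LLMs tend to overrate short or underdeveloped essays.
    if word_count < min_words:
        return GradingDecision(None, True, "short essay: LLM tends to overrate")
    # Extreme scores carry the highest stakes; defer those to a human.
    if not band[0] <= llm_score <= band[1]:
        return GradingDecision(None, True, "extreme score: needs human confirmation")
    return GradingDecision(llm_score, False, "within calibrated range")

print(triage(llm_score=5, word_count=150))
print(triage(llm_score=4, word_count=420))
```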
Editorial Opinion
This research raises important questions about the suitability of large language models for high-stakes educational assessment. While the consistency between LLM grades and feedback is encouraging for system reliability, the fundamental differences in grading logic compared to human evaluators could disadvantage certain writing styles and student populations if deployed without careful oversight. Educational institutions considering LLMs for essay scoring should treat these findings as a cautionary note—these models may be useful for preliminary screening or feedback generation, but human expertise remains essential for fair and equitable assessment.