Research Exposes How Major LLMs Generate Correlated Fake Experts That Infiltrate Academic Publishing
Key Takeaways
- Major LLMs (Claude, Gemini, GPT) generate correlated fictional expert identities that are specific to each model family and reproducible across independent generations
- 1,655 ghost-authored records carrying valid DataCite DOIs have been found on Zenodo, 991 of them registered in a single month; the DOIs make the records harvestable by scholarly aggregators
- These name priors appear to be intentionally suppressed at model release boundaries, suggesting AI companies identified the problem but have not fully resolved it
Summary
A new arXiv paper documents a troubling phenomenon: major large language models consistently generate the same fictional expert identities across independent generations, and these ghost authors are now appearing in real academic repositories with valid DOIs. The research identifies 'correlated character ensembles': specific name pairs and trios that co-occur far more often than chance would predict and that are consistent within each model family. Anthropic's Claude reliably produces Elena Vasquez, Marcus Chen, and Amara Okafor as fictional experts; Google's Gemini generates Aris Thorne and Lena Petrova; and OpenAI's GPT models prefer Elara Voss. Notably, these patterns are specific to each model family and version, and they appear to be intentionally suppressed at official model release boundaries.
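To make "more frequently than chance would suggest" concrete, here is a minimal sketch of one common way to score name co-occurrence: the ratio of observed joint appearances to the count expected if the names were independent (often called lift). The function name `pair_lift` and the input shape (one set of names per generated document) are assumptions for illustration; the paper's actual statistic is not described in this summary and may differ.

```python
from collections import Counter
from itertools import combinations


def pair_lift(documents: list[set[str]]) -> dict[tuple[str, str], float]:
    """Return observed / expected co-occurrence for every name pair seen together.

    `documents` is a hypothetical input: one set of expert names per
    independently generated text. A lift well above 1.0 means the pair
    appears together more often than independence would predict.
    """
    n = len(documents)
    name_counts: Counter = Counter()
    pair_counts: Counter = Counter()

    for names in documents:
        name_counts.update(names)
        # Sort so each unordered pair maps to a single key.
        pair_counts.update(combinations(sorted(names), 2))

    lift = {}
    for (a, b), together in pair_counts.items():
        # Expected joint count if the two names occurred independently.
        expected = name_counts[a] * name_counts[b] / n
        lift[(a, b)] = together / expected
    return lift
```

A lift near 1.0 is what unrelated names would produce. Pairs drawn from a shared model prior would be expected to score far higher, though small samples can inflate the ratio, so a real analysis would pair it with a significance test.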
The downstream consequences are severe and systemic. Researchers identified 1,655 ghost-authored records on Zenodo—a CERN-operated repository that mints real DataCite DOIs—claiming nonexistent journals with fabricated publication dates. Critically, 991 of these records were registered in a single month, and server-side timestamps confirm deliberate backdating. These papers carry legitimate, harvestable DOIs that are indistinguishable from real research to academic aggregators, contaminating the scholarly record at scale.
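The backdating signal described above can be sketched as a comparison between the publication date a record claims and the timestamp the repository's server assigned when the record was created. The sketch below is illustrative only: the function names, the 31-day tolerance, and the inputs are assumptions, not the researchers' method or any repository's API.

```python
from collections import Counter
from datetime import date, datetime


def is_backdated(claimed_publication: date,
                 server_created: datetime,
                 tolerance_days: int = 31) -> bool:
    """Flag a record whose claimed publication date precedes its
    server-side creation time by more than `tolerance_days`.

    The tolerance is an arbitrary illustrative threshold.
    """
    gap = server_created.date() - claimed_publication
    return gap.days > tolerance_days


def registrations_per_month(created: list[datetime]) -> Counter:
    """Count records by the year and month of their server-side timestamp."""
    return Counter(ts.strftime("%Y-%m") for ts in created)
```

A gap alone is not proof of fabrication, since authors legitimately deposit older work. The flag becomes more telling when it coincides with a nonexistent journal and a spike in `registrations_per_month`, such as the 991 records registered in a single month.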
Beyond simple infiltration, the ghost researchers now appear together in synthetic research groups on platforms such as ResearchGate, listed as collaborators alongside fictional identities from other model families, which makes the AI-generated content increasingly difficult to detect. The temporal patterns of these ghost names provide what researchers describe as 'dateable behavioral fingerprints' that can reliably identify when specific model versions were deployed. Combined with the apparent suppression at release boundaries, this suggests the companies were aware of the issue but failed to prevent its persistence.
Editorial Opinion
This research exposes a critical failure point in both AI safety and academic integrity. That these behaviors are model-family-specific and appear to be suppressed at release boundaries suggests the companies know about the problem but have chosen partial mitigation over root-cause fixes. The infiltration of academic repositories is particularly damaging: the result is not just fake papers, but fake papers that carry validly registered DOIs and are harvestable by the scholarly infrastructure. Until LLM developers address the underlying causes of these correlated priors rather than merely suppressing symptoms, the scholarly record will remain vulnerable to systematic AI-generated contamination.



