Research Exposes How Major LLMs Generate Correlated Fake Experts That Infiltrate Academic Publishing

Key Takeaways

▸Major LLMs (Claude, Gemini, GPT) generate correlated fictional expert identities that are model-family-specific and reproducible across independent generations
▸1,655 ghost-authored records with valid DataCite DOIs have appeared on Zenodo, 991 of them registered in a single month; the valid DOIs make them harvestable by scholarly aggregators
▸These name priors appear to be intentionally suppressed at model release boundaries, suggesting AI companies identified the problem but have not fully resolved it

Source:

Hacker News: https://arxiv.org/abs/2606.02184

Summary

A new arXiv paper documents a troubling phenomenon: major large language models repeatedly generate the same fictional expert identities across independent generations, and these ghost authors now appear in real academic repositories with valid DOIs. The research identifies 'correlated character ensembles' (specific name pairs and trios that co-occur far more often than chance would predict) that are consistent within each model family. Anthropic's Claude reliably produces Elena Vasquez, Marcus Chen, and Amara Okafor as fictional experts; Google's Gemini generates Aris Thorne and Lena Petrova; and OpenAI's GPT models favor Elara Voss. Notably, these patterns are specific to both model family and model version, and they appear to be intentionally suppressed at official model release boundaries.
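One way to make "more frequently than chance would suggest" concrete is a lift score: the observed rate at which two names appear together, divided by the rate expected if each name appeared independently. The sketch below is only an illustration of that idea, not the paper's method; the function name `pair_lift` and the sample data are assumptions made up for this example.

```python
from collections import Counter
from itertools import combinations


def pair_lift(generations):
    """Return {(name_a, name_b): lift} for every pair seen together.

    lift = P(a and b) / (P(a) * P(b)). Values well above 1 mean the pair
    co-occurs more often than independent draws would predict.
    """
    n = len(generations)
    name_counts = Counter()
    pair_counts = Counter()
    for names in generations:
        unique = sorted(set(names))  # count each name once per generation
        name_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return {
        (a, b): (count / n) / ((name_counts[a] / n) * (name_counts[b] / n))
        for (a, b), count in pair_counts.items()
    }


# Hypothetical data: each inner list holds the expert names found in one
# independent generation. These rows are invented, not taken from the paper.
sample = [
    ["Elena Vasquez", "Marcus Chen"],
    ["Elena Vasquez", "Marcus Chen"],
    ["Amara Okafor"],
    ["Aris Thorne"],
]
lifts = pair_lift(sample)
print(lifts[("Elena Vasquez", "Marcus Chen")])
```

With a realistic sample, a lift near 1 would indicate no association, while a pair that shows up together in most generations from one model family would score far higher.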

The downstream consequences are severe and systemic. The researchers identified 1,655 ghost-authored records on Zenodo, a CERN-operated repository that mints real DataCite DOIs, that cite nonexistent journals and carry fabricated publication dates. Critically, 991 of these records were registered in a single month, and server-side timestamps confirm deliberate backdating. Because the records carry legitimate, harvestable DOIs, academic aggregators cannot distinguish them from real research, and the scholarly record is contaminated at scale.
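The backdating signal described above comes down to comparing two dates per record: the publication date the record claims and the time the server actually created it. The following is a minimal sketch under stated assumptions; the field names `claimed` and `created`, the 30-day tolerance, and the sample records are hypothetical and do not reflect any repository's real schema. A large gap is only a flag for review, since legitimate older work is also deposited late.

```python
from datetime import date, datetime, timedelta


def is_backdated(claimed_publication, server_created, tolerance_days=30):
    """True when the claimed publication date precedes the server-side
    creation time by more than tolerance_days."""
    gap = server_created.date() - claimed_publication
    return gap > timedelta(days=tolerance_days)


# Hypothetical records; ids, field names, and dates are invented.
records = [
    {"id": "rec-1", "claimed": date(2021, 3, 1), "created": datetime(2025, 5, 10, 12, 0)},
    {"id": "rec-2", "claimed": date(2025, 5, 9), "created": datetime(2025, 5, 10, 12, 0)},
]

flagged = [r["id"] for r in records if is_backdated(r["claimed"], r["created"])]
print(flagged)
```

Combining this flag with other signals, such as a burst of registrations in one month or a journal name that cannot be resolved, would give a stronger indication than the date gap alone.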

Beyond simple infiltration, the ghost researchers are forming synthetic research groups on platforms such as ResearchGate, where fictional identities from multiple model families appear as collaborators, which makes the AI-generated content harder to detect. The temporal patterns of these ghost names provide what the researchers describe as 'dateable behavioral fingerprints' that can reliably identify when specific model versions were deployed. The pattern suggests the companies were aware of the issue but did not prevent its persistence.

Editorial Opinion

This research exposes a critical failure point in both AI safety and academic integrity. That these behaviors are model-family-specific and consistently suppressed at release boundaries suggests the companies know about the problem but have chosen partial mitigation over root-cause fixes. The infiltration of academic repositories with valid DOIs is particularly damaging: the result is not just fake papers, but fake papers that carry formally valid identifiers and are harvestable by the scholarly infrastructure. Until LLM developers address the underlying causes of these correlated priors rather than merely suppressing symptoms, the scholarly record will remain vulnerable to systematic AI-generated contamination.