Ghost Couples: Study Reveals How LLMs Generate Recurring Fictional Authors That Contaminate Academic Publishing
Key Takeaways
- ▸LLMs generate correlated character ensembles whose co-occurrence rates far exceed chance and remain consistent across independent generations
- ▸These ghost author patterns are model-family-specific and version-specific, creating detectable temporal fingerprints of model development
- ▸Researchers identified 1,655 ghost-authored records with fabricated metadata on Zenodo, all carrying authentic, harvestable DataCite DOIs
Summary
A new arXiv paper reports that major large language models consistently generate correlated pairs and trios of fictional characters that recur across hundreds of independently produced AI-generated documents. These 'ghost couples' are not random; they are model-family-specific patterns: Claude reliably produces Elena Vasquez + Marcus Chen + Amara Okafor, Gemini generates Aris Thorne + Lena Petrova, and GPT consistently uses Elara Voss. The patterns are also version-specific and are actively suppressed at model release boundaries, which leaves detectable fingerprints in content production timelines.
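One way to make "co-occurrence far above chance" concrete is the lift of a name pair: the observed joint frequency divided by the frequency expected if the two names appeared independently. The sketch below is a minimal illustration, not the paper's method; the function name `pair_lift` and the four toy documents are hypothetical, invented only to show the arithmetic.

```python
from collections import Counter
from itertools import combinations


def pair_lift(documents):
    """Return {(name_a, name_b): lift} for every pair that co-occurs at least once.

    lift = P(a and b) / (P(a) * P(b)); a value near 1.0 is what chance predicts,
    values well above 1.0 indicate correlated names.
    """
    n = len(documents)
    name_counts = Counter()
    pair_counts = Counter()
    for names in documents:
        unique = sorted(set(names))  # count each name once per document
        name_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return {
        (a, b): (count / n) / ((name_counts[a] / n) * (name_counts[b] / n))
        for (a, b), count in pair_counts.items()
    }


# Toy corpus (hypothetical, NOT data from the study): one list of names per document.
docs = [
    ["Elena Vasquez", "Marcus Chen"],
    ["Elena Vasquez", "Marcus Chen"],
    ["Aris Thorne"],
    ["Lena Petrova"],
]
lift = pair_lift(docs)
print(lift[("Elena Vasquez", "Marcus Chen")])  # 2.0: twice the rate chance predicts
```

On a real corpus, a lift that stays high across independently generated batches would be the kind of consistency the paper describes; a proper analysis would also need a significance test, which this sketch omits.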
The research also documents a real-world consequence at scale: the researchers identified 1,655 ghost-authored records with fabricated publication dates on Zenodo, a CERN-operated scholarly repository. Critically, 991 of these records were registered within a single month, and all carry authentic DataCite DOIs, the digital identifiers that scholarly databases use to index and harvest papers. Server-side timestamps show that the records were deliberately backdated. Ghost names additionally appear on ResearchGate, where they form synthetic research groups spanning multiple LLM model families, and the dates on which ghost-authored content appears serve as reliable temporal proxies for model deployment windows.
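The backdating check described above amounts to comparing the publication date a record claims with the timestamp at which the repository's server actually registered it. The following is a minimal sketch under stated assumptions: the function name `is_backdated`, the 30-day tolerance, and the example dates are all hypothetical and are not taken from the study or from any specific repository API.

```python
from datetime import date, datetime, timedelta


def is_backdated(publication_date, created, tolerance_days=30):
    """Flag a record whose stated publication date precedes its server-side
    registration time by more than `tolerance_days`.

    publication_date: ISO date string claimed in the metadata (assumed field)
    created: ISO 8601 timestamp recorded by the server (assumed field)
    """
    stated = date.fromisoformat(publication_date)
    registered = datetime.fromisoformat(created).date()
    return registered - stated > timedelta(days=tolerance_days)


# Hypothetical examples: a claimed 2021 date on a record registered in 2025 is flagged.
print(is_backdated("2021-06-01", "2025-03-14T09:30:00+00:00"))  # True
print(is_backdated("2025-03-10", "2025-03-14T09:30:00+00:00"))  # False
```

A gap alone is not proof of fabrication, since authors legitimately deposit older work; the tolerance is a tunable assumption, and a flag here would only mark a record for closer inspection.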
- ▸Ghost authors are suppressed at model release boundaries, which suggests intentional removal in newer versions
- ▸Synthetic papers with real DOIs contaminate scholarly aggregators and academic metadata systems at scale
Editorial Opinion
This research exposes a critical vulnerability in AI deployment: language models leak correlated character priors into persistent scholarly infrastructure with real, harvestable identifiers. The finding that over 1,600 ghost-authored records now carry authentic DOIs and have entered academic databases is both a technical curiosity and a genuine threat to the integrity of knowledge systems. It shows how AI-generated content can systematically pollute permanent records when no detection mechanisms are in place. Publishers and repositories must urgently implement stronger validation and AI-detection protocols before ghost authors become indistinguishable from legitimate scholarship.

