The Ghost Couple: How LLMs Generate Consistent Fictional Personas That Contaminate Academic Publishing
Key Takeaways
- ▸LLMs generate correlated ensembles of character names specific to each model family rather than random individual names, a fingerprint of their training data and generation patterns
- ▸At least 1,655 ghost-authored papers with real, harvestable DOIs now exist in academic repositories, with 991 registered in a single month
- ▸Ghost records were deliberately backdated and their DOIs registered through DataCite, which lets academic aggregators discover them and carry them into the scholarly literature
Summary
A new arXiv research paper reveals a disturbing pattern: large language models generate consistent, correlated fictional personas that appear across independently produced AI-generated documents. The study identifies 'ghost couples' specific to each model family: Elena Vasquez, Marcus Chen, and Amara Okafor consistently appear together in Claude-generated content, Aris Thorne and Lena Petrova are associated with Gemini, and Elara Voss with GPT. These personas show up as volcano experts, astronauts, podcast hosts, and academic co-authors despite never having existed.
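To make the idea of a correlated ensemble concrete, here is a minimal sketch of how co-occurring persona names could be counted across a set of documents. It is not the study's method: the function names, the simple substring matching, and the grouping dictionary are illustrative assumptions, and only the persona names themselves come from the summary above.

```python
from collections import Counter
from itertools import combinations

# Persona names quoted in the summary, grouped by associated model family.
# The grouping structure is an illustrative assumption, not the paper's data format.
PERSONAS = {
    "Claude": ["Elena Vasquez", "Marcus Chen", "Amara Okafor"],
    "Gemini": ["Aris Thorne", "Lena Petrova"],
    "GPT": ["Elara Voss"],
}


def names_in(text):
    """Return the set of known persona names mentioned in a document."""
    lowered = text.lower()
    return {
        name
        for names in PERSONAS.values()
        for name in names
        if name.lower() in lowered
    }


def cooccurrence_counts(documents):
    """Count how many documents mention each unordered pair of persona names."""
    counts = Counter()
    for text in documents:
        # Sorting makes each pair a stable (alphabetical) dictionary key.
        for pair in combinations(sorted(names_in(text)), 2):
            counts[pair] += 1
    return counts


def family_hits(text):
    """Count, per model family, how many of its persona names a document mentions."""
    found = names_in(text)
    return {
        family: sum(name in found for name in names)
        for family, names in PERSONAS.items()
    }
```

Pairs that recur across many unrelated documents are the signal of interest; a single name appearing alone says little. A real pipeline would need more careful name matching than a case-insensitive substring test.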
The research has uncovered a significant downstream consequence: at least 1,655 ghost-authored papers with real DOIs registered on Zenodo, a CERN-operated repository. These records include fabricated journal names and backdated publication dates, along with server-side timestamps that the study presents as proof of deliberate registration. The ghost personas appear across multiple model families, forming synthetic research groups with publication dates that provide reliable temporal proxies for model deployment windows. Because the DOIs are registered through DataCite, any scholarly aggregator can harvest them, so these phantom papers can contaminate academic databases and literature reviews at scale.
- These behavioral patterns are specific to model versions and are actively suppressed at release boundaries, which creates dateable fingerprints for tracking model deployment history
- The presence of synthetic research groups across model families suggests the issue is systemic to how LLMs are trained and deployed
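The backdating described in the summary rests on a simple comparison: the publication date a record claims versus the server-side time at which it was registered. The sketch below illustrates that check on plain dictionaries. The field names `publication_date` and `created`, the helper names, and the 31-day tolerance are all assumptions made for illustration; they are not taken from the study or guaranteed to match any repository's API.

```python
from collections import Counter
from datetime import date, datetime


def _parse_timestamp(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_backdated(record, tolerance_days=31):
    """Flag a record whose claimed publication date precedes its
    registration timestamp by more than `tolerance_days`.

    `record` is a hypothetical flat dict with two assumed keys:
      - "publication_date": claimed date, "YYYY-MM-DD"
      - "created": server-side registration timestamp, ISO 8601
    """
    published = date.fromisoformat(record["publication_date"])
    registered = _parse_timestamp(record["created"]).date()
    return (registered - published).days > tolerance_days


def registrations_per_month(records):
    """Count records by the year-month of their server-side timestamp."""
    return Counter(
        _parse_timestamp(r["created"]).strftime("%Y-%m") for r in records
    )
```

A gap between the two dates is not proof of bad faith on its own, since legitimate older work is often deposited late; the monthly counts help show whether many such records arrived in one burst.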
Editorial Opinion
This research exposes a critical vulnerability: AI-generated content can contaminate academic infrastructure at scale. Because these ghost papers carry real, discoverable DOIs, the contamination is not a theoretical problem; it is already happening in repositories trusted by researchers worldwide. The model-specific nature of these personas and their suppression at release boundaries suggest this may be a deliberate, if implicit, behavior in LLM training. The implications are severe: if scholarly aggregators unknowingly index ghost papers, researchers risk building on fabricated findings, undermining the integrity of science itself. This work calls for urgent scrutiny of LLM outputs in academic contexts and stricter validation of DOI registrations.



