BotBeat

RESEARCH · Anthropic · 2026-06-18

The Ghost Couple: How LLMs Generate Consistent Fictional Personas That Contaminate Academic Publishing

Key Takeaways

  • LLMs generate correlated character ensembles specific to each model family rather than random individual names, a fingerprint of their training data and generation patterns
  • At least 1,655 ghost-authored papers with real, harvestable DOIs now exist in academic repositories, with 991 registered in a single month
  • Ghost DOIs were deliberately backdated and registered through DataCite, enabling discovery by academic aggregators and contamination of scholarly literature
Source: Hacker News, https://arxiv.org/abs/2606.02184

Summary

A new arXiv research paper reports that large language models generate consistent, correlated fictional personas that recur across independently produced AI-generated documents. The study identifies model-family-specific 'ghost couples': Elena Vasquez, Marcus Chen, and Amara Okafor consistently appear together in Claude-generated content, Aris Thorne and Lena Petrova are associated with Gemini, and Elara Voss with GPT. These personas turn up as volcano experts, astronauts, podcast hosts, and academic co-authors, despite never having existed.
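
One way to picture the co-occurrence signal described above is a simple pair count across documents. The sketch below is illustrative only: the function name, the sample texts, and the plain substring matching are assumptions, not the method used in the paper.

```python
from collections import Counter
from itertools import combinations

# Persona names as reported in the summary above.
PERSONAS = [
    "Elena Vasquez", "Marcus Chen", "Amara Okafor",
    "Aris Thorne", "Lena Petrova", "Elara Voss",
]

def persona_pair_counts(documents, personas=PERSONAS):
    """Count how often each pair of persona names appears in the same document."""
    counts = Counter()
    for text in documents:
        # Sort so each pair is always stored in the same order.
        present = sorted(name for name in personas if name in text)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

# Hypothetical sample texts, invented purely for illustration.
docs = [
    "Dr. Elena Vasquez and Marcus Chen studied the eruption.",
    "Hosts Marcus Chen and Elena Vasquez welcome Amara Okafor.",
    "Aris Thorne and Lena Petrova boarded the station.",
]
pair_counts = persona_pair_counts(docs)
```

Pairs that recur far more often than chance would predict within one model family's output, and rarely across families, would be consistent with the correlated-ensemble pattern the study describes.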

The research also documents a significant downstream consequence: at least 1,655 ghost-authored papers with real DOIs registered on Zenodo, a CERN-operated repository. These records include fabricated journal names, backdated publication dates, and server-side timestamps that, according to the study, prove deliberate registration. The ghost personas appear across multiple model families, forming synthetic research groups whose publication dates provide reliable temporal proxies for model deployment windows. Because the DOIs are registered through DataCite, any scholarly aggregator can harvest them, so these phantom papers can contaminate academic databases and literature reviews at scale.
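
The gap between a claimed publication date and a server-side registration timestamp can be checked mechanically. The following is a minimal sketch under stated assumptions: the field names `publication_date` and `registered`, the 365-day threshold, and the sample record are all hypothetical, and real repository or registry metadata may be shaped differently.

```python
from datetime import date, datetime

def backdating_gap_days(claimed_publication: str, registered_at: str) -> int:
    """Days between a record's claimed publication date and its registration time.

    A large positive value means the record claims to have been published
    long before its DOI was actually registered.
    """
    claimed = date.fromisoformat(claimed_publication)
    registered = datetime.fromisoformat(registered_at).date()
    return (registered - claimed).days

def looks_backdated(record: dict, threshold_days: int = 365) -> bool:
    # "publication_date" and "registered" are hypothetical field names;
    # the threshold is an arbitrary illustrative choice.
    gap = backdating_gap_days(record["publication_date"], record["registered"])
    return gap > threshold_days

# Hypothetical record, invented for illustration.
sample = {
    "publication_date": "2019-03-01",
    "registered": "2026-04-12T09:30:00+00:00",
}
flagged = looks_backdated(sample)
```

A flag from a check like this is only a prompt for manual review, since legitimate older work is often deposited years after publication.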

  • These behavioral patterns are model-version-specific and actively suppressed at release boundaries, creating dateable fingerprints for tracking model deployment history
  • The presence of synthetic research groups across model families suggests the issue is systemic to how LLMs are trained and deployed

Editorial Opinion

This research exposes a critical vulnerability in how AI-generated content can contaminate academic infrastructure at scale. Because these ghost papers carry real, discoverable DOIs, the contamination is not a theoretical problem: it is already happening in repositories trusted by researchers worldwide. The model-specific nature of these personas and their suppression at release boundaries suggests this may be a deliberate (if implicit) behavior in LLM training. The implications are severe: if scholarly aggregators unknowingly index ghost papers, researchers risk building on fabricated findings, undermining the integrity of science itself. This work calls for urgent scrutiny of LLM outputs in academic contexts and stricter validation of DOI registrations.

Natural Language Processing (NLP) · Generative AI · Ethics & Bias · Misinformation & Deepfakes
