Frontier AI Companies Train Models on User Chats by Default, New Privacy Analysis Reveals
Key Takeaways
- All six major U.S. frontier AI developers examined use user chat data for model training by default, with some retaining data indefinitely
- Companies may collect and train on sensitive personal information, including biometric and health data, from user conversations
- Four of the six companies include children's chat data in model training, raising significant ethical and legal concerns
Summary
A comprehensive research paper published on arXiv examines the privacy policies of six major U.S. frontier AI developers, revealing widespread practices of using user chat data for model training without explicit consent. The study, conducted by researchers including Jennifer King, Kevin Klyman, and Emily Capstick, analyzes how companies such as OpenAI and Anthropic handle the hundreds of millions of conversations users have with AI chatbots.
The researchers found that all six examined companies use user chat data to train and improve their models by default, with some retaining this data indefinitely. The analysis reveals that these companies may collect and train on personal information disclosed in conversations, including sensitive categories such as biometric and health data, as well as files uploaded by users. Particularly concerning is that four of the six companies appear to include children's chat data in their training processes, alongside customer data from other products.
Applying a novel qualitative coding schema based primarily on the California Consumer Privacy Act (CCPA), the researchers identified significant gaps in transparency across the industry. Privacy policies often lack essential information about data practices, raising questions about user consent, data security risks from indefinite retention, and the ethical implications of training on children's data. The study concludes with recommendations for both policymakers and developers to address the mounting data privacy challenges posed by LLM-powered chatbots.
- Privacy policies across the industry lack transparency and omit essential information about data collection and use practices
- The research highlights an urgent need for stronger regulations and greater accountability in AI chatbot data handling
Editorial Opinion
This research exposes a troubling disconnect between user expectations and industry practices in the AI chatbot space. While millions of people treat these conversations as ephemeral interactions, companies are building permanent data repositories that fuel competitive advantages—often without meaningful consent. The inclusion of children's data in training pipelines is particularly alarming and may violate existing child protection laws. As AI capabilities advance, the industry's 'collect everything by default' approach demands immediate regulatory intervention to protect user privacy rights.