New Research Exposes Privacy Gaps in Major AI Companies' Use of User Chat Data for Model Training
Key Takeaways
- All six major U.S. AI developers studied use user chat data for model training by default, with some retaining data indefinitely
- Companies may collect and train on sensitive personal information including biometric and health data, as well as uploaded files
- Four of six companies appear to include children's chat data in training datasets, raising significant ethical and legal concerns
- Privacy policies consistently lack essential transparency about data collection and usage practices
- The research provides specific recommendations for policymakers and developers to address LLM privacy challenges
Summary
A comprehensive research paper published on arXiv analyzes the privacy policies of six leading U.S. AI companies, revealing significant concerns about how user chat data is collected and used to train large language models. The study, authored by Jennifer King, Kevin Klyman, and colleagues, found that all six frontier AI developers appear to use user chat data for model training by default, with some retaining this data indefinitely. To compare data practices systematically across companies, the researchers developed a novel qualitative coding schema grounded in the California Consumer Privacy Act (CCPA).
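The paper's actual coding instrument is not reproduced here, but to make the method concrete, the following is a minimal Python sketch of what a CCPA-grounded coding schema could look like as a data structure. The category names, the PolicyCoding class, and the example values are all hypothetical illustrations for this summary, not the authors' schema or findings.

```python
from dataclasses import dataclass, field

# Hypothetical CCPA-inspired coding categories -- illustrative only,
# not the schema actually used by King, Klyman, and colleagues.
CCPA_CODES = {
    "trains_on_chats_by_default": "Policy permits training on user chats without opt-in",
    "retention_period_stated": "Policy states a concrete retention period for chat data",
    "sensitive_data_collected": "Policy covers biometric, health, or other sensitive categories",
    "childrens_data_excluded": "Policy excludes minors' chat data from training sets",
    "opt_out_mechanism": "Policy describes a working opt-out for training use",
}

@dataclass
class PolicyCoding:
    """One coder's judgments for a single company's privacy policy."""
    company: str
    # Code name -> True / False / None (policy is silent or unclear).
    codes: dict[str, bool | None] = field(default_factory=dict)

    def transparency_gaps(self) -> list[str]:
        """Codes the policy leaves unaddressed -- the missing detail the study flags."""
        return [c for c in CCPA_CODES if self.codes.get(c) is None]

# Illustrative usage with a fictional company; all values are made up.
example = PolicyCoding(
    company="ExampleAI",
    codes={
        "trains_on_chats_by_default": True,
        "retention_period_stated": None,   # policy is silent -> transparency gap
        "sensitive_data_collected": True,
        "childrens_data_excluded": None,
        "opt_out_mechanism": False,
    },
)
print(example.transparency_gaps())  # ['retention_period_stated', 'childrens_data_excluded']
```

Coding each policy against a fixed set of yes/no/unclear questions like this is what enables the cross-company comparison the paper reports, with "unclear" answers surfacing exactly the transparency gaps the authors emphasize.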
The findings raise particular alarm about the collection of sensitive personal information disclosed in chats, including biometric and health data, as well as files uploaded by users. Four of the six companies examined appear to include children's chat data in their training datasets, alongside customer data from other products. The researchers note that privacy policies often lack essential details about these practices, leaving a significant gap between what users understand about their data and how it is actually used.
The paper examines the implications of these practices, including the lack of meaningful user consent to the use of chat data in model training, the data security risks posed by indefinite retention, and the ethical concerns around training on children's data. The authors conclude with recommendations for both policymakers and developers to address these privacy challenges. This research arrives at a crucial moment, as hundreds of millions of people worldwide now regularly interact with LLM-powered chatbots, often sharing personal and sensitive information without full awareness of how it may be repurposed.
Editorial Opinion
This research delivers a sobering reality check for the AI industry's approach to user privacy. While companies race to improve their models with ever-larger datasets, the default practice of training on user conversations (including sensitive personal information and children's data) without explicit, informed consent represents a fundamental misalignment between business incentives and user expectations. The lack of transparency documented in this study suggests that current self-regulation is insufficient, and it may strengthen arguments for comprehensive AI-specific privacy legislation that goes beyond existing frameworks like the CCPA.



