GitHub Reverses Course, Will Train AI Models on User Data by Default Starting April 24
Key Takeaways
- GitHub will begin training AI models on user data (code snippets, inputs, outputs, and context) by default starting April 24, 2024, for the Copilot Free, Pro, and Pro+ tiers
- The policy uses an opt-out model, letting users disable data training in their privacy settings, in contrast to the stricter opt-in consent requirements typical in Europe
- Private repositories are no longer fully private when Copilot is enabled, as code snippets from them can be collected for AI training purposes
Summary
Microsoft's GitHub announced it will begin using customer interaction data to train its AI models starting April 24, 2024, marking a significant policy reversal. The change applies to Copilot Free, Pro, and Pro+ users, while Copilot Business and Enterprise customers remain exempt. The data collection includes code snippets, inputs, outputs, file names, comments, and user interactions with Copilot features from both public and private repositories.
Users can opt out in their privacy settings; the change follows an opt-out model rather than the opt-in consent typically required in Europe. GitHub's Chief Product Officer Mario Rodriguez argued the data collection will improve model accuracy and code suggestions, and GitHub maintains that similar practices are industry standard. The policy shift has nonetheless generated significant community backlash, with users questioning what "private" still means for private repositories and raising concerns about consent, even though GitHub Copilot's underlying Codex model was already trained on publicly available GitHub code.
Editorial Opinion
GitHub's reversal on AI training data represents a troubling normalization of data extraction in the AI industry. While the company frames this as an opt-out choice, the fundamental issue remains: developers who depend on Copilot must actively opt out to protect their code, rather than being asked for explicit consent. The fact that GitHub's own Copilot was built on previously scraped GitHub code illustrates how the AI industry has systematically extracted value from developers without meaningful consent, a precedent that makes this new policy feel inevitable rather than justified.