LLM Training Crawlers Overwhelm SourceHut, Disrupting Open-Source Infrastructure
Key Takeaways
- Sophisticated LLM crawlers bypassed SourceHut's defenses, forcing the platform to degrade service by disabling routes
- The attack reveals that standard bot mitigation strategies are insufficient against determined, well-resourced data collectors
- Legitimate users continue to experience disruptions despite deployed mitigations
- The incident exemplifies a systemic issue: AI training data collection practices lack coordination and accountability mechanisms
Summary
SourceHut's git.sr.ht service has been severely disrupted by an aggressive, sophisticated botnet scraping the platform for LLM training data. The attackers successfully circumvented the platform's standard anti-bot defenses, forcing administrators to disable numerous web service routes to stabilize the system. While mitigations have reduced the impact, service disruptions continue for legitimate users.
The incident highlights a growing tension between AI companies' voracious appetite for training data and the stability of critical open-source infrastructure. The unnamed crawlers' ability to bypass standard defenses suggests either significant resources or highly coordinated, persistent collection efforts. The disruption raises urgent questions about data collection practices across the AI industry and whether the current approach is sustainable for public infrastructure.
Editorial Opinion
This isn't the first time AI companies' data collection has disrupted critical infrastructure, and it likely won't be the last. While training data is essential for LLM development, the industry's approach of aggressively scraping public platforms without coordination is ethically and practically indefensible. Major AI labs should establish transparent data licensing practices, respect robots.txt, and engage with open-source communities rather than treat them as targets for extraction. The current Wild West approach to data collection threatens the collaborative infrastructure that benefits everyone, including AI companies themselves.



