BotBeat
INDUSTRY REPORT | AI Industry (Unknown) | 2026-06-07

LLM Training Crawlers Overwhelm SourceHut, Disrupting Open-Source Infrastructure

Key Takeaways

  • Sophisticated LLM crawlers bypassed SourceHut's defenses, forcing the platform to degrade service by disabling routes
  • The attack reveals that standard bot mitigation strategies are insufficient against determined, well-resourced data collectors
  • Legitimate users continue to experience disruptions despite deployed mitigations
Source: Hacker News, https://status.sr.ht/issues/2026-06-06-llms-again/

Summary

SourceHut's git.sr.ht service has been severely disrupted by an aggressive, sophisticated botnet scraping the platform for LLM training data. The attackers successfully circumvented the platform's standard anti-bot defenses, forcing administrators to disable numerous web service routes to stabilize the system. While mitigations have reduced the impact, service disruptions continue for legitimate users.

The incident highlights a growing tension between AI companies' voracious appetite for training data and the stability of critical open-source infrastructure. The unnamed crawlers' ability to bypass standard defenses suggests either significant resources or highly coordinated, persistent collection efforts. The disruption raises urgent questions about data collection practices across the AI industry and whether the current approach is sustainable for public infrastructure.

  • The incident exemplifies a systemic issue: AI training data collection practices lack coordination and accountability mechanisms

Editorial Opinion

This isn't the first time AI companies' data collection has disrupted critical infrastructure, and it likely won't be the last. While training data is essential for LLM development, the industry's approach—aggressively scraping public platforms without coordination—is ethically and practically indefensible. Major AI labs should establish transparent data licensing practices, respect robots.txt, and engage with open-source communities rather than treat them as targets for extraction. The current Wild West approach to data collection threatens the collaborative infrastructure that benefits everyone, including AI companies themselves.
