LLM Training Crawlers Overwhelm SourceHut, Disrupting Open-Source Infrastructure

Key Takeaways

▸ Sophisticated LLM crawlers bypassed SourceHut's defenses, forcing the platform to degrade service by disabling routes
▸ The attack reveals that standard bot mitigation strategies are insufficient against determined, well-resourced data collectors
▸ Legitimate users continue to experience disruptions despite deployed mitigations

Source:

Hacker News: https://status.sr.ht/issues/2026-06-06-llms-again/

Summary

SourceHut's git.sr.ht service has been severely disrupted by an aggressive, sophisticated botnet scraping the platform for LLM training data. The attackers successfully circumvented the platform's standard anti-bot defenses, forcing administrators to disable numerous web service routes to stabilize the system. While mitigations have reduced the impact, service disruptions continue for legitimate users.

The incident highlights a growing tension between AI companies' voracious appetite for training data and the stability of critical open-source infrastructure. The unnamed crawlers' ability to bypass standard defenses suggests either significant resources or highly coordinated, persistent collection efforts. The disruption raises urgent questions about data collection practices across the AI industry and whether the current approach is sustainable for public infrastructure.

The incident exemplifies a systemic issue: AI training data collection practices lack coordination and accountability mechanisms.

Editorial Opinion

This isn't the first time AI companies' data collection has disrupted critical infrastructure, and it likely won't be the last. While training data is essential for LLM development, the industry's approach—aggressively scraping public platforms without coordination—is ethically and practically indefensible. Major AI labs should establish transparent data licensing practices, respect robots.txt, and engage with open-source communities rather than treat them as targets for extraction. The current Wild West approach to data collection threatens the collaborative infrastructure that benefits everyone, including AI companies themselves.
