Massive AI Scraping Campaign Hits Record Scale: 1 in Every 2,000 Public IPs Involved
Key Takeaways
- ▸A single 24-hour scraping campaign involved approximately 1 in 2,000 public IPv4 addresses globally—2.04 million unique IPs generating over 5 million bot-classified requests
- ▸The attacks originated from major tech company networks, most notably Microsoft and Google / Alphabet, suggesting direct involvement by companies running large-scale AI training pipelines
- ▸99.77% of traffic came from IPv4 addresses spread across 202 of the 256 IPv4 /8 blocks, indicating deliberate load distribution designed to evade network- or geography-based blocking
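The /8-block figure in the takeaways above can be reproduced from any list of source addresses, since an IPv4 /8 block is simply the first octet. The sketch below is a hypothetical illustration of that analysis; the input format (one address per line) is an assumption, not a detail from the incident report.

```python
# Hypothetical sketch: count unique source IPs and the distinct IPv4 /8
# blocks they span, given one address per line. The input format is an
# assumption, not a detail from the incident report.
from ipaddress import ip_address

def summarize_sources(ip_lines):
    """Return (unique IPv4 count, number of distinct /8 blocks)."""
    unique_ips = set()
    first_octets = set()
    for line in ip_lines:
        line = line.strip()
        if not line:
            continue
        addr = ip_address(line)
        if addr.version == 4:
            unique_ips.add(addr)
            # The /8 block is identified by the first octet (top 8 bits).
            first_octets.add(int(addr) >> 24)
    return len(unique_ips), len(first_octets)

sample = ["74.7.227.156", "74.7.227.156", "8.8.8.8", "203.0.113.9"]
print(summarize_sources(sample))  # → (3, 3): one duplicate, three /8 blocks
```

At the scale described in this incident, the same per-octet counting works unchanged; only the input would be a multi-gigabyte log rather than a short list.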
Summary
A detailed analysis of one of the largest coordinated web scraping attacks on record reveals the staggering scale of data collection operations for AI training. On April 24th, 2026, infrastructure operator HotGarbage logged attacks from 2,040,670 unique IP addresses—representing approximately one in every 2,000 public IPv4 addresses globally—hitting their websites in a single 24-hour period. The sustained assault reached 4,000+ requests per minute, completely overwhelming a modest VPS with a single CPU core. A subsequent wave reached even higher volumes.
Analysis of the source addresses traced traffic to a wide range of networks, with Microsoft (AS8075) and Google / Alphabet (AS15169) standing out as the most prominent origins. A single Microsoft IP address (74.7.227.156) generated 150,483 requests on its own, while multiple Google IPs showed the coordinated patterns typical of systematic data collection operations. Across the global IPv4 address space, the attack involved 202 of 256 /8 blocks and generated over 5 million requests classified as bot traffic, demonstrating a deliberate strategy of spreading load across numerous IP addresses to evade traditional IP-based blocking defenses.
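A quick back-of-envelope check supports the headline ratio. The ~3.7 billion routable-address estimate below is an assumption (2^32 minus reserved and special-use ranges), not a figure from the incident report:

```python
# Back-of-envelope check of the "1 in 2,000" figure.
TOTAL_IPV4 = 2 ** 32                  # 4,294,967,296 addresses in all
ROUTABLE_ESTIMATE = 3_700_000_000     # assumed publicly routable space
UNIQUE_SOURCES = 2_040_670            # unique IPs logged in 24 hours

# Against the routable estimate: roughly one in ~1,800 public addresses.
print(round(ROUTABLE_ESTIMATE / UNIQUE_SOURCES))
# Against the full 32-bit space: roughly one in ~2,100.
print(round(TOTAL_IPV4 / UNIQUE_SOURCES))
```

Either way the arithmetic lands near the reported one-in-2,000 figure, so the headline claim is internally consistent.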
This incident provides rare visibility into the infrastructure costs and operational coordination required to scrape web content at the enormous scale demanded by modern AI training pipelines. It raises fundamental questions about the sustainability of defending against such operations, the adequacy of legal protections for content creators, and whether current regulatory frameworks are equipped to address AI-driven data collection at this magnitude.
Notably, standard VPS infrastructure (a single CPU core and 200 Mbps of bandwidth) proved powerless against the assault, suggesting that defending against AI-driven scraping at this scale is effectively impossible for individual content creators and small-to-medium websites.
Editorial Opinion
This incident starkly illustrates the enormous and highly coordinated scale of data collection operations that now underpin modern AI systems. When roughly one in every 2,000 public IPv4 addresses (about 0.05% of the address space) participates in scraping activity in a single day, it becomes clear that web content harvesting for AI training has reached industrial proportions that dwarf traditional crawling and indexing. The documented involvement of major tech companies raises uncomfortable questions about how aggressive data acquisition practices have become, even as these same companies publicly commit to responsible AI development. Without significant regulatory intervention or industry-wide ethical commitments, the arms race between AI data demands and infrastructure defense will only intensify.