Runway · RESEARCH · 2026-05-05

Runway Cuts GPU Model Load Times 60x With Peer-to-Peer Weight Sharing

Key Takeaways

  • NCCLBack reduces GPU cold-start times 60x by broadcasting model weights between GPUs instead of downloading from storage
  • Peer-to-peer weight transfer leverages high-bandwidth GPU interconnects (InfiniBand 200–400 Gbps, NVLink 900 GB/s) versus cloud storage (2–10 Gbps)
  • Eliminates the "thundering herd" problem where dozens of workers simultaneously download identical model weights, saturating storage bandwidth
Source: Hacker News · https://runwayml.com/news/60x-faster-cold-starts-treating-peer-gpus-as-weight-servers

Summary

Runway has developed NCCLBack, a system that cuts cold-start times for GPU inference workers from minutes to seconds by using peer-to-peer weight transfer instead of centralized storage downloads. In traditional deployments, every GPU worker independently downloads model weights from cloud storage, a process that takes minutes per worker and creates a bottleneck during fleet-wide rollouts. NCCLBack changes the architecture: one worker downloads the weights normally, then broadcasts them directly to peer GPUs over high-bandwidth interconnects such as NVLink and InfiniBand.
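The article does not publish NCCLBack's implementation, but the mechanism it describes, one rank loading from storage and NCCL broadcasting the tensors to its peers, can be sketched with PyTorch's NCCL backend. Everything below (the function name, the rank-0-downloads convention, the device mapping) is illustrative, not Runway's code:

```python
import torch
import torch.distributed as dist

def load_weights_p2p(model: torch.nn.Module, checkpoint_path: str) -> None:
    """Sketch: rank 0 pays the slow cloud-storage download once;
    every other rank receives the weights over NCCL broadcast."""
    dist.init_process_group(backend="nccl")  # NCCL routes over NVLink/InfiniBand
    rank = dist.get_rank()
    # Simplified local-device mapping; real launchers use LOCAL_RANK.
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
    model.to(device)

    if rank == 0:
        # The only worker that touches cloud storage.
        state = torch.load(checkpoint_path, map_location=device)
        model.load_state_dict(state)

    # Broadcast each parameter/buffer tensor in-place from rank 0 to all peers.
    for tensor in model.state_dict().values():
        dist.broadcast(tensor, src=0)

    dist.barrier()  # every rank now holds identical weights
```

Launched with torchrun (which supplies the rendezvous environment variables) across a fleet, each additional worker adds interconnect bandwidth rather than contending for the storage link.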

The technical insight is physics-based: downloads from Google Cloud Storage (GCS) deliver 2–10 Gbps per worker, while InfiniBand or RoCE links deliver 200–400 Gbps, and NVLink reaches 900 GB/s (roughly 7,200 Gbps) on H100 systems. By treating already-loaded GPUs as weight servers, Runway eliminates the "thundering herd" problem in which dozens of workers simultaneously saturate storage bandwidth with redundant, identical downloads. The system is built as a stack of layers handling discovery, coordination, transfer, and verification, centered on NCCL's broadcast primitive.
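To make the gap concrete, here is a back-of-envelope comparison for a hypothetical 28 GB checkpoint (the checkpoint size is an assumption for illustration; the link speeds are the figures quoted above):

```python
# Transfer time for a hypothetical 28 GB checkpoint over each link.
CHECKPOINT_GB = 28  # assumed size, e.g. a ~14B-parameter fp16 model

links_gb_per_s = {
    "cloud storage (5 Gbps)": 5 / 8,
    "InfiniBand/RoCE (400 Gbps)": 400 / 8,
    "NVLink, H100 (900 GB/s)": 900.0,
}

for name, gb_per_s in links_gb_per_s.items():
    print(f"{name}: {CHECKPOINT_GB / gb_per_s:.2f} s")
# cloud storage (5 Gbps): 44.80 s
# InfiniBand/RoCE (400 Gbps): 0.56 s
# NVLink, H100 (900 GB/s): 0.03 s
```

The 60x headline figure falls out of ratios in this range: a broadcast over a 200–400 Gbps fabric is 40–80x faster than a 5 Gbps storage download.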

For Runway's operations, NCCLBack enables dozens of deployments per day with minimal latency impact, faster autoscaling, and quicker feedback loops for research and engineering teams. It keeps GPUs performing inference rather than waiting on slow storage downloads, directly improving user-facing latency and cluster utilization. The peer-to-peer approach also scales better than distributed caching or shared storage: aggregate transfer bandwidth grows with the fleet itself, with no additional infrastructure to provision.

  • Enables Runway to deploy dozens of times daily with negligible cold-start overhead, improving autoscaling speed and user-facing latency
  • Avoids the operational complexity of distributed caches and shared storage by reusing idle GPU capacity already in the cluster
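The summary above names a verification layer but the article gives no detail on it. One plausible shape, purely a hypothetical sketch, is a per-rank digest of the received tensors compared against rank 0's reference:

```python
import hashlib
import torch
import torch.distributed as dist

def verify_broadcast(model: torch.nn.Module) -> bool:
    """Hypothetical verification step: hash local weights byte-for-byte
    and compare the digest against rank 0's reference digest."""
    hasher = hashlib.sha256()
    for _, tensor in sorted(model.state_dict().items()):
        raw = tensor.detach().cpu().contiguous().reshape(-1).view(torch.uint8)
        hasher.update(raw.numpy().tobytes())

    local = torch.frombuffer(bytearray(hasher.digest()), dtype=torch.uint8).cuda()
    reference = local.clone()
    dist.broadcast(reference, src=0)  # rank 0's digest is the ground truth
    return bool(torch.equal(local, reference))
```

A worker whose digest diverges would presumably fall back to a storage download; whether NCCLBack does anything like this is not stated in the article.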

Editorial Opinion

NCCLBack is an elegant exploitation of hardware topology that demonstrates how physical constraints drive infrastructure design. By recognizing that GPUs already have the bandwidth capacity to share weights and that peer discovery within a cluster is deterministic, Runway's team found a solution that's both simpler operationally and dramatically faster than alternatives like distributed caches or NFS. This work illustrates why the scaling problem for AI inference isn't just about raw compute—it's about moving data efficiently, and sometimes the best solution is already in your cluster.

Machine Learning · Deep Learning · MLOps & Infrastructure · AI Hardware
