OpenAI and Industry Partners Launch Multipath Reliable Connection Protocol for AI Infrastructure
Key Takeaways
- MRC increases switch port density and network redundancy by spreading the same aggregate bandwidth across more, slower links rather than maximizing individual port speeds
- The protocol enables dramatically lower-latency networks by cutting hop counts and switch counts, directly benefiting large-scale AI training performance
- Built-in link failure recovery lets AI training continue even when network connections fail, with repairs made without stopping jobs
Summary
OpenAI, Microsoft, Broadcom, AMD, and Nvidia have unveiled Multipath Reliable Connection (MRC), a new networking protocol designed to improve scalability and efficiency of large-scale AI clusters. Rather than pursuing ever-higher bandwidth on individual network ports, MRC takes a different engineering approach: it increases the number of network links between devices while using the same aggregate bandwidth, creating flatter networks with fewer switches. This strategy significantly reduces latency (fewer hops between nodes), lowers infrastructure costs, and reduces power consumption—all critical factors for massive AI training operations.
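The scaling intuition behind this trade-off can be sketched with standard folded-Clos (fat-tree) arithmetic, which is not from the MRC specification itself: a switch ASIC with fixed total capacity exposes more ports when each port runs slower, and a higher-radix switch can flatten the network because a 2-tier Clos reaches roughly radix²/2 hosts at full bisection. The 51.2 Tb/s ASIC figure below is an illustrative assumption.

```python
# Illustrative sketch: why lower per-port speed (higher switch radix)
# flattens the network. Assumes a fixed 51.2 Tb/s switch ASIC and the
# textbook folded-Clos host counts: a 2-tier fabric supports radix**2 / 2
# hosts, a 3-tier fabric radix**3 / 4. These numbers are not from MRC.
ASIC_CAPACITY_GBPS = 51_200  # hypothetical fixed switching capacity


def max_hosts(radix: int, tiers: int) -> int:
    """Max hosts at full bisection for a folded-Clos fabric."""
    if tiers == 2:
        return radix ** 2 // 2
    if tiers == 3:
        return radix ** 3 // 4
    raise ValueError("only 2- or 3-tier Clos modeled here")


for port_speed in (800, 400, 200):  # Gb/s per port
    radix = ASIC_CAPACITY_GBPS // port_speed
    print(f"{port_speed}G ports -> radix {radix}: "
          f"2-tier fits {max_hosts(radix, 2):,} hosts")
```

Halving port speed twice (800G to 200G) quadruples the radix and grows the 2-tier host count sixteen-fold, so clusters that would otherwise need a third switch tier (and two extra hops per path) can stay flat.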
MRC builds on existing RDMA over Converged Ethernet (RoCE) technology as a superset extension, making it a pragmatic middle ground between incremental improvements and the more radical Ultra Ethernet protocol. A key innovation is the protocol's link failure recovery mechanisms, which allow AI training jobs to continue operating even when individual network links fail, enabling repairs to happen without interrupting training runs—a critical capability for long-running distributed training jobs. The full specification, research paper, and Open Compute Project implementation have been publicly released, signaling the consortium's commitment to open industry standards.
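The failure-recovery idea can be illustrated with a toy multipath sender; this is a conceptual sketch, not the MRC wire protocol or its API. It stripes traffic across several links and transparently routes around a failed link, so sending continues while the link is repaired.

```python
# Conceptual sketch (not the actual MRC protocol): stripe packets across
# multiple links and skip any link marked as failed, so the workload
# keeps running during repair. All names here are hypothetical.
import itertools


class MultipathSender:
    def __init__(self, links):
        self.links = list(links)   # link names, all initially healthy
        self.failed = set()
        self._rr = itertools.cycle(range(len(self.links)))

    def _healthy(self):
        return [link for link in self.links if link not in self.failed]

    def send(self, packet):
        healthy = self._healthy()
        if not healthy:
            raise RuntimeError("all links down")
        # Round-robin over the currently healthy links only.
        link = healthy[next(self._rr) % len(healthy)]
        return link, packet

    def mark_failed(self, link):
        self.failed.add(link)      # traffic shifts to the remaining links

    def mark_repaired(self, link):
        self.failed.discard(link)  # link rejoins without restarting anything


sender = MultipathSender(["link0", "link1", "link2", "link3"])
sender.mark_failed("link2")
used = {sender.send(i)[0] for i in range(8)}
assert "link2" not in used  # failed link is skipped; sending never stops
```

A real transport additionally has to track per-path sequence numbers and retransmit in-flight packets that were lost on the failed link, but the core property is the same: losing one of many links degrades bandwidth instead of aborting the job.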
- Multi-company collaboration (OpenAI, Microsoft, Broadcom, AMD, Nvidia) signals broad industry alignment on Ethernet-based solutions for AI infrastructure
- Open-source specifications democratize access to enterprise-grade networking for AI, with broader OCP adoption expected
Editorial Opinion
The MRC protocol represents a mature, pragmatic evolution in AI infrastructure design. Rather than chasing theoretical maximum bandwidth, the consortium has correctly identified that distributed AI training benefits far more from redundancy, graceful failure handling, and flat network topologies. This collaborative, open approach to networking standards could accelerate adoption of more resilient infrastructure across the industry—though real-world deployment complexity and integration with existing systems will ultimately determine whether MRC achieves the scale its designers intended.