Ray Serve LLM Achieves Major Performance Improvements with 4.4x-24.8x Throughput Gains
Key Takeaways
- Ray Serve LLM now matches vllm-router performance, with up to 4.4x higher request throughput on prefill-heavy workloads and up to 24.8x on decode-heavy workloads
- Direct streaming decouples the control plane from the data plane, reducing routing overhead and improving per-token response latency
- HAProxy integration and the vLLM Ray Executor Backend V2 provide production-ready infrastructure for distributed LLM serving
Summary
Anyscale announced significant performance improvements to Ray Serve LLM, developed in partnership with the Google Kubernetes Engine (GKE) team at Google Cloud. The improvements deliver up to 4.4x higher request throughput on prefill-heavy workloads and up to 24.8x higher throughput on decode-heavy workloads, bringing Ray Serve LLM's performance on par with high-performance Rust-based routing frameworks like vllm-router.
Three major optimizations drive these gains: a direct streaming mode that decouples request routing from response streaming, a revamped vLLM Ray Executor Backend V2, and HAProxy integration for load balancing. Direct streaming eliminates bottlenecks by allowing HAProxy to establish direct HTTP connections with target replicas, which reduces overhead in the routing layer and improves time per output token (TPOT).
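The difference between relaying every token through the routing layer and streaming directly from a replica can be illustrated with a toy model. The sketch below is not Ray Serve or HAProxy code: `Replica`, `Router`, `proxied_stream`, and `direct_stream` are hypothetical names invented for illustration, and round-robin selection is an assumption standing in for whatever policy the real router uses. It only shows that, once routing is decoupled from streaming, the router makes one decision per request and handles no tokens.

```python
from itertools import cycle
from typing import Iterator, List


class Replica:
    """Hypothetical stand-in for a model replica that streams tokens."""

    def __init__(self, name: str) -> None:
        self.name = name

    def stream(self, prompt: str) -> Iterator[str]:
        # Emit one "token" per word of the prompt, tagged with the replica name.
        for word in prompt.split():
            yield f"{self.name}:{word}"


class Router:
    """Hypothetical router that counts how many tokens pass through it."""

    def __init__(self, replicas: List[Replica]) -> None:
        self._next = cycle(replicas)  # assumed policy: simple round-robin
        self.tokens_relayed = 0

    def pick(self) -> Replica:
        # Control-plane decision only: choose a replica, touch no tokens.
        return next(self._next)

    def proxied_stream(self, prompt: str) -> Iterator[str]:
        # Coupled mode: every token is relayed through the router.
        for token in self.pick().stream(prompt):
            self.tokens_relayed += 1
            yield token

    def direct_stream(self, prompt: str) -> Iterator[str]:
        # Decoupled mode: the router only picks a replica; the caller
        # then reads straight from that replica's stream.
        return self.pick().stream(prompt)


if __name__ == "__main__":
    router = Router([Replica("r0"), Replica("r1")])
    list(router.proxied_stream("hello from ray serve"))
    print("tokens relayed after proxied request:", router.tokens_relayed)
    list(router.direct_stream("hello from ray serve"))
    print("tokens relayed after direct request:", router.tokens_relayed)
```

In this toy model the relay counter grows with the number of output tokens only in proxied mode, which is why decode-heavy workloads (many output tokens per request) would be expected to benefit most from removing the router from the data path.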
These capabilities are now available in Ray 2.56, with HAProxy included in all rayproject/ray container images and installable via pip for custom deployments. The updates position Ray Serve LLM as a comprehensive solution for complex distributed LLM inference pipelines with heterogeneous hardware, offering fault tolerance, observability, and flexibility across Kubernetes and VM environments.
- Ray 2.56 includes HAProxy with LLM-optimized container images, available immediately for deployment


