Ray Serve LLM Achieves Major Performance Improvements with 4.4x-24.8x Throughput Gains

Key Takeaways

▸Ray Serve LLM now matches vllm-router performance, with up to 4.4x higher throughput on prefill-heavy workloads and up to 24.8x on decode-heavy workloads
▸Direct streaming decouples control plane from data plane, reducing routing overhead and improving response token latency
▸HAProxy integration and vLLM Ray Executor Backend V2 provide production-ready infrastructure for distributed LLM serving

Source:

Hacker News: https://www.anyscale.com/blog/high-performance-distributed-inference-ray-serve-llm-vllm-google-kubernetes-gke

Summary

Anyscale announced significant performance improvements to Ray Serve LLM, developed in partnership with the Google Kubernetes Engine (GKE) team at Google Cloud. The improvements deliver up to 4.4x higher request throughput on prefill-heavy workloads and up to 24.8x higher throughput on decode-heavy workloads, bringing Ray Serve LLM's performance on par with high-performance Rust-based routers such as vllm-router.

Three major optimizations drive these gains: direct streaming mode (which decouples request routing from response streaming), a revamped vLLM Ray Executor Backend V2, and HAProxy integration for load balancing. With direct streaming, HAProxy establishes direct HTTP connections with the target replicas, so response tokens no longer pass back through the routing layer. This removes a bottleneck, reduces routing overhead, and improves time-per-output-token latency.
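The idea of separating the routing decision from the response stream can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Ray Serve's or HAProxy's implementation: the `Replica`, `Router`, and `direct_stream` names are hypothetical, the "tokens" are just the words of the prompt, and the caller of `direct_stream` stands in for a load balancer that connects straight to the chosen replica. The counter shows how many response chunks the routing layer has to relay in each mode.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Replica:
    """Hypothetical model replica that streams tokens for a prompt."""
    name: str
    active: int = 0  # in-flight requests, used for least-loaded routing

    def stream(self, prompt: str) -> Iterator[str]:
        # Stand-in for decoding: emit the prompt back one word at a time.
        for token in prompt.split():
            yield token


@dataclass
class Router:
    """Hypothetical routing layer that counts the chunks it relays."""
    replicas: List[Replica]
    relayed_chunks: int = 0

    def pick(self) -> Replica:
        # Control-plane decision only: choose the least-loaded replica.
        return min(self.replicas, key=lambda r: r.active)

    def proxied_stream(self, prompt: str) -> Iterator[str]:
        # Coupled mode: every response chunk passes back through the router.
        replica = self.pick()
        replica.active += 1
        try:
            for token in replica.stream(prompt):
                self.relayed_chunks += 1
                yield token
        finally:
            replica.active -= 1


def direct_stream(router: Router, prompt: str) -> Iterator[str]:
    # Decoupled mode: the router only picks the replica; the caller then
    # reads the stream from that replica without the router in the path.
    replica = router.pick()
    replica.active += 1
    try:
        yield from replica.stream(prompt)
    finally:
        replica.active -= 1


router = Router([Replica("replica-0"), Replica("replica-1")])
prompt = "tokens flow from the replica to the client"

proxied = list(router.proxied_stream(prompt))
proxied_relays = router.relayed_chunks

direct = list(direct_stream(router, prompt))
direct_relays = router.relayed_chunks - proxied_relays

print("chunks relayed by router (proxied):", proxied_relays)
print("chunks relayed by router (direct):", direct_relays)
```

Both modes return the same tokens; the difference is that in the decoupled mode the routing layer does per-request work rather than per-token work, which is the kind of overhead the article describes direct streaming as removing.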

These capabilities are now available in Ray 2.56, with HAProxy included in all rayproject/ray container images and installable via pip for custom deployments. The updates position Ray Serve LLM as a comprehensive solution for complex distributed LLM inference pipelines with heterogeneous hardware, offering fault tolerance, observability, and flexibility across Kubernetes and VM environments.

Ray 2.56 includes HAProxy with LLM-optimized container images, available immediately for deployment

Anyscale

UPDATE Anyscale 2026-06-18
