BotBeat

Alibaba (Cloud)
RESEARCH · 2026-03-27

Alibaba Achieves 1M Tokens/Second Throughput with Qwen 3.5 27B on vLLM

Key Takeaways

  • Qwen 3.5 27B achieved 1 million tokens per second throughput on 96 B200 GPUs, setting a new benchmark for inference efficiency
  • The integration of vLLM optimization with Qwen models and NVIDIA B200 hardware demonstrates effective system-level scaling
  • This capability enables practical deployment of high-performance language models for latency-sensitive and throughput-demanding applications
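The article does not describe the serving configuration, but a multi-GPU vLLM deployment of a Qwen-family model is typically launched along these lines. The model identifier, parallelism degree, and batch limit below are placeholders for illustration, not Alibaba's actual setup:

```shell
# Hypothetical launch command; all values here are illustrative placeholders.
# vLLM's OpenAI-compatible server shards the model weights across GPUs via
# tensor parallelism and batches concurrent requests to maximize throughput.
vllm serve Qwen/Qwen2.5-32B-Instruct \
  --tensor-parallel-size 8 \
  --max-num-seqs 256 \
  --port 8000
```

At cluster scale, a fleet of such replicas sits behind a load balancer; the aggregate figure reported here would be the summed throughput across all replicas.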
Source: Hacker News (https://medium.com/google-cloud/1-million-tokens-per-second-qwen-3-5-27b-on-gke-with-b200-gpus-161da5c1b592)

Summary

Alibaba has demonstrated a significant scaling milestone with its Qwen 3.5 27B language model, reaching a throughput of 1 million tokens per second when deployed across 96 NVIDIA B200 GPUs using the vLLM inference engine. The result shows that massive inference workloads can be served at production scale, and it highlights the effectiveness of combining the Qwen model architecture with state-of-the-art inference optimization techniques and NVIDIA's latest GPU hardware. Throughput at this level makes large-scale language model inference markedly more practical and cost-effective for enterprise deployments.
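For scale, the headline figure implies a per-GPU rate that can be worked out directly. Only the 1M tokens/s aggregate and the 96-GPU count come from the article; the rest is simple arithmetic:

```python
# Break down the reported aggregate throughput into per-device terms.
# The 1,000,000 tokens/s and 96 GPUs are the figures from the article.
TOTAL_TOKENS_PER_SEC = 1_000_000
NUM_GPUS = 96

per_gpu_tps = TOTAL_TOKENS_PER_SEC / NUM_GPUS
print(f"Per-GPU throughput: {per_gpu_tps:,.0f} tokens/s")  # ~10,417

# Sustained daily output at full load, relevant for capacity planning.
tokens_per_day = TOTAL_TOKENS_PER_SEC * 86_400
print(f"Tokens per day: {tokens_per_day:,}")  # 86.4 billion
```

Roughly 10,400 tokens/s per B200 is the number to compare against other published serving benchmarks when judging how aggressive this result is.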

  • The scaling achievement suggests viable paths for cost-effective production serving of large language models at enterprise scale

Editorial Opinion

This throughput achievement represents a crucial milestone in making LLM inference economically viable at scale. The ability to squeeze 1M tokens/second from a moderately sized 27B model across 96 GPUs demonstrates the maturation of inference optimization techniques and hardware-software co-design. However, the real test lies in whether this performance translates to competitive per-token pricing and practical adoption in production environments.
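Whether the per-token economics work out depends on GPU pricing, which the article does not address. A rough sketch under a purely hypothetical rental rate shows how the break-even math goes; the $10/GPU-hour figure is an assumption for illustration, not a quoted price:

```python
# Illustrative cost-per-token estimate. The GPU-hour price is a made-up
# assumption; only the throughput and GPU count come from the article.
GPU_HOUR_COST_USD = 10.0        # hypothetical B200 rental rate (assumption)
NUM_GPUS = 96
TOKENS_PER_SEC = 1_000_000

cluster_cost_per_hour = GPU_HOUR_COST_USD * NUM_GPUS      # $960/hour
tokens_per_hour = TOKENS_PER_SEC * 3600                   # 3.6 billion tokens
cost_per_million_tokens = cluster_cost_per_hour / (tokens_per_hour / 1e6)
print(f"${cost_per_million_tokens:.4f} per million tokens")  # ~$0.2667
```

Under that assumed rate, serving cost lands well below typical API prices for models of this class, which is the sense in which such throughput can make self-hosted inference competitive; a different GPU price shifts the result proportionally.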

Large Language Models (LLMs) · Machine Learning · MLOps & Infrastructure · AI Hardware

More from Alibaba (Cloud)

Alibaba (Cloud)
RESEARCH

Security Researcher Reveals Telegram's AI Chatbot Uses Alibaba's Qwen 3.5 Model

2026-04-04
Alibaba (Cloud)
RESEARCH

Alibaba's AI Agent ROME Autonomously Hijacked GPUs, Opened SSH Tunnels, and Accessed Billing Systems During Training

2026-03-27
Alibaba (Cloud)
PRODUCT LAUNCH

Alibaba Unveils XuanTie C950: Custom 5nm RISC-V Chip Designed for Agentic AI Workloads

2026-03-24

Suggested

Google / Alphabet
RESEARCH

Deep Dive: Optimizing Sharded Matrix Multiplication on TPU with Pallas

2026-04-05
NVIDIA
RESEARCH

Nvidia Pivots to Optical Interconnects as Copper Hits Physical Limits, Plans 1,000+ GPU Systems by 2028

2026-04-05
Sweden Polytechnic Institute
RESEARCH

Research Reveals Brevity Constraints Can Improve LLM Accuracy by Up to 26.3%

2026-04-05
© 2026 BotBeat
About · Privacy Policy · Terms of Service · Contact Us