BotBeat

RESEARCH | Alibaba (Cloud) | 2026-03-27

Alibaba Achieves 1M Tokens/Second Throughput with Qwen 3.5 27B on vLLM

Key Takeaways

  • Qwen 3.5 27B achieved 1 million tokens per second throughput on 96 B200 GPUs, setting a new benchmark for inference efficiency
  • The integration of vLLM optimization with Qwen models and NVIDIA B200 hardware demonstrates effective system-level scaling
  • This capability enables practical deployment of high-performance language models for latency-sensitive and throughput-demanding applications
Source: Hacker News, https://medium.com/google-cloud/1-million-tokens-per-second-qwen-3-5-27b-on-gke-with-b200-gpus-161da5c1b592

Summary

Alibaba's Qwen 3.5 27B language model reached a throughput of 1 million tokens per second when served across 96 NVIDIA B200 GPUs with the vLLM inference engine. The result shows that combining the Qwen model architecture with current inference optimization techniques and NVIDIA's latest GPU hardware can sustain very large inference workloads at production scale. It also marks progress toward making large-scale language model inference practical and cost-effective for enterprise deployments.

  • The scaling achievement suggests viable paths for cost-effective production serving of large language models at enterprise scale
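The headline figure is easier to interpret per GPU. The following is a minimal sketch of that arithmetic, using only the two numbers reported above (1 million tokens per second, 96 GPUs). The function name is illustrative, and the calculation assumes load is spread evenly across GPUs, which the article does not confirm.

```python
def per_gpu_throughput(total_tokens_per_s: float, num_gpus: int) -> float:
    """Average tokens per second contributed by each GPU (even split assumed)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return total_tokens_per_s / num_gpus


TOTAL_TOKENS_PER_S = 1_000_000  # aggregate throughput reported in the article
NUM_GPUS = 96                   # number of B200 GPUs reported in the article

per_gpu = per_gpu_throughput(TOTAL_TOKENS_PER_S, NUM_GPUS)
print(f"{per_gpu:,.0f} tokens/s per GPU")  # prints "10,417 tokens/s per GPU"
```

That works out to roughly 10,400 tokens per second per GPU on average. The article does not say whether the total counts input tokens, output tokens, or both, so the per-GPU number inherits that ambiguity.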

Editorial Opinion

This throughput result is an important milestone toward making LLM inference economically viable at scale. Extracting 1M tokens/second from a moderately sized 27B model across 96 GPUs reflects the maturation of inference optimization techniques and hardware-software co-design. However, the real test is whether this performance translates into competitive per-token pricing and practical adoption in production environments.
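The per-token pricing question can be framed with a back-of-the-envelope calculation. The sketch below converts an hourly GPU price into a hardware cost per million tokens. The throughput and GPU count come from the article; the hourly price is a purely hypothetical placeholder, not a quoted rate, and the function name is illustrative. The estimate assumes the cluster sustains the reported throughput continuously and ignores networking, storage, and idle time.

```python
def cost_per_million_tokens(gpu_hour_price_usd: float,
                            num_gpus: int,
                            tokens_per_s: float) -> float:
    """Hardware cost in USD to process one million tokens at sustained throughput."""
    cluster_cost_per_hour = gpu_hour_price_usd * num_gpus
    tokens_per_hour = tokens_per_s * 3600
    return cluster_cost_per_hour / tokens_per_hour * 1_000_000


# HYPOTHETICAL price per GPU-hour, chosen only to illustrate the formula.
ASSUMED_GPU_HOUR_PRICE_USD = 5.00

cost = cost_per_million_tokens(
    gpu_hour_price_usd=ASSUMED_GPU_HOUR_PRICE_USD,
    num_gpus=96,             # from the article
    tokens_per_s=1_000_000,  # from the article
)
print(f"${cost:.3f} per million tokens")  # prints "$0.133 per million tokens"
```

Substituting a real GPU-hour rate and a realistic utilization factor would give a more useful figure. The point of the sketch is only that cost per token scales linearly with GPU price and inversely with sustained throughput.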

Large Language Models (LLMs) | Machine Learning | MLOps & Infrastructure | AI Hardware
