BotBeat
RESEARCH | Alibaba (Cloud) | 2026-03-27

Alibaba Achieves 1M Tokens/Second Throughput with Qwen 3.5 27B on vLLM

Key Takeaways

  • Qwen 3.5 27B reached an aggregate throughput of 1 million tokens per second on 96 NVIDIA B200 GPUs, or roughly 10,400 tokens per second per GPU
  • Pairing vLLM's serving optimizations with the Qwen model and NVIDIA B200 hardware shows that throughput can be scaled effectively at the system level
  • This capability supports practical deployment for throughput-demanding applications; because the figure is an aggregate across the cluster, it does not by itself describe per-request latency
Source: Hacker News (https://medium.com/google-cloud/1-million-tokens-per-second-qwen-3-5-27b-on-gke-with-b200-gpus-161da5c1b592)

Summary

Alibaba's Qwen 3.5 27B language model reached a throughput of 1 million tokens per second when deployed across 96 NVIDIA B200 GPUs with the vLLM inference engine; the linked source describes the deployment as running on GKE. The result shows that a 27B model can handle very large inference workloads at production scale by combining the Qwen model architecture, vLLM's inference optimizations, and NVIDIA's latest GPU hardware. It marks progress toward making large-scale language model inference practical and cost-effective for enterprise deployments.

  • The result suggests a viable path toward cost-effective production serving of large language models at enterprise scale
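The headline figures above imply an average per-GPU rate that is easy to check. The short Python sketch below does that arithmetic using only the two numbers reported in this article (1 million tokens per second, 96 GPUs). It assumes the load is spread evenly across the GPUs, which the article does not state, and the function names are illustrative.

```python
# Figures reported in the article.
TOTAL_TOKENS_PER_SECOND = 1_000_000
GPU_COUNT = 96

SECONDS_PER_DAY = 86_400


def per_gpu_throughput(total_tps: float, gpus: int) -> float:
    """Average tokens per second per GPU, assuming an even split (an assumption)."""
    if gpus <= 0:
        raise ValueError("gpus must be positive")
    return total_tps / gpus


def tokens_per_day(total_tps: float) -> float:
    """Tokens produced in 24 hours if the aggregate rate were sustained."""
    return total_tps * SECONDS_PER_DAY


per_gpu = per_gpu_throughput(TOTAL_TOKENS_PER_SECOND, GPU_COUNT)
daily = tokens_per_day(TOTAL_TOKENS_PER_SECOND)

print(f"Per-GPU average: {per_gpu:,.0f} tokens/s")   # about 10,417
print(f"Sustained for a day: {daily:,.0f} tokens")   # 86.4 billion
```

The per-GPU average is only a derived number; real clusters rarely sustain peak throughput around the clock, so the daily total is an upper bound under that assumption.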

Editorial Opinion

This throughput result is an important milestone in making LLM inference economically viable at scale. Reaching 1M tokens/second with a moderately sized 27B model across 96 GPUs reflects the maturation of inference optimization techniques and hardware-software co-design. The real test, however, is whether this performance translates into competitive per-token pricing and practical adoption in production environments.
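To make the per-token pricing question concrete, the sketch below converts a cluster's hourly GPU price and aggregate throughput into a hardware cost per million tokens. The throughput and GPU count come from this article; the hourly price is a purely hypothetical placeholder, not a quoted rate from any provider, and the calculation assumes full utilization with no other costs (networking, storage, idle time).

```python
def cost_per_million_tokens(gpu_hourly_usd: float, gpus: int, total_tps: float) -> float:
    """Hardware cost in USD per one million tokens at a sustained aggregate rate."""
    if gpus <= 0 or total_tps <= 0:
        raise ValueError("gpus and total_tps must be positive")
    cluster_usd_per_second = gpu_hourly_usd * gpus / 3600
    usd_per_token = cluster_usd_per_second / total_tps
    return usd_per_token * 1_000_000


# From the article.
GPU_COUNT = 96
TOTAL_TOKENS_PER_SECOND = 1_000_000

# Hypothetical placeholder price per GPU-hour; substitute a real quote.
ASSUMED_GPU_HOURLY_USD = 5.00

cost = cost_per_million_tokens(ASSUMED_GPU_HOURLY_USD, GPU_COUNT, TOTAL_TOKENS_PER_SECOND)
print(f"Hardware cost per 1M tokens: ${cost:.4f}")
```

At exactly 1 million tokens per second, the cost per million tokens equals the cluster's cost per second, so the result scales linearly with whatever GPU price is plugged in and inversely with the utilization actually achieved.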

Large Language Models (LLMs) | Machine Learning | MLOps & Infrastructure | AI Hardware
