Nitsum: Adaptive Tensor Parallelism Optimizes Multi-Tier LLM Serving Under Fixed GPU Budgets
Key Takeaways
- Nitsum reframes tensor parallelism as a runtime control knob rather than a fixed deployment choice, enabling dynamic optimization for mixed SLO tiers
- The system achieves up to a 5.3x improvement in SLO-compliant goodput (requests meeting both TTFT and TPOT targets) on the same GPU budget
- Tensor parallelism significantly affects decode performance through L2 cache efficiency: at lower batch sizes, higher TP reduces memory bandwidth bottlenecks, contrary to conventional wisdom about communication overhead
Summary
Nitsum is a novel LLM serving system that treats tensor parallelism (TP) as a dynamic runtime control surface rather than a fixed deployment parameter, so that a single cluster can efficiently serve heterogeneous workloads with different latency requirements. The system continuously reconfigures GPU allocation and TP levels to optimize for multiple service-level objectives (SLOs) simultaneously: Time To First Token (TTFT) for interactive requests and Time Per Output Token (TPOT) for batch workloads.
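To make the goodput metric concrete, here is a minimal sketch of counting only requests that meet both their TTFT and TPOT targets. The `RequestStats` type, its field names, and the per-second normalization are assumptions for illustration, not Nitsum's actual accounting.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestStats:
    """Hypothetical per-request measurements and targets, in seconds."""
    ttft_s: float      # measured time to first token
    tpot_s: float      # measured mean time per output token
    ttft_slo_s: float  # TTFT target for this request's tier
    tpot_slo_s: float  # TPOT target for this request's tier


def meets_slo(r: RequestStats) -> bool:
    # A request counts only if it meets BOTH latency targets.
    return r.ttft_s <= r.ttft_slo_s and r.tpot_s <= r.tpot_slo_s


def goodput(requests: Sequence[RequestStats], window_s: float) -> float:
    """SLO-compliant requests completed per second over a time window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return sum(1 for r in requests if meets_slo(r)) / window_s


if __name__ == "__main__":
    sample = [
        RequestStats(0.20, 0.03, ttft_slo_s=0.50, tpot_slo_s=0.05),  # meets both
        RequestStats(0.80, 0.03, ttft_slo_s=0.50, tpot_slo_s=0.05),  # misses TTFT
        RequestStats(2.00, 0.10, ttft_slo_s=5.00, tpot_slo_s=0.20),  # meets both
    ]
    print(goodput(sample, window_s=2.0))
```

Because each request carries its own targets, the same function covers mixed tiers: an interactive request and a background job can be judged against different thresholds within one window.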
Production LLM deployments increasingly face a challenge: sharing infrastructure, under a fixed GPU budget, across vastly different workloads (interactive chat, AI agents, background jobs) with incompatible latency expectations. Strict cluster separation wastes capacity; pooling requests together causes interference between SLO targets. Nitsum addresses this by dynamically adjusting TP to prioritize the execution characteristics each workload tier needs.
The research reveals a counterintuitive insight: at lower batch sizes, higher TP can actually improve decode throughput and Time Per Output Token. This occurs because smaller weight matrix slices fit in GPU on-chip cache, reducing memory bandwidth bottlenecks despite the additional cross-GPU communication overhead. By making TP switching nearly free and continuously optimizing cluster configuration, Nitsum achieves up to 5.3x improvement in SLO-compliant goodput compared to state-of-the-art systems.
In practice, this means a single LLM deployment can economically serve both latency-critical and batch workloads at once, eliminating the need for separate cluster tiers.
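The cache argument in the summary comes down to simple size arithmetic: sharding a weight matrix across more GPUs shrinks the slice each GPU holds. The sketch below illustrates that arithmetic only. The matrix shape, the 2-byte parameter size, and the 40 MiB cache budget are assumed values for illustration, not figures from the Nitsum work, and the function names are hypothetical.

```python
from typing import Optional, Sequence

MIB = 2 ** 20


def shard_bytes(rows: int, cols: int, bytes_per_param: int, tp: int) -> int:
    """Bytes of one weight matrix held per GPU when its columns are split across `tp` GPUs."""
    if tp <= 0 or cols % tp != 0:
        raise ValueError("cols must be evenly divisible by a positive tp")
    return rows * (cols // tp) * bytes_per_param


def min_tp_that_fits(
    rows: int,
    cols: int,
    bytes_per_param: int,
    cache_bytes: int,
    tp_options: Sequence[int] = (1, 2, 4, 8),
) -> Optional[int]:
    """Smallest TP degree whose per-GPU slice fits in the cache budget, or None."""
    for tp in sorted(tp_options):
        if shard_bytes(rows, cols, bytes_per_param, tp) <= cache_bytes:
            return tp
    return None


if __name__ == "__main__":
    # Assumed example: an 8192 x 8192 matrix in a 2-byte format, 40 MiB cache budget.
    for tp in (1, 2, 4, 8):
        size = shard_bytes(8192, 8192, 2, tp)
        print(f"TP={tp}: {size // MIB} MiB per GPU")
    print("smallest TP that fits:", min_tp_that_fits(8192, 8192, 2, 40 * MIB))
```

Under these assumed numbers the slice is 128 MiB at TP=1 and first drops below the 40 MiB budget at TP=4. Whether that translates into better TPOT on real hardware also depends on batch size and communication cost, which this sketch does not model.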
Editorial Opinion
Nitsum tackles an increasingly critical problem in production LLM infrastructure: how to serve incompatible workload types efficiently under cost constraints. By treating tensor parallelism as a dynamic variable, the research opens new optimization possibilities for systems engineers managing heterogeneous deployments. The 5.3x improvement in goodput is substantial, and the insight about cache efficiency in decode operations suggests further gains are possible. If incorporated into production serving systems, this work could meaningfully reduce the infrastructure costs of LLM inference.



