BotBeat

RESEARCH · Photoroom · 2026-04-08

Photoroom Builds Custom Load Balancer to Optimize GPU Inference Efficiency

Key Takeaways

  • Standard load balancing algorithms fail in low-throughput, high-latency scenarios with many backend nodes because each proxy only sees a fraction of total traffic
  • Photoroom built a custom Redis-based load balancer providing global visibility of in-flight requests across GPU pods, eliminating queueing delays while improving utilization
  • The architecture leverages Envoy's External Processing filter to implement distributed least-request load balancing with shared state, demonstrating the need for custom infrastructure at scale
Source: Hacker News
https://www.photoroom.com/inside-photoroom/optimizing-our-inference-backend-with-custom-load-balancing

Summary

Photoroom, an AI image processing platform, developed a custom load balancing system to optimize its GPU inference backend after deploying a slower but higher-quality AI model. The company's standard load balancing algorithms (Round Robin, Least Request, Power of Two Choices) proved insufficient because each proxy node only had visibility into the requests it had sent itself, creating a "local view problem": a GPU could look idle to one proxy while being overloaded by requests from the others. When inference time increased from 300ms to ~1 second per request, latency spikes reached 7 seconds at p90 and 20+ seconds at p99, even though the cluster as a whole had spare GPU capacity.
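The local-view failure mode described above can be shown with a toy simulation (hypothetical code, not Photoroom's): two least-request proxies each track only their own in-flight requests, so the second proxy happily routes onto a GPU the first proxy has already loaded.

```python
# Toy illustration of the "local view" problem: each proxy keeps its own
# in-flight counters and never sees requests dispatched by its peers.
class LocalLeastRequestProxy:
    def __init__(self, pods):
        # Counts only the requests THIS proxy has routed.
        self.local_inflight = {pod: 0 for pod in pods}

    def route(self):
        # Least-request decision based on the proxy's partial view.
        pod = min(self.local_inflight, key=self.local_inflight.get)
        self.local_inflight[pod] += 1
        return pod

pods = ["gpu-0", "gpu-1"]
proxy_a = LocalLeastRequestProxy(pods)
proxy_b = LocalLeastRequestProxy(pods)

# Proxy A sends a long-running request to some GPU...
busy = proxy_a.route()

# ...but proxy B's counters still read zero for every pod, so it can
# pile a second slow request onto the very same busy GPU.
assert proxy_b.local_inflight[busy] == 0
also_routed = proxy_b.route()
```

With many proxies and few, slow requests, each proxy sees so little traffic that its counters are almost always zero, and least-request degenerates toward random routing.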

To solve this, Photoroom implemented a Redis-backed load balancing system in which all Envoy proxy nodes share a single global view of in-flight requests across the entire GPU pod cluster. The solution uses Envoy's External Processing (ext_proc) filter to intercept requests at the header phase, query Redis for the least-loaded pod, increment that pod's in-flight counter, and decrement it when the response completes. This distributed least-request approach eliminates the information asymmetry that plagued the earlier algorithms, allowing each routing decision to be based on complete cluster state rather than partial local observations.
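A minimal sketch of that selection loop, under stated assumptions: the class and method names are hypothetical, and an in-memory stub stands in for the Redis sorted set of per-pod in-flight counts (in production the store would be Redis itself, queried from the ext_proc callbacks).

```python
# Sketch of globally shared least-request selection. CounterStore mimics
# the two sorted-set operations the balancer needs; with real Redis these
# would map to ZINCRBY and ZRANGE ... 0 0 (lowest score first).
class CounterStore:
    """Stand-in for a Redis sorted set of per-pod in-flight counts."""
    def __init__(self):
        self.scores = {}

    def zincrby(self, key, amount, member):
        self.scores[member] = self.scores.get(member, 0) + amount
        return self.scores[member]

    def zrange_min(self, key):
        # Member with the lowest score, i.e. the least-loaded pod.
        return min(self.scores, key=self.scores.get)

class GlobalLeastRequestBalancer:
    def __init__(self, store, pods):
        self.store = store
        for pod in pods:
            store.zincrby("inflight", 0, pod)  # register pods at zero

    def on_request_headers(self):
        # Header phase: pick the pod with the fewest in-flight requests
        # CLUSTER-WIDE, then claim a slot before forwarding.
        pod = self.store.zrange_min("inflight")
        self.store.zincrby("inflight", 1, pod)
        return pod

    def on_response_complete(self, pod):
        # Release the slot once the backend has answered.
        self.store.zincrby("inflight", -1, pod)

store = CounterStore()
lb = GlobalLeastRequestBalancer(store, ["gpu-0", "gpu-1", "gpu-2"])
first = lb.on_request_headers()
second = lb.on_request_headers()  # avoids the pod already holding a request
```

One caveat the sketch glosses over: against real Redis, the read-then-increment pair would need to be made atomic (e.g. a Lua script) so that two proxies racing on the same counter cannot both claim the same "least-loaded" pod.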

Editorial Opinion

Photoroom's experience highlights a critical gap in off-the-shelf load balancing solutions for modern AI inference workloads. While traditional algorithms work well for high-volume, low-latency scenarios, AI services operating at scale with expensive GPU resources need visibility into global state to make optimal routing decisions. The Redis-based solution is pragmatic but also suggests that the infrastructure layer for AI serving is still maturing—future platforms may need to bake these patterns directly into their routing layers rather than requiring companies to build custom solutions.

MLOps & Infrastructure · AI Hardware

© 2026 BotBeat