Meta's In-Kernel Broadcast Optimization Cuts Recommendation Inference Latency by 2/3
Key Takeaways
- IKBO eliminates redundant user-embedding broadcast by treating it as a data-layout concern inside kernels rather than an explicit computation
- Achieves up to a 2/3 reduction in compute-intensive net latency and a 4× speedup for Linear Compression kernels through progressive co-design stages
- Delivers a 2.4-6.4× throughput improvement for Flash Attention kernels and shifts them from IO-bound to compute-bound operation
Summary
Meta has published research on In-Kernel Broadcast Optimization (IKBO), a kernel-model-system co-design approach that eliminates redundant user-embedding broadcast in recommendation model inference. Traditional recommendation systems explicitly replicate shared user embeddings for every candidate item, wasting memory bandwidth and compute resources that scale linearly with candidate count. IKBO solves this by fusing broadcast logic directly into user-candidate interaction kernels, so replicated tensors never materialize.
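The redundancy IKBO targets can be seen in a minimal NumPy sketch (illustrative only; the function names are hypothetical and this is not Meta's kernel code). The naive path tiles the shared user embedding once per candidate, so memory traffic grows linearly with the candidate count; the broadcast path computes the same interaction without ever materializing the replicated tensor:

```python
import numpy as np

def interact_naive(user_emb, cand_embs):
    # Explicit replication: one physical copy of the user embedding
    # per candidate, so memory traffic scales with candidate count.
    tiled = np.tile(user_emb, (cand_embs.shape[0], 1))
    return (tiled * cand_embs).sum(axis=-1)

def interact_broadcast(user_emb, cand_embs):
    # Broadcasting treats replication as a layout concern:
    # the tiled tensor is never materialized in memory.
    return (user_emb[None, :] * cand_embs).sum(axis=-1)

rng = np.random.default_rng(0)
user = rng.standard_normal(64)          # one shared user embedding
cands = rng.standard_normal((1000, 64)) # 1000 candidate item embeddings

scores_naive = interact_naive(user, cands)
scores_bcast = interact_broadcast(user, cands)
assert np.allclose(scores_naive, scores_bcast)
```

IKBO applies the same idea inside fused GPU kernels, where the saved bandwidth translates directly into latency reductions.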
The optimization delivers substantial performance improvements across Meta's production systems. The IKBO Linear Compression kernel achieved a cumulative 4× speedup on H100 SXM5 hardware after four progressive co-design stages: matmul decomposition, memory alignment, broadcast fusion, and warp-specialized multi-stage fusion. For Flash Attention kernels, IKBO delivers a 2.4-6.4× throughput improvement over non-optimized baselines while shifting the operation from IO-bound to compute-bound, reaching 621 BF16 TFLOPs.
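The matmul-decomposition and broadcast-fusion stages can be illustrated with a small NumPy sketch (a hypothetical simplification, not FBGEMM's implementation). Instead of concatenating the user embedding onto every candidate row and running one large matmul over the replicated input, the weight matrix is split so the user portion is computed once and broadcast-added:

```python
import numpy as np

def linear_naive(user_emb, cand_embs, W, b):
    # Baseline: replicate the user embedding onto every candidate row,
    # then run one matmul over the enlarged input.
    n = cand_embs.shape[0]
    x = np.concatenate([np.tile(user_emb, (n, 1)), cand_embs], axis=-1)
    return x @ W + b

def linear_decomposed(user_emb, cand_embs, W, b):
    # Decompose W into user and candidate blocks: the user part is
    # computed once and broadcast over candidates, never replicated.
    du = user_emb.shape[-1]
    W_user, W_cand = W[:du], W[du:]
    user_part = user_emb @ W_user + b       # computed once, shape (d_out,)
    return cand_embs @ W_cand + user_part   # broadcast-add per candidate

rng = np.random.default_rng(1)
user = rng.standard_normal(64)              # user embedding, dim 64
cands = rng.standard_normal((512, 32))      # 512 candidates, dim 32
W = rng.standard_normal((96, 128))          # (64 + 32) -> 128 projection
b = rng.standard_normal(128)

out_naive = linear_naive(user, cands, W, b)
out_decomp = linear_decomposed(user, cands, W, b)
assert np.allclose(out_naive, out_decomp, atol=1e-9)
```

In a real kernel the two paths differ not in arithmetic but in memory traffic: the decomposed form reads the user embedding once instead of `n` times, which is where the reported speedups come from.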
Deployed end-to-end across Meta's multi-stage recommendation funnel on both GPU and MTIA (Meta Training and Inference Accelerator) hardware, IKBO serves as the scalability backbone for Meta's request-centric framework powering the Meta Adaptive Ranking Model in production, enabling LLM-scale recommendation models. The implementation has been open-sourced in the PyTorch FBGEMM repository.
Editorial Opinion
This work exemplifies the outsized returns of kernel-level optimization for inference workloads at scale. Rather than applying system-level workarounds, IKBO tackles redundancy at the computational primitive layer, a co-design philosophy that achieves 4-6.4× speedups compounding across multi-stage ranking pipelines. For any organization running recommendation systems at scale, IKBO demonstrates that careful attention to kernel design can unlock efficiency gains that higher-level optimizations cannot match.



