FairyFuse Achieves 32.4 Tokens/sec on Intel CPUs Using Multiplication-Free Ternary LLM Inference
Key Takeaways
- Multiplication-free inference via ternary weights (-1, 0, +1) achieves 32.4 tokens/sec on an Intel Xeon, 1.24x faster than Q4_K_M quantization
- Fused AVX-512 kernels eliminate floating-point multiplications entirely, shifting the bottleneck from memory to compute on bandwidth-limited CPUs
- Near-lossless quality maintained: WikiText-2 perplexity of 5.52 vs. 5.47 for FP16, with 66.0% downstream task accuracy
- 29.6x kernel speedup demonstrates that aggressive quantization can dramatically improve CPU inference when properly optimized for the hardware architecture
Summary
Researcher Paul Houle has published FairyFuse, an inference system that enables large language models to run significantly faster on CPU-only platforms by eliminating floating-point multiplications entirely. The system leverages ternary weight quantization (representing weights as {-1, 0, +1}) and fuses the computation into single AVX-512 loops using only masked additions and subtractions. Because ternary weights pack into a small fraction of the storage of FP16, the approach also attacks a critical bottleneck in CPU inference: memory bandwidth.
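The write-up does not include code, but the core trick is easy to sketch. Below is a minimal, hypothetical illustration (not FairyFuse's actual kernel, and all names are ours): a ternary dot product in which the weights have been pre-unpacked into two bitmasks per group of 16 lanes, so the inner loop needs only AVX-512 masked adds and subtracts. Activations are assumed to be int32 for simplicity; a real kernel would also handle weight packing, loop tails, and activation quantization.

```c
// Hypothetical sketch of a multiplication-free ternary dot product using
// AVX-512 masked add/sub. Compile with -mavx512f.
#include <immintrin.h>
#include <stdint.h>

// x:          int32 activations, length n (n assumed a multiple of 16 here)
// plus_mask:  one 16-bit mask per group of 16 lanes; bit set where weight == +1
// minus_mask: one 16-bit mask per group of 16 lanes; bit set where weight == -1
// (lanes in neither mask have weight 0 and contribute nothing)
int32_t ternary_dot(const int32_t *x,
                    const uint16_t *plus_mask,
                    const uint16_t *minus_mask,
                    int n)
{
    __m512i acc = _mm512_setzero_si512();
    for (int g = 0; g < n / 16; g++) {
        __m512i v = _mm512_loadu_si512(x + 16 * g);
        // Masked lanes get acc + v (weight +1); others keep acc unchanged.
        acc = _mm512_mask_add_epi32(acc, plus_mask[g], acc, v);
        // Masked lanes get acc - v (weight -1); others keep acc unchanged.
        acc = _mm512_mask_sub_epi32(acc, minus_mask[g], acc, v);
    }
    return _mm512_reduce_add_epi32(acc); // horizontal sum of 16 lanes
}
```

Each weight contributes +x, -x, or nothing, so no multiply instruction ever executes; the mask registers are, in effect, the ternary weight values.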
On an Intel Xeon 8558P, FairyFuse achieves 32.4 tokens per second, outperforming the widely used Q4_K_M quantization by 1.24x while maintaining near-lossless quality (WikiText-2 perplexity of 5.52 vs. 5.47 for FP16). Roofline analysis shows that 16x weight compression shifts memory-bound operations toward the compute-bound regime on bandwidth-limited CPUs, yielding a 29.6x kernel speedup, though the benefits are negligible on GPUs. The research builds on earlier work showing that ternary LLMs can match FP16 quality, but FairyFuse is the first system to efficiently exploit this structure in production inference.
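To see why compression matters so much, recall the roofline model: attainable throughput is the minimum of the compute roof and bandwidth times arithmetic intensity. The back-of-the-envelope sketch below uses purely illustrative numbers of our own choosing (nothing here is measured data from the FairyFuse work) to show the shape of the argument.

```c
// Roofline back-of-the-envelope with illustrative, assumed numbers.
#include <stdio.h>

// Classic roofline: attainable throughput is capped by the lower of the
// compute roof and (memory bandwidth x arithmetic intensity).
double attainable(double ops_per_byte, double peak_ops, double bytes_per_s)
{
    double memory_roof = bytes_per_s * ops_per_byte;
    return memory_roof < peak_ops ? memory_roof : peak_ops;
}

int main(void)
{
    double peak_ops  = 2.0e12; // assumed peak add/sub throughput, ops/s
    double bw        = 3.0e11; // assumed memory bandwidth, bytes/s
    double intensity = 1.0;    // assumed ops per weight byte at FP16

    // Token generation is GEMV-like: every weight byte is read once, so
    // shrinking weights 16x multiplies arithmetic intensity by ~16.
    printf("FP16 weights:   %.2e ops/s (on the bandwidth slope)\n",
           attainable(intensity, peak_ops, bw));
    printf("16x compressed: %.2e ops/s (at the compute roof)\n",
           attainable(16.0 * intensity, peak_ops, bw));
    return 0;
}
```

With these assumed numbers the FP16 case sits on the bandwidth slope while the compressed case hits the compute roof, which matches the reported pattern of large gains on bandwidth-limited CPUs and negligible gains on GPUs.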
The work has immediate implications for CPU-based LLM deployment, particularly in cost-constrained, edge, and data-center environments where GPUs are unavailable or economically impractical.
Editorial Opinion
FairyFuse represents a meaningful step toward making LLM inference practical on commodity CPUs. By rethinking the entire computation pipeline around ternary weights rather than bolting quantization onto existing systems, Houle demonstrates that CPU inference can achieve competitive performance without GPU acceleration. This work is particularly valuable for edge deployments, developing-world applications, and cost-sensitive workloads where GPU clusters are infeasible. The research community would benefit from an open-source implementation and broader benchmarking across additional model architectures and datasets.


