Developer Creates World's Smallest Llama2 Inference Engine in 1356 Bytes of x86 Assembly
Key Takeaways
- Complete Llama2 inference engine implemented in just 1356 bytes of x86 assembly code
- Boots directly from disk and generates text before the OS loads, running the stories260K model with 260K parameters
- Uses aggressive int8 quantization, precomputed operation tables, and weight matrix fusion to minimize code size
- Maintains the full transformer architecture, fitting a 512-token context window within the limited memory available in real mode
- Open-source project inviting community contributions to further optimize the assembly-level code
Summary
A developer known as monax has created what may be the world's smallest Llama2 inference engine, fitting a complete language model inference system into just 1356 bytes of x86 real mode assembly. The implementation boots directly from disk and loads a quantized Llama2 model trained on children's stories, featuring 260K parameters across 5 layers and 8 attention heads with a 512-token vocabulary. It generates text before any operating system loads, demonstrating that full transformer inference is possible in minimal space.
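The reported model shape is small enough to state as a handful of constants. The C struct below simply records the figures from the article; the field names follow common llama2.c-style conventions and are illustrative, not taken from monax's code:

```c
/* Model shape as reported in the article. Field names are
   illustrative (llama2.c-style); values come from the article. */
typedef struct {
    int n_layers;   /* 5 transformer layers */
    int n_heads;    /* 8 attention heads */
    int vocab_size; /* 512-token vocabulary */
    int seq_len;    /* 512-token context window */
} Config;

static const Config stories260k = { 5, 8, 512, 512 };
```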
The extreme optimization leverages several techniques: int8 quantization with global absmax scaling, precomputed lookup tables for the exponential and SiLU activation functions, and fused weight matrices that reduce three separate matrix multiplications to a single pass. The KV cache is quantized at runtime with per-token scaling, allowing the full 512-token context window to fit within the limited memory available to the boot-loaded program.
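The quantization scheme can be sketched in C. Absmax scaling maps a tensor's largest magnitude onto the int8 range and keeps one float scale for dequantization; the KV-cache variant simply applies the same routine per token at runtime. Function and variable names here are illustrative, not from monax's assembly:

```c
#include <math.h>
#include <stdint.h>
#include <stddef.h>

/* Global absmax int8 quantization: one scale for the whole tensor.
   Illustrative sketch; the real engine does this in x86 assembly. */
float quantize_absmax(const float *x, int8_t *q, size_t n) {
    float absmax = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = fabsf(x[i]);
        if (a > absmax) absmax = a;
    }
    float scale = absmax / 127.0f;   /* map [-absmax, absmax] onto [-127, 127] */
    if (scale == 0.0f) scale = 1.0f; /* guard against an all-zero tensor */
    for (size_t i = 0; i < n; i++)
        q[i] = (int8_t)lroundf(x[i] / scale);
    return scale;                    /* kept for dequantization: x ≈ q * scale */
}

/* Per-token KV-cache quantization: each cached key/value vector gets
   its own scale, computed at runtime as the token is appended. */
void cache_token(const float *kv, int8_t *cache, float *scales,
                 int token, int dim) {
    scales[token] = quantize_absmax(kv, cache + (size_t)token * dim, (size_t)dim);
}
```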
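Precomputed activation tables trade a little memory for a lot of code: since inputs are int8, every activation has at most 256 possible arguments, so exp() and SiLU can be tabulated once and replaced by an indexed load. A hedged sketch, where the table layout and input scale are assumptions:

```c
#include <math.h>
#include <stdint.h>

/* 256-entry lookup tables indexed by an int8 value offset by 128.
   The input scale is an assumed example, not monax's actual value. */
static float exp_table[256];
static float silu_table[256];

void build_tables(float scale) {
    for (int i = 0; i < 256; i++) {
        float x = (i - 128) * scale;            /* dequantize the index */
        exp_table[i]  = expf(x);                /* used by softmax */
        silu_table[i] = x / (1.0f + expf(-x));  /* SiLU(x) = x * sigmoid(x) */
    }
}

static inline float exp_q(int8_t q)  { return exp_table[q + 128]; }
static inline float silu_q(int8_t q) { return silu_table[q + 128]; }
```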
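Weight fusion works by stacking the query, key, and value projection matrices row-wise into one (3·dim)×dim matrix, so a single matmul loop emits all three projections. The sketch below uses float and illustrative names to keep the idea visible; the actual engine operates on quantized weights:

```c
#include <stddef.h>

/* Fused QKV projection: W_q, W_k, W_v stacked row-wise into w_qkv.
   One pass over x fills qkv_out, where q = qkv_out, k = qkv_out + dim,
   and v = qkv_out + 2*dim. Names are illustrative. */
void fused_qkv(const float *w_qkv, const float *x,
               float *qkv_out, int dim) {
    for (int r = 0; r < 3 * dim; r++) {   /* one loop instead of three matmuls */
        float acc = 0.0f;
        const float *row = w_qkv + (size_t)r * dim;
        for (int c = 0; c < dim; c++)
            acc += row[c] * x[c];
        qkv_out[r] = acc;
    }
}
```

In assembly, the payoff is that one matmul routine with a larger row count replaces three separate calls, saving both instructions and per-call setup.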
While intentionally optimized for minimal size at the expense of performance and precision, the project demonstrates the practical limits of transformer inference on constrained hardware. The creator invites assembly-level contributions from the community to further reduce the binary footprint, and notes that scaling to larger models like Llama2-15M would require switching to protected or unreal mode to access additional memory: real mode can address only about 1 MiB, while 15M parameters at one byte each already need roughly 15 MB.


