Critical Vulkan GPU Bug Fixed in llama.cpp for 32-bit ARM Devices
Key Takeaways
- A tensor stride calculation bug in llama.cpp was silently disabling Vulkan GPU acceleration on all 32-bit ARM devices
- The overflow bug affected 32-bit systems but was masked on 64-bit architectures, which explains why it went undetected for years
- The fix removes a critical performance bottleneck for AI inference on 32-bit ARM mobile devices such as smartwatches
Summary
A long-standing bug that silently disabled Vulkan GPU acceleration on all 32-bit ARM devices has been identified and fixed in llama.cpp. The issue was discovered while running the framework on a Samsung Galaxy Watch 4 Classic and traced to a missing block-size division in the tensor stride calculation inside the create_tensor() function in llama-model-loader.cpp. The faulty stride caused ggml_nbytes() to return inflated values that exceeded max_buffer_size limits on 32-bit systems, so the Vulkan backend rejected quantized MUL_MAT operations even while reporting successful layer offloading to the GPU.
On 64-bit devices, the same bug was masked by the larger address space, allowing incorrect stride values to remain within GPU memory limits without triggering failures. This explains why the issue went undetected for years despite affecting a broad range of 32-bit ARM hardware. The fix has been made available on the axon-dev branch, addressing a critical performance limitation for mobile and embedded AI inference scenarios.
The bug demonstrates how integer overflow issues can have asymmetric impact across different hardware architectures.
Editorial Opinion
This bug fix highlights an important lesson in cross-platform GPU development: silent failures on edge architectures can persist for years when testing primarily focuses on mainstream 64-bit systems. The discovery that quantized operations were being rejected on 32-bit ARM while appearing to work correctly emphasizes the need for architecture-specific testing in AI inference frameworks targeting diverse hardware.