zml-smi: Universal GPU, TPU, and NPU Monitoring Tool Now Available
Key Takeaways
- zml-smi is a unified monitoring tool supporting NVIDIA, AMD, Google TPU, and AWS Trainium devices, with plans for future expansion
- The tool reports real-time performance metrics across all platforms: GPU utilization, temperature, power draw, memory usage, and process-level insights
- zml-smi requires only device drivers and GLIBC, making it lightweight and easy to deploy without additional software dependencies
Summary
ZML has released zml-smi, a universal diagnostic and monitoring tool designed to provide real-time insights into the performance and health of GPUs, TPUs, and NPUs across multiple hardware platforms. The tool transparently supports NVIDIA, AMD, Google TPU, and AWS Trainium devices, with plans to expand support for additional platforms as ZML's hardware compatibility grows. zml-smi combines the functionality of nvidia-smi and nvtop into a single cross-platform utility that requires minimal dependencies—only device drivers and GLIBC—making it lightweight and easy to deploy.
The tool offers comprehensive monitoring capabilities including GPU utilization, temperature, power draw, memory usage, and detailed process-level metrics across all supported platforms. zml-smi displays host-level system information such as CPU model, memory usage, and load averages, while also providing device-specific metrics tailored to each hardware type. Notably, the tool implements innovative engineering solutions, such as intercepting file system calls for AMD GPU driver compatibility, to ensure seamless operation across diverse hardware ecosystems without requiring external installations or patches.
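The article does not publish zml-smi's internals, but on Linux the device-level metrics it describes (temperature, power draw) are commonly exposed for AMD GPUs through the hwmon sysfs interface. The sketch below is a minimal, hypothetical illustration of how a monitor could poll and scale such readings; the paths and file names are standard hwmon conventions, not confirmed zml-smi behavior.

```python
from pathlib import Path

# Hypothetical sketch, not zml-smi's actual code: Linux hwmon exposes
# temperature in millidegrees Celsius and power in microwatts as plain
# integer text files, which a monitoring tool could poll each refresh.

def parse_hwmon_metrics(raw: dict) -> dict:
    """Convert raw hwmon readings (integer strings) into human units."""
    out = {}
    if "temp1_input" in raw:
        # hwmon reports millidegrees Celsius
        out["temp_c"] = int(raw["temp1_input"]) / 1000.0
    if "power1_average" in raw:
        # hwmon reports microwatts
        out["power_w"] = int(raw["power1_average"]) / 1_000_000.0
    return out

def read_hwmon(device_dir: str) -> dict:
    """Read metric files from a hwmon directory, e.g. /sys/class/hwmon/hwmon3."""
    raw = {}
    for name in ("temp1_input", "power1_average"):
        path = Path(device_dir) / name
        if path.exists():
            raw[name] = path.read_text().strip()
    return parse_hwmon_metrics(raw)
```

Keeping parsing separate from file I/O, as above, lets the same conversion logic back different vendor sources (hwmon, NVML, TPU telemetry) behind one interface.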
The tool also uses sandboxing and API-interception techniques to support the latest hardware models without requiring driver updates or system modifications.
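The interception idea can be made concrete with a small analogue. zml-smi reportedly intercepts native file system calls; the Python sketch below mimics the same pattern at a higher level by wrapping `builtins.open` so that reads of one hypothetical device path are served synthetic content. The path and its contents are invented for illustration and are not from the article.

```python
import builtins
import io

# Illustrative analogue only: the real tool intercepts calls natively.
# Here we wrap Python's open() so a hypothetical sysfs path that a newer
# driver might not provide is answered with stand-in content instead.

_real_open = builtins.open

REDIRECTS = {
    # hypothetical path -> synthetic contents to present to the caller
    "/sys/class/drm/card0/device/fake_metric": "12345\n",
}

def intercepting_open(file, mode="r", *args, **kwargs):
    key = str(file)
    if key in REDIRECTS and "r" in mode and "b" not in mode:
        # Serve synthetic content; StringIO supports the file protocol.
        return io.StringIO(REDIRECTS[key])
    # Everything else passes through to the real open() untouched.
    return _real_open(file, mode, *args, **kwargs)

builtins.open = intercepting_open
```

After installing the wrapper, `open("/sys/class/drm/card0/device/fake_metric").read()` returns the synthetic value while all other paths behave normally; a native equivalent would do the same via syscall or dynamic-linker interception.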
Editorial Opinion
zml-smi addresses a significant pain point in the AI infrastructure ecosystem by providing a unified monitoring solution across fragmented hardware platforms. As AI workloads increasingly leverage diverse accelerators beyond NVIDIA GPUs, having a single tool that works seamlessly across NVIDIA, AMD, Google, and AWS hardware is valuable for operations teams and researchers. The technical approach—particularly the creative sandboxing solution for AMD drivers—demonstrates thoughtful engineering that prioritizes ease of deployment and minimal system impact.