ZML Releases Universal Diagnostic Tool for GPUs, TPUs, and NPUs Across All Major Platforms
Key Takeaways
- zml-smi provides unified monitoring across NVIDIA, AMD, Google TPU, and AWS Trainium devices with a single interface
- The tool reports comprehensive metrics, including GPU utilization, temperature, power draw, memory usage, and process-level resource consumption
- zml-smi uses creative sandboxing techniques to support the latest AMD GPU models without requiring system-level installations or library patches
Summary
ZML has launched zml-smi, a universal diagnostic and monitoring tool designed to provide real-time performance insights across multiple AI hardware platforms including NVIDIA GPUs, AMD GPUs, Google TPUs, and AWS Trainium devices. The tool combines functionality similar to nvidia-smi and nvtop, offering comprehensive hardware monitoring capabilities without requiring additional software beyond device drivers and GLIBC.
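A unified monitor of this kind is typically structured as a single vendor-agnostic metrics schema with one backend per hardware platform. The sketch below shows that shape in Python; all names (`DeviceMetrics`, `Backend`, `collect`) and the canned values are illustrative assumptions, not zml-smi's actual API, which is a compiled binary.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class DeviceMetrics:
    """One vendor-agnostic metrics record, matching the fields zml-smi displays."""
    name: str
    utilization_pct: float
    temperature_c: float
    power_draw_w: float
    memory_used_mb: int
    memory_total_mb: int

class Backend(Protocol):
    """Per-vendor backend. A real implementation would wrap NVML, AMD SMI,
    a TPU gRPC endpoint, or the Trainium APIs."""
    def probe(self) -> bool: ...
    def query(self) -> list[DeviceMetrics]: ...

class FakeNvmlBackend:
    """Stand-in for an NVML-style backend; returns canned data for this sketch."""
    def probe(self) -> bool:
        return True  # a real backend would check whether the driver library loads

    def query(self) -> list[DeviceMetrics]:
        return [DeviceMetrics("GPU 0", 42.0, 61.0, 180.5, 8192, 24576)]

def collect(backends) -> list[DeviceMetrics]:
    """Query every backend whose runtime is actually present on this host."""
    devices: list[DeviceMetrics] = []
    for backend in backends:
        if backend.probe():
            devices.extend(backend.query())
    return devices
```

The probe-then-query split is what lets one binary run on any machine: backends for absent hardware simply report unavailable and are skipped.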
zml-smi displays an extensive range of metrics including GPU utilization, temperature, power draw, memory usage, and process-level resource consumption. The tool uses platform-specific libraries—NVML for NVIDIA, AMD SMI for AMD, gRPC for Google TPU, and private APIs for AWS Trainium—to gather accurate performance data. A key innovation is its ability to recognize the latest AMD GPU models by dynamically merging GPU identification files from both Mesa and ROCm at build time, ensuring support for cutting-edge hardware like the Ryzen AI Max+ 395.
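The build-time merge of Mesa's and ROCm's GPU identification data can be pictured as a union of two lookup tables keyed by device ID. The sketch below uses a simplified `id<TAB>name` format with placeholder IDs; the real file formats and precedence rules in zml-smi may differ.

```python
def parse_ids(text: str) -> dict[str, str]:
    """Parse 'device_id<TAB>marketing name' lines into a lookup table,
    skipping blank lines and '#' comments."""
    table: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        dev_id, _, name = line.partition("\t")
        table[dev_id.lower()] = name.strip()
    return table

def merge_ids(mesa: dict[str, str], rocm: dict[str, str]) -> dict[str, str]:
    """Union of both tables; entries from the second table win on conflicts,
    so newer ROCm names override older Mesa ones."""
    merged = dict(mesa)
    merged.update(rocm)
    return merged

# Placeholder IDs for illustration only -- not real PCI device IDs.
mesa = parse_ids("0x1111\tExample GPU A\n0x2222\tExample GPU B")
rocm = parse_ids("# updated table\n0x2222\tExample GPU B (new name)\n0x3333\tExample GPU C")
merged = merge_ids(mesa, rocm)
```

Merging at build time means a device known to either source is recognized at runtime with no extra files installed on the host.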
The tool is available for download as a self-contained binary that works across different hardware configurations. zml-smi also provides host-level metrics such as CPU model, memory usage, and process details with full cross-platform compatibility, making it a significant step toward unified hardware monitoring in the increasingly diverse AI accelerator landscape.
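On Linux, host-level metrics like these are available through the `/proc` filesystem without any extra dependencies, which fits the tool's minimal-footprint design. The sketch below reads CPU model and memory usage that way; the function names are illustrative, and the parsing is simplified relative to whatever zml-smi actually does.

```python
def read_mem_info() -> dict[str, int]:
    """Parse /proc/meminfo into a dict of values in kB."""
    info: dict[str, int] = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, _, rest = line.partition(":")
            parts = rest.split()
            if parts:
                info[key.strip()] = int(parts[0])  # first field is the kB value
    return info

def read_cpu_model() -> str:
    """Return the CPU model string from /proc/cpuinfo, if present."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.lower().startswith("model name"):
                return line.split(":", 1)[1].strip()
    return "unknown"  # some architectures label the field differently

if __name__ == "__main__":
    mem = read_mem_info()
    total_kb = mem.get("MemTotal", 0)
    used_kb = total_kb - mem.get("MemAvailable", 0)
    print(f"CPU: {read_cpu_model()}")
    print(f"Memory: {used_kb / 1024:.0f} MiB / {total_kb / 1024:.0f} MiB")
```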
Editorial Opinion
The release of zml-smi addresses a growing pain point in the AI hardware ecosystem: the fragmentation of monitoring tools across different accelerator vendors. As organizations increasingly adopt diverse hardware accelerators, having a unified diagnostic tool that works across NVIDIA, AMD, Google, and AWS platforms significantly improves operational efficiency. The technical implementation, particularly the clever sandboxing approach for AMD GPU support, demonstrates thoughtful engineering that balances compatibility with maintainability.