zml-smi: Universal GPU, TPU, and NPU Monitoring Tool Now Available
Key Takeaways
- zml-smi is a unified monitoring tool supporting NVIDIA, AMD, Google TPU, and AWS Trainium devices, with plans for future expansion
- The tool reports real-time performance metrics across all platforms: GPU utilization, temperature, power draw, memory usage, and process-level insights
- zml-smi requires only device drivers and GLIBC, making it lightweight and easy to deploy without additional software dependencies
Summary
ZML has released zml-smi, a universal diagnostic and monitoring tool designed to provide real-time insights into the performance and health of GPUs, TPUs, and NPUs across multiple hardware platforms. The tool transparently supports NVIDIA, AMD, Google TPU, and AWS Trainium devices, with plans to expand support for additional platforms as ZML's hardware compatibility grows. zml-smi combines the functionality of nvidia-smi and nvtop into a single cross-platform utility that requires minimal dependencies—only device drivers and GLIBC—making it lightweight and easy to deploy.
The tool offers comprehensive monitoring capabilities including GPU utilization, temperature, power draw, memory usage, and detailed process-level metrics across all supported platforms. zml-smi displays host-level system information such as CPU model, memory usage, and load averages, while also providing device-specific metrics tailored to each hardware type. Notably, the tool implements innovative engineering solutions, such as intercepting file system calls for AMD GPU driver compatibility, to ensure seamless operation across diverse hardware ecosystems without requiring external installations or patches.
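The article does not publish zml-smi's internals, but on Linux the device-level metrics it describes (temperature, power draw) are commonly exposed for AMD GPUs through the hwmon sysfs interface. The sketch below is a minimal, hypothetical illustration of how a monitor could poll and scale such readings; the paths and file names are standard hwmon conventions, not confirmed zml-smi behavior.

```python
from pathlib import Path

# Hypothetical sketch, not zml-smi's actual code: Linux hwmon exposes
# temperature in millidegrees Celsius and power in microwatts as plain
# integer text files, which a monitoring tool could poll each refresh.

def parse_hwmon_metrics(raw: dict) -> dict:
    """Convert raw hwmon readings (integer strings) into human units."""
    out = {}
    if "temp1_input" in raw:
        # hwmon reports millidegrees Celsius
        out["temp_c"] = int(raw["temp1_input"]) / 1000.0
    if "power1_average" in raw:
        # hwmon reports microwatts
        out["power_w"] = int(raw["power1_average"]) / 1_000_000.0
    return out

def read_hwmon(device_dir: str) -> dict:
    """Read metric files from a hwmon directory, e.g. /sys/class/hwmon/hwmon3."""
    raw = {}
    for name in ("temp1_input", "power1_average"):
        path = Path(device_dir) / name
        if path.exists():
            raw[name] = path.read_text().strip()
    return parse_hwmon_metrics(raw)
```

Keeping parsing separate from file I/O, as above, lets the same conversion logic back different vendor sources (hwmon, NVML, TPU telemetry) behind one interface.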
The tool also uses sandboxing and API-interception techniques to support the latest hardware models without requiring driver updates or system modifications.
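The interception idea can be made concrete with a small analogue. zml-smi reportedly intercepts native file system calls; the Python sketch below mimics the same pattern at a higher level by wrapping `builtins.open` so that reads of one hypothetical device path are served synthetic content. The path and its contents are invented for illustration and are not from the article.

```python
import builtins
import io

# Illustrative analogue only: the real tool intercepts calls natively.
# Here we wrap Python's open() so a hypothetical sysfs path that a newer
# driver might not provide is answered with stand-in content instead.

_real_open = builtins.open

REDIRECTS = {
    # hypothetical path -> synthetic contents to present to the caller
    "/sys/class/drm/card0/device/fake_metric": "12345\n",
}

def intercepting_open(file, mode="r", *args, **kwargs):
    key = str(file)
    if key in REDIRECTS and "r" in mode and "b" not in mode:
        # Serve synthetic content; StringIO supports the file protocol.
        return io.StringIO(REDIRECTS[key])
    # Everything else passes through to the real open() untouched.
    return _real_open(file, mode, *args, **kwargs)

builtins.open = intercepting_open
```

After installing the wrapper, `open("/sys/class/drm/card0/device/fake_metric").read()` returns the synthetic value while all other paths behave normally; a native equivalent would do the same via syscall or dynamic-linker interception.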
Editorial Opinion
zml-smi addresses a significant pain point in the AI infrastructure ecosystem by providing a unified monitoring solution across fragmented hardware platforms. As AI workloads increasingly leverage diverse accelerators beyond NVIDIA GPUs, having a single tool that works seamlessly across NVIDIA, AMD, Google, and AWS hardware is valuable for operations teams and researchers. The technical approach—particularly the creative sandboxing solution for AMD drivers—demonstrates thoughtful engineering that prioritizes ease of deployment and minimal system impact.