TL;DR
- Bundled with the NVIDIA driver. Single binary that talks to the NVIDIA Management Library (NVML) and prints utilisation, memory, power, processes, ECC counters, NVLink status, and MIG geometry.
- Designed for interactive troubleshooting and scripting, not for time-series monitoring. For continuous production telemetry, scrape DCGM Exporter with Prometheus.
- Supports machine-readable output via `--query-gpu=...` plus `--format=csv`, making it the standard primitive for shell scripts, CI smoke tests, and ad-hoc dashboards.
- `nvidia-smi` is a thin wrapper around NVML, the same library DCGM uses for device-level fields, so the two tools read the same underlying counters.
What nvidia-smi Does
`nvidia-smi` (NVIDIA System Management Interface) is the CLI that ships with the NVIDIA driver. Running it with no arguments prints a one-page summary of every GPU on the host: name, driver version and the highest CUDA version that driver supports, persistence mode, current utilisation, memory used and total, power draw, temperature, fan speed, and the PID and memory footprint of every process currently holding a CUDA context.
Underneath it calls NVML, the same C library DCGM uses. Anything NVML exposes — clocks, ECC counters, PCIe link width, NVLink topology, MIG instance configuration, virtualisation mode, persistent error flags — is reachable through some `nvidia-smi` subcommand.

    # One-shot summary
    nvidia-smi

    # Continuous refresh, like top
    nvidia-smi -l 1          # every 1 second
    nvidia-smi dmon -s pucm  # rolling table: power/temperature, utilisation, clocks, memory

    # Machine-readable, perfect for scripts
    nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total,power.draw \
      --format=csv,noheader,nounits

Common Subcommands
- `nvidia-smi dmon`: device monitor; prints a rolling table of counters (power, utilisation, memory, clocks, temperature) at a configurable interval.
- `nvidia-smi pmon`: process monitor; shows per-PID GPU usage, including SM and memory activity.
- `nvidia-smi topo -m`: prints the matrix of NVLink, NVSwitch, and PCIe connections between every pair of GPUs.
- `nvidia-smi mig -lgip` / `-cgi` / `-cci`: list GPU instance profiles, create GPU instances, and create compute instances on MIG-capable Ampere, Hopper, and Blackwell GPUs.
- `nvidia-smi nvlink --status`: per-link state and link speed.
- `nvidia-smi --query-remapped-rows`: row-remapper status, the canary for impending DRAM failure on A100/H100/B200.
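The rolling table that `dmon` prints can be reduced to a single number inside a script. The sketch below assumes the usual `dmon` layout, in which a header row beginning with `# gpu` names the columns. The helper name `dmon_column_mean` is hypothetical, and because the column set varies between driver versions, the column is looked up by name rather than by position.

```bash
# dmon_column_mean NAME: read `nvidia-smi dmon` output on stdin and print the
# mean of the column called NAME (e.g. sm, mem, pwr), or "n/a" if it is absent.
# Hypothetical helper; assumes the header row starts with "# gpu".
# Usage: nvidia-smi dmon -s u -c 10 | dmon_column_mean sm
dmon_column_mean() {
  awk -v col="$1" '
    /^#/ {
      # Header row: "# gpu sm mem ..."; field 1 is "#", so the matching
      # data column sits one position to the left of the header field.
      if ($2 == "gpu") for (i = 2; i <= NF; i++) if ($i == col) idx = i - 1
      next
    }
    # Skip placeholders such as "-" that dmon prints for unsupported counters.
    idx && $idx ~ /^[0-9]+(\.[0-9]+)?$/ { sum += $idx; n++ }
    END { if (n) printf "%.1f\n", sum / n; else print "n/a" }
  '
}
```

Averaging across all rows mixes GPUs on a multi-GPU host; adding `-i 0` to the `dmon` call restricts the sample to one device.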
Reading the Default Output
Six fields from the default view drive most decisions:
| Field | What it really means |
|---|---|
| GPU-Util | Percent of the sample window in which at least one kernel was running; not occupancy |
| Memory-Usage | Framebuffer memory in use on the device, including model weights and KV cache held by CUDA contexts |
| Pwr:Usage/Cap | Current board power draw vs the configured power limit |
| Persistence-M | Whether the driver stays loaded when no client is attached (set to On in production) |
| Compute M. | Default, Exclusive_Process, or Prohibited; controls how many processes may create compute contexts on the GPU |
| Volatile Uncorr. ECC | Uncorrected ECC errors since the last driver load; corrected and aggregate counts appear under `nvidia-smi -q -d ECC` |
`GPU-Util = 100 %` does not mean the GPU is well used. A single small kernel that runs continuously on one SM still reports 100 %. Cross-reference with SM occupancy or Tensor Core activity from DCGM before declaring the GPU saturated.
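To make the gap concrete: one busy SM on a GPU with 132 SMs (the H100 SXM count) keeps GPU-Util at 100 % while SM activity is roughly 1/132, under 1 %. A minimal sketch of that cross-check follows. The helper name `util_verdict` and both thresholds are illustrative assumptions, not NVIDIA guidance, and the SM-activity ratio is expected to come from DCGM (for example the `DCGM_FI_PROF_SM_ACTIVE` profiling field), not from `nvidia-smi`.

```bash
# util_verdict GPU_UTIL_PERCENT SM_ACTIVE_RATIO
# Hypothetical helper: combines GPU-Util (0-100, from nvidia-smi) with an
# SM-activity ratio (0.0-1.0, from DCGM). The 90 % and 0.5 cut-offs are
# illustrative assumptions only.
util_verdict() {
  awk -v util="$1" -v sm="$2" 'BEGIN {
    if (util < 90)     print "not busy"
    else if (sm < 0.5) print "busy but underused"
    else               print "saturated"
  }'
}

# One SM busy out of 132: GPU-Util reads 100, SM activity is about 0.008.
util_verdict 100 0.008
```

Sensible thresholds depend on the workload, so treat the verdict strings as prompts for a closer look rather than as a pass/fail signal.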
Scripting Patterns
Because `nvidia-smi --query-gpu` emits CSV with stable column names, it is the easiest way to embed GPU checks in CI, deployment scripts, or Slurm prologues.
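One detail worth handling when parsing this CSV is that fields are separated by a comma followed by a space, GPU names contain spaces, and unsupported values appear as `[N/A]`. The sketch below splits on `, ` and skips non-numeric rows; the helper name `gpu_mem_percent` is hypothetical, and the column order is an assumption fixed by the query shown in the usage comment.

```bash
# gpu_mem_percent: read "index, name, memory.used, memory.total" rows on stdin
# and print "<index> <percent of memory used>" per GPU. Hypothetical helper.
# Usage:
#   nvidia-smi --query-gpu=index,name,memory.used,memory.total \
#     --format=csv,noheader,nounits | gpu_mem_percent
gpu_mem_percent() {
  awk -F', ' '
    # Skip rows whose memory fields are not plain integers, e.g. "[N/A]".
    $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $4 > 0 {
      printf "%s %.1f\n", $1, 100 * $3 / $4
    }
  '
}
```

Passing `nounits` keeps the memory columns as bare MiB integers, which is what the numeric check relies on.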

    # Fail a CI job if any GPU has an uncorrectable ECC error.
    # GPUs without ECC report "[N/A]", which awk counts as 0.
    errors=$(nvidia-smi --query-gpu=ecc.errors.uncorrected.volatile.total \
      --format=csv,noheader,nounits \
      | awk '{s+=$1} END {print s+0}')
    if [[ "$errors" -gt 0 ]]; then
      echo "Uncorrectable ECC errors detected: $errors" >&2
      exit 1
    fi

    # Quick health gate before launching a training job
    nvidia-smi --query-gpu=temperature.gpu,power.draw,memory.free \
      --format=csv,noheader

Limits and Successors
`nvidia-smi` is the right tool for human inspection and one-shot scripts. It is the wrong tool for time-series monitoring because each invocation forks a process, reinitialises NVML, and offers no aggregation. For continuous metrics, scrape DCGM Exporter into Prometheus. For per-kernel performance analysis, use Nsight Systems. For LLM-level telemetry (tokens, latency, cost), use Langfuse, Helicone, or Phoenix.
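For a short, bounded sampling window, the per-invocation cost can be avoided by running one loop-mode process and doing the aggregation by hand, which also shows how little `nvidia-smi` offers here. This is a sketch, not a substitute for DCGM; the helper name `per_gpu_peak` is hypothetical and it assumes two-column `index, value` rows produced with `noheader,nounits`.

```bash
# per_gpu_peak: read "index, value" CSV rows on stdin and print the maximum
# value seen for each GPU index, sorted by index. Hypothetical helper.
# Usage (one nvidia-smi process, sampled every second for 60 s):
#   timeout 60 nvidia-smi --query-gpu=index,power.draw \
#     --format=csv,noheader,nounits -l 1 | per_gpu_peak
per_gpu_peak() {
  awk -F', ' '
    # Ignore non-numeric readings such as "[N/A]".
    $2 ~ /^[0-9]+(\.[0-9]+)?$/ {
      if (!($1 in peak) || $2 + 0 > peak[$1]) peak[$1] = $2 + 0
    }
    END { for (i in peak) printf "%s %.1f\n", i, peak[i] }
  ' | sort -n
}
```

Nothing is retained once the pipeline exits, which is the gap a DCGM Exporter and Prometheus setup fills.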
`dcgmi`, the DCGM CLI, is installed separately as part of the DCGM package rather than with the driver. It exposes additional fields (profiling counters, policy management, fabric manager state) that `nvidia-smi` does not.