nvidia-smi

TL;DR

Bundled with the NVIDIA driver. Single binary that talks to the NVIDIA Management Library (NVML) and prints utilisation, memory, power, processes, ECC counters, NVLink status, and MIG geometry.
Designed for interactive troubleshooting and scripting, not for time-series monitoring. For continuous metrics and production telemetry, scrape DCGM Exporter into Prometheus.
Supports machine-readable output via `--query-gpu=...` plus `--format=csv`, making it the standard primitive for shell scripts, CI smoke tests, and ad-hoc dashboards.
`nvidia-smi` is a thin wrapper over NVML, the same library DCGM builds on, so the two tools read the same underlying counters, although their sampling intervals can differ.

What nvidia-smi Does

`nvidia-smi` (NVIDIA System Management Interface) is the CLI that ships with the NVIDIA driver. Running it with no arguments prints a one-page summary of every GPU on the host: name, driver and CUDA versions, persistence mode, current utilisation, memory used and total, power draw, temperature, fan speed, and the PID and memory footprint of every process currently holding a context on the GPU.

Underneath it calls NVML, the same C library DCGM uses. Most of what NVML exposes (clocks, ECC counters, PCIe link width, NVLink topology, MIG instance configuration, virtualisation mode, persistent error flags) is reachable through some `nvidia-smi` subcommand or query.

```bash
# One-shot summary
nvidia-smi

# Continuous refresh, like top
nvidia-smi -l 1            # reprint the summary every 1 second
nvidia-smi dmon -s pucm    # rolling table: power/temperature, utilisation, clocks, memory

# Machine-readable, well suited to scripts
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total,power.draw \
           --format=csv,noheader,nounits
```
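The CSV form is easy to post-process with standard tools. As a minimal sketch, assuming rows in the column order of the query above (index, name, utilization.gpu, memory.used, memory.total, power.draw) and a hypothetical helper name `gpu_mem_percent`, the following turns each row into a memory-usage percentage:

```bash
# Sketch: convert --query-gpu CSV rows into "gpu <index>: <pct>% memory used".
# Assumed columns (csv,noheader,nounits):
#   index, name, utilization.gpu, memory.used, memory.total, power.draw
# gpu_mem_percent is a hypothetical helper, not part of nvidia-smi.
gpu_mem_percent() {
  # ($5 + 0) forces a numeric test so rows with "[N/A]" totals are skipped.
  awk -F', *' '($5 + 0) > 0 {
    printf "gpu %s: %.1f%% memory used\n", $1, 100 * $4 / $5
  }'
}

# On a GPU host:
#   nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total,power.draw \
#              --format=csv,noheader,nounits | gpu_mem_percent
```

The helper reads from standard input, so it can be exercised with captured output on a machine that has no GPU.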

Common Subcommands

`nvidia-smi dmon`: device monitor; prints a rolling table of counters (power, utilisation, memory, clocks, temperature) at a configurable interval.
`nvidia-smi pmon`: process monitor; shows per-PID GPU usage, including SM and memory engine activity.
`nvidia-smi topo -m`: prints the matrix of NVLink, NVSwitch, and PCIe connections between every pair of GPUs.
`nvidia-smi mig -lgip` / `-cgi` / `-cci`: list GPU instance profiles, create GPU instances, and create compute instances on MIG-capable Ampere, Hopper, and Blackwell GPUs.
`nvidia-smi nvlink --status`: per-link state and link speed.
`nvidia-smi --query-remapped-rows`: row-remapper status (takes a field list and `--format=csv`, like `--query-gpu`); the canary for impending DRAM failure on A100/H100/B200.
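The row-remapper query lends itself to an automated check. The sketch below assumes the field names `gpu_bus_id`, `remapped_rows.correctable`, `remapped_rows.uncorrectable`, `remapped_rows.pending`, and `remapped_rows.failure`, and assumes the last two print as either `1`/`0` or `Yes`/`No` depending on driver version; confirm both with `nvidia-smi --help-query-remapped-rows` on the target host. `check_remapped_rows` is a hypothetical helper name.

```bash
# Sketch: flag GPUs whose row remapper reports a pending remap or a remap failure.
# Assumed columns (csv,noheader):
#   gpu_bus_id, remapped_rows.correctable, remapped_rows.uncorrectable,
#   remapped_rows.pending, remapped_rows.failure
# Assumption: pending/failure may appear as 1/0 or Yes/No.
check_remapped_rows() {
  awk -F', *' '
    function flagged(v) { return v == "1" || tolower(v) == "yes" }
    flagged($4) || flagged($5) {
      printf "%s: pending=%s failure=%s\n", $1, $4, $5
      bad = 1
    }
    END { exit bad ? 1 : 0 }   # non-zero exit if any GPU was flagged
  '
}

# On a GPU host:
#   nvidia-smi --query-remapped-rows=gpu_bus_id,remapped_rows.correctable,remapped_rows.uncorrectable,remapped_rows.pending,remapped_rows.failure \
#              --format=csv,noheader | check_remapped_rows
```

A pending remap usually clears after a GPU reset, whereas a remap failure is the stronger signal that the board needs attention, so a stricter variant might treat the two columns differently.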

Reading the Default Output

Six fields from the default view drive most decisions:

| Field | What it really means |
| --- | --- |
| GPU-Util | Percent of the sample window in which at least one kernel was running; not occupancy |
| Memory-Usage | Framebuffer reserved by CUDA contexts, including KV cache and weights |
| Pwr:Usage/Cap | Current board power draw vs the configured power limit |
| Persistence-M | Whether the driver stays resident between CUDA contexts (set to On in production) |
| Compute M. | Default, Exclusive_Process, or Prohibited; controls whether processes may share the GPU |
| ECC errors | Single-bit (corrected) and double-bit (uncorrected) error counters |

`GPU-Util = 100 %` does not mean the GPU is being well used. A single small kernel that keeps one SM busy for the whole sample window still reports 100 %. Always cross-reference with SM occupancy or Tensor Core activity from DCGM before declaring the GPU saturated.
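When DCGM is not at hand, a rough first pass is to compare utilisation with power draw: a GPU that reports high `GPU-Util` while drawing only a small fraction of its power limit is probably running light kernels. This is a heuristic assumed for illustration, not something `nvidia-smi` documents, and the thresholds and the helper name `flag_busy_but_light` are hypothetical.

```bash
# Sketch: list GPUs with high GPU-Util but low power draw relative to their limit.
# Assumed columns (csv,noheader,nounits):
#   index, utilization.gpu, power.draw, power.limit
# Thresholds are illustrative defaults, not NVIDIA guidance.
flag_busy_but_light() {
  util_min=${1:-90}      # GPU-Util at or above this counts as "busy"
  power_frac=${2:-0.5}   # draw below this fraction of the limit counts as "light"
  awk -F', *' -v u="$util_min" -v f="$power_frac" '
    ($4 + 0) > 0 && ($2 + 0) >= u && ($3 + 0) < f * $4 {
      printf "gpu %s: util=%s%% power=%sW of %sW\n", $1, $2, $3, $4
    }
  '
}

# On a GPU host:
#   nvidia-smi --query-gpu=index,utilization.gpu,power.draw,power.limit \
#              --format=csv,noheader,nounits | flag_busy_but_light 90 0.5
```

A flagged GPU is only a prompt to look closer; occupancy and Tensor Core counters from DCGM remain the authoritative check.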

Scripting Patterns

Because `nvidia-smi --query-gpu` emits CSV with stable column names, it is the easiest way to embed GPU checks in CI, deployment scripts, or Slurm prologues.

```bash
# Fail a CI job if any GPU has an uncorrectable ECC error
errors=$(nvidia-smi --query-gpu=ecc.errors.uncorrected.volatile.total \
                    --format=csv,noheader,nounits \
         | awk '{s+=$1} END {print s+0}')   # s+0 prints 0 when nothing numeric was read
if [[ "$errors" -gt 0 ]]; then
  echo "Uncorrectable ECC errors detected: $errors" >&2
  exit 1
fi

# Quick health snapshot before launching a training job
nvidia-smi --query-gpu=temperature.gpu,power.draw,memory.free \
           --format=csv,noheader
```
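The health query above only prints values; turning it into an actual gate means comparing them against limits. A minimal sketch, assuming the columns index, temperature.gpu, and memory.free with `nounits`, and with a hypothetical helper name and hypothetical default thresholds:

```bash
# Sketch: exit non-zero if any GPU is too hot or has too little free memory.
# Assumed columns (csv,noheader,nounits): index, temperature.gpu, memory.free (MiB)
# gpu_health_gate and its defaults are hypothetical; tune them per GPU model.
gpu_health_gate() {
  max_temp=${1:-85}        # ceiling in degrees C
  min_free_mib=${2:-1024}  # floor in MiB
  awk -F', *' -v t="$max_temp" -v m="$min_free_mib" '
    ($2 + 0) > t { printf "gpu %s: temperature %s C exceeds %s C\n", $1, $2, t; bad = 1 }
    ($3 + 0) < m { printf "gpu %s: only %s MiB free, need %s MiB\n", $1, $3, m; bad = 1 }
    END { exit bad ? 1 : 0 }
  '
}

# In a Slurm prologue or CI step:
#   nvidia-smi --query-gpu=index,temperature.gpu,memory.free \
#              --format=csv,noheader,nounits | gpu_health_gate 85 2048 || exit 1
```

Keeping the thresholds as arguments lets the same function serve different GPU models and job sizes.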

Limits and Successors

`nvidia-smi` is the right tool for human inspection and one-shot scripts. It is the wrong tool for time-series monitoring because each invocation forks a process, reinitialises NVML, and offers no aggregation. For continuous metrics, scrape DCGM Exporter into Prometheus. For per-kernel performance analysis, use Nsight Systems. For LLM-level telemetry (tokens, latency, cost), use Langfuse, Helicone, or Phoenix.
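For a short ad-hoc capture where deploying DCGM is not worthwhile, one workaround is to let a single `nvidia-smi` process loop with `-l` and aggregate its log afterwards. The sketch below assumes log rows of timestamp, index, and utilization.gpu; `mean_util_per_gpu` is a hypothetical helper and is no substitute for a real metrics pipeline.

```bash
# Sketch: per-GPU mean utilisation from a log of repeated --query-gpu samples.
# Assumed columns (csv,noheader,nounits): timestamp, index, utilization.gpu
mean_util_per_gpu() {
  awk -F', *' '
    { sum[$2] += $3; n[$2]++ }
    END {
      for (i in n)
        printf "gpu %s: mean util %.1f%% over %d samples\n", i, sum[i] / n[i], n[i]
    }
  ' | sort   # awk iteration order is unspecified, so sort for stable output
}

# One long-running process instead of one fork per sample:
#   nvidia-smi --query-gpu=timestamp,index,utilization.gpu \
#              --format=csv,noheader,nounits -l 5 >> gpu.log
#   mean_util_per_gpu < gpu.log
```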

`dcgmi`, the DCGM CLI, is installed separately as part of the DCGM package rather than with the driver. It exposes additional fields (profiling counters, policy management, fabric manager state) that `nvidia-smi` does not.

