TL;DR
- Built into PyTorch as `torch.profiler` (introduced in PyTorch 1.8.1, March 2021), replacing the older `torch.autograd.profiler` API. It is the standard way to capture CPU and CUDA operator timings inside a PyTorch programme.
- Captures per-operator CPU and GPU time, memory allocations, FLOPs, stack traces, and module hierarchy. Exports Chrome trace JSON and TensorBoard logs; the same traces can be analysed with Holistic Trace Analysis (HTA).
- Wraps Kineto (PyTorch's profiling library), which on NVIDIA GPUs sits on top of CUPTI, NVIDIA's profiling tools interface. It does not drive Nsight Systems itself; the two are combined by running the programme under Nsight Systems with NVTX ranges enabled, which adds system-wide capture alongside operator-level labels.
- The right starting point for operator-level questions ('which layer is slow?', 'where is memory growing?'); use Nsight Systems for system-level questions and Nsight Compute for SASS-level questions.
What torch.profiler Does
`torch.profiler.profile()` is a context manager that records the PyTorch operators and CUDA kernels executed inside it. For each operator it captures CPU time and GPU time by default; input shapes (`record_shapes=True`), the Python call site (`with_stack=True`), estimated FLOPs for supported operators (`with_flops=True`), and memory allocated (`profile_memory=True`) are opt-in. The output can be summarised as a table, exported to TensorBoard for interactive exploration, or written as a Chrome trace JSON viewable in any Chromium browser at `chrome://tracing`.
The library is built on Kineto, which sits on top of CUPTI on NVIDIA hardware. The same API works on CPU-only runs and on CUDA, and other accelerator backends (Intel XPU, AMD ROCm, Apple MPS) plug in to varying degrees. GPU detail is deepest on NVIDIA, where CUPTI supplies kernel, memcpy, and runtime-API activity records.
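As a minimal sketch of the summary-table path, the snippet below profiles a toy workload on CPU only, so it runs without a GPU. The `Linear` layer and tensor sizes are placeholders chosen for illustration, not part of any real model.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical toy workload: one linear layer applied to a random batch.
model = torch.nn.Linear(256, 128)
inputs = torch.randn(32, 256)

# CPU-only capture; add ProfilerActivity.CUDA when a GPU is available.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        model(inputs)

# key_averages() aggregates events into one row per operator name.
averages = prof.key_averages()
op_names = {evt.key for evt in averages}

print(averages.table(sort_by="cpu_time_total", row_limit=10))
```

The `op_names` set is only there to make the result easy to inspect programmatically; in interactive use the printed table is usually enough.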
Basic Usage
import torch
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=2, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./tb_logs"),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(loader):
        train_step(batch)
        prof.step()  # tell the profiler each iteration boundary
        if step >= 10:
            break

# View in TensorBoard:
# tensorboard --logdir=./tb_logs
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

Schedule and Overhead
Profiling is not free. CUPTI-backed CUDA profiling adds latency to every kernel launch; recording stack traces adds Python overhead; memory profiling adds bookkeeping. The `schedule(wait=N, warmup=N, active=N, repeat=N)` callable lets you skip the first iterations (loader warmup, autotune), warm up the profiler, capture a fixed window, and stop. The standard recipe captures 3-5 steady-state iterations rather than an entire training job.
For inference profiling, similar logic applies — profile a handful of representative requests, not production traffic.
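Do not leave `profile()` enabled in production training loops. The CUPTI overhead is small per kernel, but it compounds across the thousands of kernels launched in every step, and recorded events accumulate in memory for as long as the profiler is actively recording.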
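To make the schedule concrete, here is a simplified pure-Python model of how a `wait`/`warmup`/`active`/`repeat` schedule assigns a phase to each step. It is an illustrative sketch of the documented behaviour, not PyTorch's implementation; the names `Phase` and `phase_for_step` are hypothetical, and the real callable returns `torch.profiler.ProfilerAction` values.

```python
from enum import Enum


class Phase(Enum):
    """Hypothetical stand-in for torch.profiler.ProfilerAction."""
    NONE = "none"
    WARMUP = "warmup"
    RECORD = "record"
    RECORD_AND_SAVE = "record_and_save"


def phase_for_step(step, wait, warmup, active, repeat=0):
    """Return the phase for a zero-based step index.

    One cycle is wait + warmup + active steps. With repeat > 0 the
    profiler goes idle once that many cycles have completed.
    """
    cycle = wait + warmup + active
    if repeat > 0 and step >= repeat * cycle:
        return Phase.NONE
    position = step % cycle
    if position < wait:
        return Phase.NONE
    if position < wait + warmup:
        return Phase.WARMUP
    # The last active step of a cycle also triggers on_trace_ready.
    if position < cycle - 1:
        return Phase.RECORD
    return Phase.RECORD_AND_SAVE


# Same settings as the Basic Usage example.
phases = [
    phase_for_step(step, wait=1, warmup=2, active=3, repeat=1).value
    for step in range(8)
]
for step, phase in enumerate(phases):
    print(step, phase)
```

Under this model only steps 3 to 5 are recorded, so the loop in Basic Usage needs to call `prof.step()` at least six times before a trace is written.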
What to Look For
The default table view sorted by `cuda_time_total` answers 'which operators are eating the GPU?'. For an LLM workload the top of the list is almost always GEMM (`aten::mm`, `aten::addmm`), attention (`aten::scaled_dot_product_attention`), and AllReduce (`nccl:all_reduce`), which is healthy. Anomalies that show up next include unexpected `aten::copy_` ranges (silent dtype conversions, non-contiguous tensors), `aten::item` or `aten::_local_scalar_dense` followed by `cudaStreamSynchronize` (manual `.item()` calls inside the loop), or host-to-device `cudaMemcpyAsync` dominating when you expected weights to be resident.
The TensorBoard plugin overlays a per-step timeline, GPU utilisation summary, memory profile, and module-hierarchy view that the raw table cannot show.
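The anomaly hunt described above can be partly automated by scanning the aggregated events for a watch list of operator names. The sketch below assumes a CPU-only toy loop; `SUSPECT_OPS` and `find_suspects` are hypothetical names, and the watch list itself is an assumption that should be adapted to the workload.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Assumed watch list: operators that often signal hidden syncs or copies.
SUSPECT_OPS = {
    "aten::item",
    "aten::_local_scalar_dense",
    "aten::copy_",
    "aten::to",
}


def find_suspects(prof):
    """Return {operator name: call count} for watched operators."""
    return {
        evt.key: evt.count
        for evt in prof.key_averages()
        if evt.key in SUSPECT_OPS
    }


weights = torch.randn(64, 64)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    running_loss = 0.0
    for _ in range(5):
        loss = (torch.randn(8, 64) @ weights).sum()
        # .item() reads the scalar back to Python on every iteration;
        # on a GPU this forces a synchronisation.
        running_loss += loss.item()

suspects = find_suspects(prof)
print(suspects)
```

A call count that scales with the number of training steps is usually the signal to look for; accumulating the loss as a tensor and calling `.item()` once per logging interval avoids the per-step read-back.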
Memory Profiling
Setting `profile_memory=True` records every CUDA allocation and free, attributable to the operator and stack frame that triggered it. The resulting memory timeline answers questions like 'when does the peak memory happen?' and 'which forward operator is responsible for the activation that the backward pass holds?'. This is the right tool for diagnosing CUDA OOMs that survive obvious fixes.
PyTorch 2.x also exposes a richer memory snapshot mechanism (`torch.cuda.memory._record_memory_history()`) that produces a visualisation of the entire allocator state over time — complementary to the profiler's per-operator view.
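A small CPU-only sketch of the per-operator memory view follows. It assumes the allocator reports CPU allocations to the profiler when `profile_memory=True`; on a GPU run the analogous sort key would be the CUDA memory column. The tensor sizes are arbitrary.

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU],
    profile_memory=True,
    record_shapes=True,
) as prof:
    x = torch.randn(1024, 1024)  # 1024 * 1024 float32 values, about 4 MiB
    y = x @ x                    # allocates a second tensor of the same size

averages = prof.key_averages()

# Rank operators by memory they allocated themselves (children excluded).
print(averages.table(sort_by="self_cpu_memory_usage", row_limit=10))

top = max(averages, key=lambda evt: evt.self_cpu_memory_usage)
print("largest self allocation:", top.key, top.self_cpu_memory_usage, "bytes")
```

Sorting by the self column rather than the total column attributes memory to the operator that actually called the allocator instead of to its parents.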
Exporting and Sharing
Three export paths cover most needs:
- `prof.export_chrome_trace('trace.json')` — open in `chrome://tracing` or Perfetto. Shareable as a single JSON.
- `tensorboard_trace_handler('./logs')` — TensorBoard plugin with timeline + summary tabs.
- Holistic Trace Analysis (HTA, `pip install HolisticTraceAnalysis`) — Meta's open-source tool that ingests profiler traces and reports communication-vs-compute overlap, idle time, and frequent kernel patterns across many ranks.
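Because the Chrome trace is plain JSON, quick questions can be answered with the standard library alone. The sketch below sums the duration of complete events (`"ph": "X"`) per name. `SAMPLE_TRACE` is a hand-written, hypothetical stand-in for a real export, and the category names in it are assumptions, since the categories PyTorch writes can differ between versions.

```python
import json
from collections import defaultdict


def total_duration_by_name(trace, category=None):
    """Sum 'dur' (microseconds) of complete events, grouped by event name.

    Returns (name, total) pairs sorted by total duration, largest first.
    """
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        if event.get("ph") != "X":  # skip metadata and instant events
            continue
        if category is not None and event.get("cat") != category:
            continue
        totals[event["name"]] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical miniature trace in the Chrome Trace Event format.
SAMPLE_TRACE = json.loads("""
{
  "traceEvents": [
    {"ph": "M", "name": "process_name", "pid": 1, "args": {"name": "python"}},
    {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "ts": 0, "dur": 120, "pid": 1, "tid": 1},
    {"ph": "X", "cat": "cpu_op", "name": "aten::copy_", "ts": 130, "dur": 50, "pid": 1, "tid": 1},
    {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "ts": 200, "dur": 80, "pid": 1, "tid": 1},
    {"ph": "X", "cat": "kernel", "name": "sgemm_kernel", "ts": 10, "dur": 300, "pid": 0, "tid": 7}
  ]
}
""")

cpu_totals = total_duration_by_name(SAMPLE_TRACE, category="cpu_op")
all_totals = total_duration_by_name(SAMPLE_TRACE)
print(cpu_totals)
```

For a real capture, replace `SAMPLE_TRACE` with the result of `json.load()` on the file written by `prof.export_chrome_trace('trace.json')`.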
Relationship to Nsight
PyTorch Profiler and Nsight Systems are complementary. `torch.profiler` does not emit NVTX itself, but wrapping the workload in `torch.autograd.profiler.emit_nvtx()` (or adding manual `torch.cuda.nvtx` ranges) makes PyTorch emit NVTX ranges named after its operators, so an Nsight Systems capture of the same workload labels kernels with the framework-level operator that launched them. The standard performance workflow on a hard problem is: run torch.profiler to identify the slow operator class, then run Nsight Systems with NVTX to see why those operators are not overlapping or are starving the GPU, then run Nsight Compute on the specific kernel for low-level instruction-mix analysis.