TL;DR
- Open-source visualisation and analytics platform from Grafana Labs, originally a fork of Kibana 3 in 2014. The de facto front end for Prometheus and the wider observability stack.
- Pluggable data-source model: queries are translated by per-source plugins, so a single dashboard can mix Prometheus, Loki, Tempo, Elasticsearch, PostgreSQL, BigQuery, CloudWatch, and dozens more.
- Core OSS is licensed AGPLv3 (re-licensed from Apache 2.0 in 2021); Grafana Labs sells a managed cloud and an enterprise edition on top.
- Common companion to Prometheus on production GPU clusters: ready-made dashboards for DCGM, vLLM, KServe, and Kubernetes are available from the upstream projects.
What Grafana Is
Grafana is a web application that renders dashboards composed of panels. Each panel issues a query against a data source, transforms the result, and visualises it as a time series, table, gauge, stat, heatmap, or geo-map. Dashboards are stored as JSON, versioned in git, and provisioned automatically alongside the rest of the cluster's configuration.
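To make the "dashboards are stored as JSON" point concrete, here is a minimal sketch that assembles a two-panel dashboard document in Python. The top-level `title`, `panels`, and `time` keys and the per-panel `type`, `gridPos`, and `targets` keys follow the dashboard JSON model as commonly exported by Grafana; the data-source UID, panel titles, and helper names are hypothetical, and a real export carries more fields than shown.

```python
import json


def make_panel(panel_id, title, expr, x, y, w=12, h=8):
    """Return one time-series panel that runs a single PromQL expression."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        # Hypothetical UID; a real one comes from your own Grafana instance.
        "datasource": {"type": "prometheus", "uid": "example-prometheus"},
        # Position and size on the dashboard grid.
        "gridPos": {"x": x, "y": y, "w": w, "h": h},
        "targets": [{"refId": "A", "expr": expr}],
    }


def make_dashboard(title, panels):
    """Wrap panels in a minimal dashboard document."""
    return {
        "title": title,
        "tags": ["gpu", "example"],
        "time": {"from": "now-6h", "to": "now"},
        "panels": panels,
    }


dashboard = make_dashboard(
    "GPU overview (example)",
    [
        make_panel(1, "GPU temperature", "avg by (gpu) (DCGM_FI_DEV_GPU_TEMP)", 0, 0),
        make_panel(2, "SM occupancy", "avg(DCGM_FI_PROF_SM_OCCUPANCY)", 12, 0),
    ],
)

# Stable key order keeps git diffs small when the file is regenerated.
dashboard_json = json.dumps(dashboard, indent=2, sort_keys=True)
print(dashboard_json)
```

Generating the JSON from code rather than editing it by hand is one way to keep dashboards reviewable in git; exporting from the UI and committing the result works just as well.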
It does not store data itself. Grafana is purely a query and visualisation layer — all metrics, logs, and traces continue to live in their respective backends. That separation is what makes Grafana portable across stacks and survivable across data-source migrations.
Data Sources
Grafana supports dozens of data sources through a plugin API. The ones that matter on AI infrastructure are:
| Source | Signal | Typical use |
|---|---|---|
| Prometheus | Metrics | GPU, CPU, network, inference throughput |
| Loki | Logs | kubelet, vLLM logs, training stdout |
| Tempo / Jaeger | Traces | Request flow through a serving stack |
| OpenTelemetry | All three | OTLP-native unified ingest |
| PostgreSQL / BigQuery | SQL | Billing, FinOps, business metrics |
| CloudWatch / Azure Monitor | Cloud metrics | Hybrid deployments |
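Data sources like those in the table are usually declared in provisioning files rather than clicked together in the UI. The sketch below builds such a declaration for three of the backends. The `apiVersion`, `datasources`, `name`, `type`, `access`, `url`, and `isDefault` keys follow Grafana's data-source provisioning format as I understand it; the URLs are placeholders, the helper name is hypothetical, and the output is JSON for illustration (Grafana's provisioning files are normally YAML).

```python
import json

# (display name, plugin type ID, URL). The URLs are placeholders, not real endpoints.
BACKENDS = [
    ("Prometheus", "prometheus", "http://prometheus.example.internal:9090"),
    ("Loki", "loki", "http://loki.example.internal:3100"),
    ("Tempo", "tempo", "http://tempo.example.internal:3200"),
]


def provisioning_doc(backends, default="Prometheus"):
    """Build a data-source provisioning document with exactly one default source."""
    return {
        "apiVersion": 1,
        "datasources": [
            {
                "name": name,
                "type": type_id,
                "access": "proxy",  # queries go through the Grafana server
                "url": url,
                "isDefault": name == default,
            }
            for name, type_id, url in backends
        ],
    }


doc = provisioning_doc(BACKENDS)
print(json.dumps(doc, indent=2))
```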
Dashboards for AI Clusters
Most teams do not build dashboards from scratch. The standard starting point is a small set of upstream dashboards plus a few cluster-specific overlays:
- NVIDIA DCGM Exporter Dashboard (grafana.com/dashboards/12239) — per-GPU utilisation, memory, power, NVLink.
- Kubernetes / Compute Resources / Cluster — node CPU, memory, pod state, courtesy of the kube-prometheus-stack.
- vLLM dashboard (shipped in the vLLM repo) — request rate, e2e latency, tokens per second, GPU cache utilisation.
- KServe Inference Service — model-server replicas, request volume, autoscaler decisions.
- InfiniBand or RoCEv2 fabric — link health, retransmits, congestion, courtesy of the Network Operator.
Alerting
Grafana ships its own alerting subsystem (Grafana Alerting), which can either coexist with or replace Prometheus's Alertmanager. A unified alerting model lets the same UI write rules against any data source — including logs and traces — and route to the same destinations as Alertmanager. For pure-Prometheus shops, leaving alerts in Alertmanager and using Grafana only for visualisation is the simpler operational split.
Pick one alerting engine per environment. Splitting alerts between Prometheus rules and Grafana Alerting is the most common cause of duplicated pages and silenced-but-still-firing incidents.
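When migrating toward a single engine, a quick overlap check can flag rules that exist on both sides. The sketch below is a rough heuristic, not a Grafana or Prometheus API: it assumes you have already exported the alert names from each engine into plain lists, and it matches them by name only, ignoring case and surrounding whitespace. All rule names here are made up.

```python
def overlapping_alerts(prometheus_rules, grafana_rules):
    """Return Prometheus alert names that also appear in Grafana Alerting.

    Matching ignores case and surrounding whitespace; the result is sorted.
    """
    grafana_keys = {name.strip().lower() for name in grafana_rules}
    return sorted(
        name for name in prometheus_rules
        if name.strip().lower() in grafana_keys
    )


# Hypothetical exports from each engine.
prometheus_rules = ["GPUHighTemperature", "NodeDown", "KubePodCrashLooping"]
grafana_rules = ["gpuhightemperature", "InferenceLatencyHigh", "NodeDown "]

duplicates = overlapping_alerts(prometheus_rules, grafana_rules)
print(duplicates)  # ['GPUHighTemperature', 'NodeDown']
```

Name matching misses rules that alert on the same condition under different names, so treat an empty result as a starting point rather than proof of a clean split.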
Variables and Templating
Dashboards parameterise queries with template variables — selectors at the top of the page that drive every panel below. The common pattern on multi-tenant or multi-cluster Grafana is a cascade of variables: cluster → namespace → node → GPU, each populated by a `label_values()` query against Prometheus.
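The cascade is easier to see with a toy model. The sketch below imitates what a `label_values()` variable query returns, using an in-memory list of label sets instead of a real Prometheus server. The series data and the Python function are hypothetical stand-ins; Grafana evaluates the real query against the data source.

```python
# Hypothetical label sets, standing in for the series of one DCGM metric.
SERIES = [
    {"cluster": "east", "node": "gpu-node-1", "gpu": "0"},
    {"cluster": "east", "node": "gpu-node-1", "gpu": "1"},
    {"cluster": "east", "node": "gpu-node-2", "gpu": "0"},
    {"cluster": "west", "node": "gpu-node-9", "gpu": "0"},
]


def label_values(series, label, **matchers):
    """Distinct values of `label` among series whose labels equal every matcher."""
    return sorted({
        s[label]
        for s in series
        if label in s and all(s.get(k) == v for k, v in matchers.items())
    })


# Each selector narrows the options offered by the next one.
clusters = label_values(SERIES, "cluster")
nodes = label_values(SERIES, "node", cluster="east")
gpus = label_values(SERIES, "gpu", cluster="east", node="gpu-node-1")

print(clusters)  # ['east', 'west']
print(nodes)     # ['gpu-node-1', 'gpu-node-2']
print(gpus)      # ['0', '1']
```

Picking a different cluster changes the node list, which in turn changes the GPU list. That dependency is why the variables must be defined in cluster, node, GPU order.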
```
# Variable definitions
$cluster = label_values(DCGM_FI_DEV_GPU_TEMP, cluster)
$node = label_values(DCGM_FI_DEV_GPU_TEMP{cluster="$cluster"}, node)
$gpu = label_values(DCGM_FI_DEV_GPU_TEMP{cluster="$cluster", node="$node"}, gpu)
# Panel query using all three
avg(DCGM_FI_PROF_SM_OCCUPANCY{cluster="$cluster", node="$node", gpu="$gpu"})
```
Licensing
Grafana OSS moved from Apache 2.0 to AGPLv3 in April 2021. For most users this changes nothing: internal use, even by a SaaS, is fine. The licence's source-sharing obligations attach when a modified Grafana is distributed or offered to users as a network service. Grafana Cloud and Grafana Enterprise are commercial offerings with additional features (reporting, RBAC, enterprise data sources).
Where Grafana Fits
Grafana is the operator's window into the cluster. In a complete AI observability stack it sits on top of Prometheus (metrics), Loki or OpenSearch (logs), Tempo or Jaeger (traces), and LLM-specific tools (Langfuse, Phoenix, Helicone) for product-level telemetry. Linked dashboards — click a high-latency panel, drill into the trace, jump to the logs — are the value Grafana adds beyond raw PromQL.
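Drill-down links between dashboards typically carry the current selection in the URL. As a minimal sketch, the helper below builds a link to a second dashboard using the `/d/<uid>/<slug>` path and the `var-<name>`, `from`, and `to` query parameters that Grafana dashboard URLs accept. The base URL, dashboard UID, and function name are hypothetical.

```python
from urllib.parse import urlencode


def dashboard_link(base_url, uid, slug, variables, time_from="now-1h", time_to="now"):
    """Build a dashboard URL that pre-selects template variables and a time range."""
    params = [("from", time_from), ("to", time_to)]
    # Each template variable is passed as var-<name>=<value>.
    params += [(f"var-{name}", value) for name, value in variables.items()]
    return f"{base_url.rstrip('/')}/d/{uid}/{slug}?{urlencode(params)}"


link = dashboard_link(
    "https://grafana.example.internal/",  # placeholder host
    "gpu-node-detail",                    # hypothetical dashboard UID
    "gpu-node-detail",
    {"cluster": "east", "node": "gpu-node-1"},
)
print(link)
```

Inside Grafana itself, panel data links can fill these values from the clicked series, so the target dashboard opens already scoped to the cluster and node in question.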
References
- Grafana Documentation · Grafana Labs
- Grafana on GitHub · GitHub
- NVIDIA DCGM Dashboard · Grafana Dashboards