Grafana

TL;DR

Open-source visualisation and analytics platform from Grafana Labs, originally a fork of Kibana 3 in 2014. The de facto front end for Prometheus and the wider observability stack.
Pluggable data-source model: queries are translated by per-source plugins, so a single dashboard can mix Prometheus, Loki, Tempo, Elasticsearch, PostgreSQL, BigQuery, CloudWatch, and dozens more.
Core OSS is licensed AGPLv3 (re-licensed from Apache 2.0 in 2021); Grafana Labs sells a managed cloud and an enterprise edition on top.
Standard companion to Prometheus on most production GPU clusters: ready-made dashboards for DCGM, vLLM, KServe, and Kubernetes are published by the upstream projects or their communities.

What Grafana Is

Grafana is a web application that renders dashboards composed of panels. Each panel issues a query against a data source, transforms the result, and visualises it as a time series, table, gauge, stat, heatmap, or geo-map. Dashboards are stored as JSON, so they can be versioned in git and provisioned automatically alongside the rest of the cluster's configuration.
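To make the JSON storage model concrete, the following is a minimal sketch of a one-panel dashboard built in Python. The field names (`uid`, `title`, `panels`, `targets`, `gridPos`) follow Grafana's dashboard JSON model; the UID `gpu-overview`, the data source UID `prometheus`, and the query are illustrative assumptions rather than values from any real cluster.

```python
import json


def make_dashboard(title: str, uid: str, expr: str) -> dict:
    """Build a one-panel dashboard in the shape of Grafana's dashboard JSON model."""
    panel = {
        "id": 1,
        "type": "timeseries",
        "title": "GPU utilisation",
        # Assumed data source UID; it must match a data source defined in Grafana.
        "datasource": {"type": "prometheus", "uid": "prometheus"},
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 0},
        "targets": [{"refId": "A", "expr": expr}],
    }
    return {
        "uid": uid,
        "title": title,
        "panels": [panel],
        "time": {"from": "now-6h", "to": "now"},
        "refresh": "30s",
    }


dashboard = make_dashboard("GPU overview", "gpu-overview", "avg(DCGM_FI_DEV_GPU_UTIL)")
# Sorted keys keep git diffs stable between regenerations.
dashboard_json = json.dumps(dashboard, indent=2, sort_keys=True)
print(dashboard_json)
```

Because the output is plain JSON, it can be committed to the same repository as the cluster configuration and reviewed like any other change.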

It does not store telemetry itself. Grafana is purely a query and visualisation layer: all metrics, logs, and traces continue to live in their respective backends. That separation is what makes Grafana portable across stacks and survivable across data-source migrations.

Data Sources

Grafana supports dozens of data sources through a plugin API. The ones that matter on AI infrastructure are:

Source	Signal	Typical use
Prometheus	Metrics	GPU, CPU, network, inference throughput
Loki	Logs	kubelet, vLLM logs, training stdout
Tempo / Jaeger	Traces	Request flow through a serving stack
OpenTelemetry (OTLP)	All three	Unified ingest into the backends above; not itself a queryable store
PostgreSQL / BigQuery	SQL	Billing, FinOps, business metrics
CloudWatch / Azure Monitor	Cloud metrics	Hybrid deployments
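As a rough illustration of how the first three rows are wired up, the sketch below builds request bodies for Grafana's data source HTTP API (`POST /api/datasources`). The plugin type strings `prometheus`, `loki`, and `tempo` are the built-in identifiers; the service URLs are hypothetical in-cluster addresses and would need replacing. Nothing is sent over the network here.

```python
import json

# Hypothetical in-cluster service URLs; replace with real endpoints.
BACKENDS = {
    "Prometheus": ("prometheus", "http://prometheus.monitoring.svc:9090"),
    "Loki": ("loki", "http://loki.monitoring.svc:3100"),
    "Tempo": ("tempo", "http://tempo.monitoring.svc:3200"),
}


def datasource_payload(name: str, ds_type: str, url: str, default: bool = False) -> dict:
    """Body for Grafana's data source HTTP API (POST /api/datasources)."""
    return {
        "name": name,
        "type": ds_type,
        "url": url,
        "access": "proxy",  # Grafana's backend issues the queries, not the browser
        "isDefault": default,
    }


payloads = [
    datasource_payload(name, ds_type, url, default=(ds_type == "prometheus"))
    for name, (ds_type, url) in BACKENDS.items()
]
print(json.dumps(payloads, indent=2))
```

The same fields map onto a provisioning file, which is usually the better fit when data sources are managed alongside the rest of the cluster's configuration.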

Dashboards for AI Clusters

Most teams do not build dashboards from scratch. The standard starting point is a small set of upstream dashboards plus a few cluster-specific overlays:

NVIDIA DCGM Exporter Dashboard (grafana.com/dashboards/12239) — per-GPU utilisation, memory, power, NVLink.
Kubernetes / Compute Resources / Cluster — node CPU, memory, pod state, courtesy of the kube-prometheus-stack.
vLLM dashboard (shipped in the vLLM repo) — request rate, e2e latency, tokens per second, GPU cache utilisation.
KServe Inference Service — model-server replicas, request volume, autoscaler decisions.
InfiniBand or RoCEv2 fabric — link health, retransmits, congestion, courtesy of the Network Operator.
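One way to load an upstream dashboard into a cluster's Grafana is the dashboard HTTP API (`POST /api/dashboards/db`). The sketch below only prepares the request body: it clears the numeric `id` so the target instance assigns its own, and sets `overwrite` so that re-imports update in place. The `exported` dictionary and the folder UID `gpu` are stand-ins for illustration, not a real export.

```python
import copy


def import_payload(exported: dict, folder_uid: str = "") -> dict:
    """Wrap an exported dashboard for POST /api/dashboards/db."""
    dashboard = copy.deepcopy(exported)  # leave the caller's copy untouched
    dashboard["id"] = None  # let the target Grafana assign its own numeric id
    payload = {
        "dashboard": dashboard,
        "overwrite": True,
        "message": "Imported upstream dashboard",
    }
    if folder_uid:
        payload["folderUid"] = folder_uid
    return payload


# Stand-in for a downloaded export; a real one carries panels and more metadata.
exported = {"id": 42, "uid": "dcgm-exporter", "title": "NVIDIA DCGM Exporter Dashboard", "panels": []}
payload = import_payload(exported, folder_uid="gpu")
print(payload["dashboard"]["title"])
```

Exports downloaded from grafana.com may also reference placeholder data sources that need mapping to local ones; that step is not covered by this sketch.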

Alerting

Grafana ships its own alerting subsystem (Grafana Alerting), which can either coexist with or replace Prometheus's Alertmanager. A unified alerting model lets the same UI write rules against any data source — including logs and traces — and route to the same destinations as Alertmanager. For pure-Prometheus shops, leaving alerts in Alertmanager and using Grafana only for visualisation is the simpler operational split.

Pick one alerting engine per environment. Splitting alerts between Prometheus rules and Grafana Alerting is a common cause of duplicated pages and silenced-but-still-firing incidents, because a silence applied in one engine does not carry over to the other.
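Where both engines are in use during a migration, a simple consistency check can catch rules defined twice. The sketch below assumes the alert names have already been extracted from each system into plain lists; the names shown are hypothetical, and matching on name alone is a deliberate simplification.

```python
def duplicated_alerts(prometheus_rules: list[str], grafana_rules: list[str]) -> list[str]:
    """Return alert names defined in both engines, compared case-insensitively."""
    grafana_names = {name.strip().lower() for name in grafana_rules}
    return sorted(
        {name for name in prometheus_rules if name.strip().lower() in grafana_names}
    )


# Hypothetical rule names for illustration.
prometheus_rules = ["GPUTemperatureHigh", "NodeDown", "InferenceLatencyHigh"]
grafana_rules = ["gputemperaturehigh", "LogErrorBurst"]

overlap = duplicated_alerts(prometheus_rules, grafana_rules)
print(overlap)
```

An empty result does not prove the two rule sets are disjoint, since the same condition can be alerted on under different names.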

Variables and Templating

Dashboards parameterise queries with template variables — selectors at the top of the page that drive every panel below. The common pattern on multi-tenant or multi-cluster Grafana is a cascade of variables: cluster → namespace → node → GPU, each populated by a `label_values()` query against Prometheus.

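```promql
# Variable definitions (Query-type variables under Dashboard settings > Variables).
# The "$name =" prefix is shorthand for the variable name, not PromQL syntax.
$cluster   = label_values(DCGM_FI_DEV_GPU_TEMP, cluster)
$node      = label_values(DCGM_FI_DEV_GPU_TEMP{cluster="$cluster"}, node)
$gpu       = label_values(DCGM_FI_DEV_GPU_TEMP{cluster="$cluster", node="$node"}, gpu)

# Panel query using all three.
# Switch = to =~ for any variable that allows multiple selections.
avg(DCGM_FI_PROF_SM_OCCUPANCY{cluster="$cluster", node="$node", gpu="$gpu"})
```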
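The substitution step can be pictured with a small Python sketch. This is a simplified illustration of the behaviour, not Grafana's implementation: single selections are inserted as-is, and multiple selections are joined into a regex alternation, which is why multi-value variables need the `=~` matcher. The cluster name `prod-a` is hypothetical.

```python
import re


def interpolate(query: str, variables: dict[str, list[str]]) -> str:
    """Substitute $name placeholders; multi-value selections become a regex alternation."""

    def render(match: re.Match) -> str:
        name = match.group(1)
        values = variables.get(name)
        if not values:
            return match.group(0)  # leave unknown variables untouched
        if len(values) == 1:
            return values[0]
        return "(" + "|".join(re.escape(v) for v in values) + ")"

    return re.sub(r"\$(\w+)", render, query)


query = 'avg(DCGM_FI_PROF_SM_OCCUPANCY{cluster="$cluster", gpu=~"$gpu"})'
rendered = interpolate(query, {"cluster": ["prod-a"], "gpu": ["0", "1"]})
print(rendered)
```

With the selections above, the `gpu` matcher expands to `gpu=~"(0|1)"`, so one panel covers both devices.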

Licensing

Grafana OSS moved from Apache 2.0 to AGPLv3 in April 2021. For most users this changes nothing: running unmodified Grafana internally, even inside a SaaS company, creates no source-disclosure obligation. The licence attaches obligations only when a modified version of Grafana is distributed or offered to users as a network service. Grafana Cloud and Grafana Enterprise are commercial offerings with additional features (reporting, RBAC, enterprise data sources).

Where Grafana Fits

Grafana is the operator's window into the cluster. In a complete AI observability stack it sits on top of Prometheus (metrics), Loki or OpenSearch (logs), Tempo or Jaeger (traces), and LLM-specific tools (Langfuse, Phoenix, Helicone) for product-level telemetry. Linked dashboards — click a high-latency panel, drill into the trace, jump to the logs — are the value Grafana adds beyond raw PromQL.

References

Grafana Documentation · Grafana Labs
Grafana on GitHub · GitHub
NVIDIA DCGM Dashboard · Grafana Dashboards
