TL;DR
- Open-source (MIT) LLM observability and engineering platform from Langfuse GmbH, first released in 2023. Available self-hosted or as a managed cloud service (langfuse.com).
- Four core capabilities: nested tracing of LLM calls, evaluation runs against datasets, versioned prompt management, and usage / cost analytics.
- Native OpenTelemetry support — applications instrumented with OTel + the OpenInference conventions stream traces directly to Langfuse without a separate SDK. The Langfuse SDKs (Python, JS) add LLM-specific helpers on top.
- Integrates upstream with LangChain, LlamaIndex, LiteLLM, OpenAI SDK, Anthropic SDK, Vercel AI SDK, and DSPy. Designed for engineers shipping LLM features, not infrastructure operators.
What Langfuse Tracks
Langfuse models LLM workloads as nested traces. A user request becomes a trace; the orchestration steps (retrieval, prompt construction, LLM call, tool use, response parsing) become spans; the actual LLM completions become generations with prompt, completion, model, latency, and token usage attached. The result is a structured record of every interaction your application had with an LLM, queryable and replayable.
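The nesting described above can be pictured with a small data model. This is a minimal sketch for illustration only: the class names `Trace`, `Span`, and `Generation` and the helper `total_tokens` are hypothetical and are not the Langfuse SDK's own types.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Generation:
    """One LLM completion with the metadata the text lists."""
    model: str
    prompt: str
    completion: str
    latency_ms: float
    input_tokens: int
    output_tokens: int


@dataclass
class Span:
    """One orchestration step; may hold child spans and generations."""
    name: str
    children: list[Span] = field(default_factory=list)
    generations: list[Generation] = field(default_factory=list)


@dataclass
class Trace:
    """One user request, holding its top-level spans."""
    name: str
    spans: list[Span] = field(default_factory=list)


def total_tokens(node: Trace | Span) -> int:
    """Sum input and output tokens over every generation under a node."""
    if isinstance(node, Trace):
        return sum(total_tokens(span) for span in node.spans)
    own = sum(g.input_tokens + g.output_tokens for g in node.generations)
    return own + sum(total_tokens(child) for child in node.children)


# Hypothetical trace for one request: a retrieval step, then one LLM call.
trace = Trace(
    name="answer_question",
    spans=[
        Span(name="retrieval"),
        Span(
            name="llm_call",
            generations=[
                Generation(
                    model="gpt-4o-mini",
                    prompt="Docs: ...\n\nQuestion: ...",
                    completion="...",
                    latency_ms=420.0,
                    input_tokens=120,
                    output_tokens=30,
                )
            ],
        ),
    ],
)
print(total_tokens(trace))  # 150
```

Rolling usage up the tree like this is the kind of aggregation that per-trace cost and token figures rely on.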
On top of traces it layers four product surfaces: a session view that groups related traces, such as the turns of one user's conversation; a dataset and evaluation runner for offline quality work; a prompt registry with versioning and A/B testing; and analytics dashboards for cost, latency, and quality trends over time.
Python Integration
The lightest-weight integration is the OpenAI drop-in. Replace `import openai` with `from langfuse.openai import openai` and every call is automatically traced with prompt, completion, model, latency, and token usage.

from langfuse.openai import openai  # drop-in wrapper that traces every OpenAI call
# SDK v2 import path; in SDK v3 the decorator is imported with `from langfuse import observe`.
from langfuse.decorators import observe


@observe()  # records each call and nests the LLM calls made inside it
def answer_question(question: str, docs: list[str]) -> str:
    completion = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided documents."},
            {"role": "user", "content": f"Docs: {docs}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content


# Each call to answer_question() now produces a Langfuse trace
# with the nested OpenAI generation, prompt, completion, tokens, and cost.

Evaluation
Langfuse supports three evaluation patterns. LLM-as-judge runs a configurable evaluator prompt against every production trace and writes a score (relevance, factuality, toxicity). Code-based scores let you attach arbitrary functions (exact match, BLEU, regex check). User feedback hooks (thumbs up/down, explicit ratings) flow through the same scoring API.
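To make the "same scoring API" idea concrete, here is a minimal sketch in which code-based checks and user feedback all produce the same record shape. The `Score` class and the three helper functions are hypothetical names chosen for illustration; they are not part of the Langfuse SDK, and an LLM-as-judge evaluator would simply be another producer of the same record.

```python
import re
from dataclasses import dataclass


@dataclass
class Score:
    """Hypothetical score record attached to one trace."""
    trace_id: str
    name: str
    value: float
    source: str  # "code", "llm_judge", or "user"


def exact_match(trace_id: str, output: str, expected: str) -> Score:
    """Code-based score: 1.0 when output equals the expected text."""
    value = 1.0 if output.strip() == expected.strip() else 0.0
    return Score(trace_id, "exact_match", value, "code")


def regex_check(trace_id: str, output: str, pattern: str) -> Score:
    """Code-based score: 1.0 when the pattern occurs anywhere in output."""
    value = 1.0 if re.search(pattern, output) else 0.0
    return Score(trace_id, "regex_check", value, "code")


def thumbs_feedback(trace_id: str, thumbs_up: bool) -> Score:
    """User feedback mapped onto the same record shape."""
    return Score(trace_id, "user_feedback", 1.0 if thumbs_up else 0.0, "user")


scores = [
    exact_match("trace-1", "Paris ", "Paris"),
    regex_check("trace-1", "Order #12345 shipped", r"#\d{5}"),
    thumbs_feedback("trace-1", thumbs_up=False),
]
```

Keeping one record shape means dashboards and comparisons do not need to care where a score came from.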
Datasets group representative inputs (and optionally expected outputs) for offline runs. An experiment executes the current application against the dataset and writes a labelled run that can be compared against previous versions — the basis of any non-vibes-based LLM iteration loop.
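The offline loop can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: `DatasetItem`, `ExperimentRun`, `run_experiment`, and the two stand-in applications are hypothetical and do not call Langfuse; they only show the shape of "run the app over a dataset, store a labelled run, compare with the previous one".

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class DatasetItem:
    input: str
    expected_output: str


@dataclass
class ExperimentRun:
    label: str
    item_scores: list[float]

    @property
    def mean_score(self) -> float:
        return mean(self.item_scores)


def exact_match_score(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_experiment(
    label: str,
    dataset: list[DatasetItem],
    app: Callable[[str], str],
    scorer: Callable[[str, str], float] = exact_match_score,
) -> ExperimentRun:
    """Execute the app on every dataset item and keep one score per item."""
    item_scores = [scorer(app(item.input), item.expected_output) for item in dataset]
    return ExperimentRun(label, item_scores)


dataset = [
    DatasetItem("France", "Paris"),
    DatasetItem("Japan", "Tokyo"),
    DatasetItem("Kenya", "Nairobi"),
]

KNOWN_CAPITALS = {"France": "Paris", "Japan": "Tokyo"}


def app_v1(country: str) -> str:
    return "Paris"  # stand-in for the previous application version


def app_v2(country: str) -> str:
    return KNOWN_CAPITALS.get(country, "unknown")  # stand-in for the current version


baseline = run_experiment("v1", dataset, app_v1)
candidate = run_experiment("v2", dataset, app_v2)
improved = candidate.mean_score > baseline.mean_score
```

A real dataset would be far larger than three items; with samples this small a difference in mean score says very little.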
Prompt Management
Prompts live in Langfuse as versioned records with labels (`production`, `staging`, `experiment-v3`). The SDK fetches by label, caches locally, and reports which version was used as part of every trace. The combination — prompt versioning plus run-level association — is what makes A/B testing of prompts a non-archaeological exercise.
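The fetch-by-label, cache-locally, report-the-version flow can be sketched with an in-memory stand-in. `PromptRegistry` and `CachingPromptClient` are hypothetical classes written for this example, not the Langfuse SDK; the 60-second TTL is likewise an arbitrary assumption.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: int
    template: str


class PromptRegistry:
    """Hypothetical in-memory stand-in for a remote prompt store."""

    def __init__(self):
        self._versions = {}  # (name, version) -> PromptVersion
        self._labels = {}    # (name, label) -> version
        self.fetch_count = 0

    def publish(self, name, template, labels=()):
        """Store a new version and point the given labels at it."""
        version = 1 + max((v for (n, v) in self._versions if n == name), default=0)
        self._versions[(name, version)] = PromptVersion(name, version, template)
        for label in labels:
            self._labels[(name, label)] = version
        return version

    def fetch(self, name, label):
        self.fetch_count += 1  # counts simulated round trips to the store
        return self._versions[(name, self._labels[(name, label)])]


class CachingPromptClient:
    """Resolves a label to a prompt version and caches it for ttl_seconds."""

    def __init__(self, registry, ttl_seconds=60.0, clock=time.monotonic):
        self._registry = registry
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # (name, label) -> (expires_at, PromptVersion)

    def get(self, name, label="production"):
        key = (name, label)
        now = self._clock()
        cached = self._cache.get(key)
        if cached and cached[0] > now:
            return cached[1]
        prompt = self._registry.fetch(name, label)
        self._cache[key] = (now + self._ttl, prompt)
        return prompt


registry = PromptRegistry()
registry.publish("qa", "Answer briefly: {question}", labels=["production"])
registry.publish("qa", "Answer using only the docs: {question}", labels=["staging"])

client = CachingPromptClient(registry)
prompt = client.get("qa", label="production")
client.get("qa", label="production")  # second call is served from the local cache

# The resolved version travels with the trace so runs can be grouped by it later.
trace_metadata = {"prompt_name": prompt.name, "prompt_version": prompt.version}
```

Moving the `production` label to a new version changes what the application fetches once the cache entry expires, without a redeploy.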
Wire prompt versions into evaluation runs. The point of versioning is to answer 'did prompt v17 beat v16 on the eval dataset?' — and Langfuse's prompt and evaluation registries are joined for exactly this.
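Answering the v16-versus-v17 question amounts to grouping evaluation scores by the prompt version recorded on each run. A minimal sketch, assuming hypothetical records that carry a `prompt_version` and a `score` field (the field names and values are made up for illustration):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical evaluation records: one row per dataset item and prompt version.
records = [
    {"prompt_version": 16, "score": 1.0},
    {"prompt_version": 16, "score": 0.0},
    {"prompt_version": 16, "score": 0.0},
    {"prompt_version": 17, "score": 1.0},
    {"prompt_version": 17, "score": 1.0},
    {"prompt_version": 17, "score": 0.0},
]


def mean_score_by_version(rows):
    """Group scores by prompt version and average each group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["prompt_version"]].append(row["score"])
    return {version: mean(values) for version, values in grouped.items()}


summary = mean_score_by_version(records)
winner = max(summary, key=summary.get)  # version with the highest mean score
```

This only works if every evaluated run records which prompt version it used, which is why the version is attached to the trace in the first place.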
Self-Hosting
Self-hosted Langfuse runs on PostgreSQL, ClickHouse, Redis, and S3-compatible object storage. A Docker Compose file covers small deployments; a Helm chart covers Kubernetes. For most teams the managed cloud is cheaper than running it themselves until trace volume crosses tens of millions per month or sovereignty constraints force on-prem.
- PostgreSQL: application state, projects, datasets, prompts.
- ClickHouse: high-volume trace and observation storage.
- Redis: async job queue (LLM-as-judge evaluations, batch ingest).
- S3 / MinIO: large objects (images and other media attachments, exports).
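Because all four backing services must be configured before the application starts, a small preflight check can catch a missing one early. This is a sketch under stated assumptions: the environment variable names below are assumptions and should be verified against the self-hosting guide for the version being deployed, and `missing_components` is a hypothetical helper, not something Langfuse ships.

```python
import os

# Assumed variable names, one per backing service; verify against the
# self-hosting guide before relying on them.
REQUIRED_SETTINGS = {
    "PostgreSQL": "DATABASE_URL",
    "ClickHouse": "CLICKHOUSE_URL",
    "Redis": "REDIS_CONNECTION_STRING",
    "S3-compatible storage": "LANGFUSE_S3_EVENT_UPLOAD_BUCKET",
}


def missing_components(env):
    """Return the components whose setting is absent or empty in env."""
    return [
        component
        for component, variable in REQUIRED_SETTINGS.items()
        if not env.get(variable)
    ]


# Example with a partially configured environment (values are placeholders).
example_env = {
    "DATABASE_URL": "postgresql://langfuse:secret@localhost:5432/langfuse",
    "CLICKHOUSE_URL": "http://localhost:8123",
}
example_missing = missing_components(example_env)

# The same check against the real process environment.
process_missing = missing_components(os.environ)
```

In the example, `example_missing` lists Redis and the S3-compatible storage, since neither has a value in `example_env`.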
Langfuse vs Other LLM Observability
Langfuse, Helicone, and Phoenix overlap heavily — all three trace LLM calls, evaluate quality, and visualise cost. The differences are emphasis. Langfuse leans hardest into prompt management and engineering workflows. Helicone is the lightest to adopt because it works as a proxy. Phoenix is the strongest on offline experimentation, retrieval evaluation, and ML model observability beyond LLMs.
All three can coexist with general-purpose observability (OpenTelemetry, Prometheus, Grafana) — they answer LLM-product questions that the infrastructure layer is the wrong tool for.