TL;DR
- Open-source Python evaluation framework by Confident AI, designed to feel like pytest — metrics are assertions, test cases are unit tests, and the framework integrates with CI runners out of the box.
- Ships with a broad catalogue of pre-built metrics: G-Eval (custom LLM-as-judge), faithfulness, answer relevancy, contextual precision/recall, hallucination, bias, toxicity, task completion, tool correctness.
- Designed for component testing during development and regression testing in CI; the companion hosted product (Confident AI) layers dataset management, comparison runs, and team workflows.
- Integrates with the major frameworks (LangChain, LlamaIndex, LangGraph) and the major model providers; metrics are model-agnostic and run against any callable.
The pytest Analogy#
DeepEval's pitch is that LLM evaluation should feel like the test pyramid most engineers already understand. You write test cases that bundle inputs, expected outputs, and metrics. You decorate them with `@pytest.mark.parametrize` if you want to iterate over a dataset. You run them with `pytest` or with DeepEval's own CLI. A failed metric raises an assertion error and breaks the build.
This makes evaluation legible to engineers who already do test-driven development. The metrics are richer than `assert actual == expected` — they are LLM-as-judge graders, semantic similarity scores, or programmatic checks — but the workflow is familiar.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
def test_answer_quality():
test_case = LLMTestCase(
input="What is paged attention?",
actual_output=my_rag_pipeline("What is paged attention?"),
retrieval_context=[paged_attention_doc],
)
relevance = AnswerRelevancyMetric(threshold=0.8)
faithfulness = FaithfulnessMetric(threshold=0.9)
assert_test(test_case, [relevance, faithfulness])Built-in Metrics#
- G-Eval — custom LLM-as-judge metric defined by a natural-language evaluation rubric and a list of evaluation steps.
- Faithfulness — does the answer make claims grounded in the retrieved context.
- Answer Relevancy — is the answer relevant to the question.
- Contextual Precision / Recall / Relevancy — did the retriever pick the right context.
- Hallucination — does the answer contradict the provided context.
- Bias / Toxicity — categorical safety metrics with LLM judges.
- Task Completion / Tool Correctness — agent-specific metrics for trajectory evaluation.
- Summarisation — coverage and alignment metrics for summarisation tasks.
G-Eval and Custom Metrics#
G-Eval is DeepEval's flexible LLM-as-judge metric. You define an evaluation rubric in natural language and a list of evaluation steps; G-Eval generates a chain-of-thought, scores the output, and returns a normalised score. It is the right tool when you need a metric specific to your domain ("is the answer in the right tone for our brand", "does the response include the required disclaimer") that no off-the-shelf metric covers.
Custom metrics can also be implemented as plain Python classes inheriting from `BaseMetric`. Programmatic metrics — regex checks, length constraints, schema validation — should be implemented this way rather than as LLM judges.
Datasets and Synthesisers#
DeepEval includes a dataset synthesiser that generates evaluation examples from a corpus of source documents — useful for bootstrapping a RAG evaluation set from your own knowledge base before you have production data. The quality of synthesised examples is good for coverage, less good for nuance; treat them as the floor of your dataset, not the ceiling.
Synthesised datasets are a fine starting point but should never be your only dataset. Production data and adversarial examples authored by humans are where the highest-signal test cases come from.
When to Pick DeepEval#
Pick DeepEval when you want an open-source, code-first evaluation framework that fits naturally into a pytest-based development workflow. Pick Ragas when your application is exclusively RAG and you want the academically grounded RAG metric set. Pick a hosted platform (LangSmith, Braintrust) when team workflows — annotation queues, comparison UIs, dataset governance — outweigh the value of staying open source.
References
- DeepEval Documentation · Confident AI
- DeepEval on GitHub · GitHub