DeepEval

TL;DR

Open-source Python evaluation framework by Confident AI, designed to feel like pytest — metrics are assertions, test cases are unit tests, and the framework integrates with CI runners out of the box.
Ships with a broad catalogue of pre-built metrics: G-Eval (custom LLM-as-judge), faithfulness, answer relevancy, contextual precision/recall, hallucination, bias, toxicity, task completion, tool correctness.
Designed for component testing during development and regression testing in CI; the companion hosted product (Confident AI) layers dataset management, comparison runs, and team workflows.
Integrates with the major frameworks (LangChain, LlamaIndex, LangGraph) and the major model providers; metrics score the inputs and outputs recorded in a test case, so they apply to whichever model or pipeline produced them.

The pytest Analogy

DeepEval's pitch is that LLM evaluation should feel like the test pyramid most engineers already understand. You write test cases that bundle the input, the actual output, and optionally an expected output and retrieval context, then score them with one or more metrics. You decorate the test function with `@pytest.mark.parametrize` if you want to iterate over a dataset. You run the tests with `pytest` or with DeepEval's own CLI (`deepeval test run`). A metric that scores below its threshold raises an assertion error and breaks the build.

This makes evaluation legible to engineers who already do test-driven development. The metrics are richer than `assert actual == expected` — they are LLM-as-judge graders, semantic similarity scores, or programmatic checks — but the workflow is familiar.

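```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# `my_rag_pipeline` and `paged_attention_doc` are placeholders for your own
# RAG entry point and a source document string; define or import them first.

def test_answer_quality():
    question = "What is paged attention?"
    test_case = LLMTestCase(
        input=question,
        actual_output=my_rag_pipeline(question),
        retrieval_context=[paged_attention_doc],  # list of retrieved text chunks
    )
    # Each metric fails the test when its score falls below its threshold.
    relevance = AnswerRelevancyMetric(threshold=0.8)
    faithfulness = FaithfulnessMetric(threshold=0.9)
    assert_test(test_case, [relevance, faithfulness])
```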
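The "metrics are assertions" idea can be illustrated without any LLM calls. The sketch below is a dependency-free stand-in, not DeepEval's implementation: `Case`, `ThresholdMetric`, `check`, and the toy `keyword_overlap` scorer are all hypothetical names chosen for illustration. It shows only the mechanism, in which a score below its threshold becomes an `AssertionError` that a test runner reports as a failure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """Hypothetical stand-in for a test case: a question and the system's answer."""
    input: str
    actual_output: str


@dataclass
class ThresholdMetric:
    """Hypothetical stand-in for a metric: a scoring function plus a pass threshold."""
    name: str
    score_fn: Callable[[Case], float]
    threshold: float


def check(case: Case, metrics: List[ThresholdMetric]) -> None:
    """Raise AssertionError naming every metric that scores below its threshold."""
    failures = []
    for metric in metrics:
        score = metric.score_fn(case)
        if score < metric.threshold:
            failures.append(f"{metric.name}: {score:.2f} < {metric.threshold:.2f}")
    assert not failures, "; ".join(failures)


def keyword_overlap(case: Case) -> float:
    """Toy score: fraction of question words that reappear in the answer."""
    question_words = set(case.input.lower().split())
    answer_words = set(case.actual_output.lower().split())
    if not question_words:
        return 0.0
    return len(question_words & answer_words) / len(question_words)
```

In a real suite the scoring function would be an LLM judge or a programmatic check, and the test runner would surface the assertion message in the CI log.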

Built-in Metrics

G-Eval — custom LLM-as-judge metric defined by a natural-language evaluation rubric and a list of evaluation steps.
Faithfulness — does the answer make claims grounded in the retrieved context.
Answer Relevancy — is the answer relevant to the question.
Contextual Precision / Recall / Relevancy — did the retriever pick the right context.
Hallucination — does the answer contradict the provided context.
Bias / Toxicity — categorical safety metrics with LLM judges.
Task Completion / Tool Correctness — agent-specific metrics for trajectory evaluation.
Summarisation — coverage and alignment metrics for summarisation tasks.
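To make the retrieval metrics above more concrete, here is a minimal sketch of a rank-aware precision score (average precision). It assumes a judge has already labelled each retrieved chunk as relevant or not. The function name is hypothetical and DeepEval's exact formula may differ; the point is only that placing relevant chunks earlier in the ranking yields a higher score.

```python
from typing import Sequence


def rank_weighted_precision(relevant: Sequence[bool]) -> float:
    """Average precision over ranked retrieval results.

    relevant[k] is True when the chunk at rank k + 1 was judged relevant.
    Returns 0.0 when no chunk is relevant.
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision measured at this rank
    return total / hits if hits else 0.0


# The same two relevant chunks score higher when they are ranked earlier.
early = rank_weighted_precision([True, False, True])
late = rank_weighted_precision([False, True, True])
```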

G-Eval and Custom Metrics

G-Eval is DeepEval's flexible LLM-as-judge metric. You define an evaluation rubric in natural language and a list of evaluation steps; G-Eval generates a chain-of-thought, scores the output, and returns a score normalised to the 0 to 1 range. It is the right tool when you need a metric specific to your domain ("is the answer in the right tone for our brand", "does the response include the required disclaimer") that no off-the-shelf metric covers.
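The normalisation step can be sketched as follows. The G-Eval paper describes weighting each candidate score by the probability the judge model assigns to it and summing the results. The function below is an illustration of that idea under stated assumptions (a raw 0 to 10 scale and a hypothetical `token_probs` mapping from score to probability); it is not DeepEval's internal code.

```python
from typing import Mapping


def weighted_normalised_score(token_probs: Mapping[int, float], max_score: int = 10) -> float:
    """Probability-weighted judge score, rescaled to the 0-1 range.

    token_probs maps each candidate raw score to the probability the judge
    assigned to it; the probabilities are renormalised in case they do not sum to 1.
    """
    total_prob = sum(token_probs.values())
    if total_prob <= 0:
        raise ValueError("token_probs must carry positive probability mass")
    expected = sum(score * prob for score, prob in token_probs.items()) / total_prob
    return expected / max_score


# A judge that mostly says 8, with some weight on 7 and 9.
score = weighted_normalised_score({7: 0.2, 8: 0.5, 9: 0.3})
```

Weighting by probability gives a finer-grained score than taking the single most likely integer, which helps when many outputs would otherwise tie.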

Custom metrics can also be implemented as plain Python classes inheriting from `BaseMetric`. Programmatic metrics — regex checks, length constraints, schema validation — should be implemented this way rather than as LLM judges.
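As a sketch of the programmatic route, the class below checks for a required disclaimer with a regex. The metric name, default pattern, and threshold are hypothetical. The method set (`measure`, `a_measure`, `is_successful`, and a `__name__` property) follows the custom-metric pattern as I understand it from DeepEval's documentation, so verify it against the version you use. The `ImportError` fallback defines stand-in classes only so that the example runs without DeepEval installed.

```python
import re
from dataclasses import dataclass

try:
    from deepeval.metrics import BaseMetric
    from deepeval.test_case import LLMTestCase
except ImportError:
    # Stand-ins (not the real classes) so the sketch runs without DeepEval.
    class BaseMetric:
        pass

    @dataclass
    class LLMTestCase:
        input: str
        actual_output: str


class DisclaimerMetric(BaseMetric):
    """Hypothetical programmatic metric: passes when a required disclaimer appears."""

    def __init__(self, pattern: str = r"not financial advice", threshold: float = 1.0):
        self.pattern = re.compile(pattern, re.IGNORECASE)
        self.threshold = threshold
        self.score = None
        self.success = False

    def measure(self, test_case: LLMTestCase) -> float:
        found = bool(self.pattern.search(test_case.actual_output or ""))
        self.score = 1.0 if found else 0.0
        self.reason = "disclaimer found" if found else "disclaimer missing"
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Nothing to await here, so the async path reuses the synchronous check.
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return bool(self.success)

    @property
    def __name__(self):
        return "Disclaimer"
```

Because the check is deterministic, it costs nothing to run and never flakes, which is the main reason to prefer it over an LLM judge for this kind of constraint.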

Datasets and Synthesisers

DeepEval includes a dataset synthesiser that generates evaluation examples from a corpus of source documents — useful for bootstrapping a RAG evaluation set from your own knowledge base before you have production data. The quality of synthesised examples is good for coverage, less good for nuance; treat them as the floor of your dataset, not the ceiling.

Production data and adversarial examples authored by humans are where the highest-signal test cases come from, so synthesised examples should never be your only dataset.

When to Pick DeepEval

Pick DeepEval when you want an open-source, code-first evaluation framework that fits naturally into a pytest-based development workflow. Pick Ragas when your application is exclusively RAG and you want the academically grounded RAG metric set. Pick a hosted platform (LangSmith, Braintrust) when team workflows — annotation queues, comparison UIs, dataset governance — outweigh the value of staying open source.

References

DeepEval Documentation · Confident AI
DeepEval on GitHub · GitHub
