Unit Testing LLMs with DeepEval

10 min read · Aug 6, 2024

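Over the past year I have worked with a range of LLMs (OpenAI, Claude, PaLM, Gemini, and others), and their performance has impressed me. As AI advances rapidly and LLMs grow more complex, a reliable testing framework becomes essential: one that helps us maintain prompt quality and ensures users get the best results. Recently, after trying RAGAS, a tool for evaluating RAG pipelines, I discovered DeepEval (https://github.com/confident-ai/deepeval), an LLM testing framework that has changed the way we approach prompt quality assurance.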

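DeepEval: A Comprehensive LLM Testing Framework
DeepEval is an open-source framework designed specifically for testing the quality of LLM output. It offers a simple, intuitive way to "unit test" LLM outputs, much as developers use Pytest for traditional software testing. With DeepEval you can create test cases, define metrics, and evaluate the performance of your LLM application.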

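One of DeepEval's main strengths is its broad collection of plug-and-use metrics: more than 14 LLM evaluation metrics, each backed by research. They cover a wide range of use cases and let you evaluate different aspects of LLM performance, such as answer relevancy, faithfulness, and hallucination. DeepEval also lets you customize metrics to your specific needs, so you can evaluate your LLM application thoroughly.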

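Code Example: Evaluating LLM Output with DeepEval
Last weekend I spent some time putting together an example of DeepEval unit tests that work with OpenAI GPT-3.5 and Claude Haiku (running on AWS Bedrock). In this example I want to test whether each model can summarize a short piece of text, and evaluate the response with the following metrics: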

Answer relevancy (AnswerRelevancyMetric)
Summarization (SummarizationMetric)
Latency (LatencyMetric)

The first part loads the dependencies and sets up the mock data to test with.

import asyncio
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, SummarizationMetric, LatencyMetric
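from .deepeval_test_utils import send
# LanguageModelType and get_summary are used in the tests but are not shown in this post;
# they are assumed to be defined elsewhere in the same project.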

mock_data = {"input": "The 'coverage score' is calculated as the percentage of assessment questions for which both the summary and the original document provide a 'yes' answer. This method ensures that the summary not only includes key information from the original text but also accurately represents it. A higher coverage score indicates a more comprehensive and faithful summary, signifying that the summary effectively encapsulates the crucial points and details from the original content.", 
"expected_output": "The coverage score quantifies how well a summary captures and accurately represents key information from the original text, with a higher score indicating greater comprehensiveness."}

The next set of Python functions builds the model-specific prompt and returns it together with the configuration for that LLM.

def get_summary_prompt(input: str, model_type: LanguageModelType):
    if model_type == LanguageModelType.GPT_35:
        return _get_openai_summary_prompt(input)
    else:
        return _get_claude_summary_prompt(input)


def _get_openai_summary_prompt(input: str):
    config = {}
    config["top_p"] = 1
    config["max_tokens"] = 1000
    config["temperature"] = 1
    config["model"] = "gpt-3.5-turbo-16k"

    print("input", input)
    prompt = [
        {
            "role": "system",
            "content": "You are a knowledge base AI. Please summarize the following text:",
        },
        {
            "role": "user",
            "content": input
        }
    ]
    return (prompt, config)


def _get_claude_summary_prompt(input: str):
    prompt = f"""Human: Please summarize the following text: {input}
             Assistant: """
    config = {}
    config["temperature"] = 1
    config["top_p"] = 0.999
    config["top_k"] = 350
    config["model_id"] = "anthropic.claude-3-haiku-20240307-v1:0"
    return (prompt, config)
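
The tests in this post call a get_summary helper and a LanguageModelType enum that are not shown. A minimal sketch of what they might look like follows. Everything in it is an assumption made for illustration: the enum values, the signature of send (the real one lives in deepeval_test_utils), and the stubbed response. The prompt builder is reduced to a one-liner so the block runs on its own.

```python
import asyncio
from enum import Enum


class LanguageModelType(Enum):
    """Hypothetical enum; the post only shows the member names GPT_35 and CLAUDE."""
    GPT_35 = "gpt-3.5"
    CLAUDE = "claude"


def get_summary_prompt(text: str, model_type: LanguageModelType):
    # Simplified stand-in for the prompt builders in this post.
    prompt = f"Please summarize the following text: {text}"
    return (prompt, {"model": model_type.value})


async def send(prompt, config) -> str:
    # Stub for the project's `send` helper. A real version would call
    # OpenAI or AWS Bedrock; this one only returns a placeholder string.
    await asyncio.sleep(0)
    return f"[stubbed {config['model']} summary]"


async def get_summary(model_type: LanguageModelType, text: str = "some text"):
    """Build the prompt for the model, send it, and return (prompt, response)."""
    prompt, config = get_summary_prompt(text, model_type)
    response = await send(prompt, config)
    return prompt, response


prompt, result = asyncio.run(get_summary(LanguageModelType.CLAUDE))
print(result)
```

Returning the prompt alongside the response matches how the tests unpack the result as prompt, result.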

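The next block is the OpenAI test. It calls get_summary, which retrieves the prompt for the given model, sends it to the corresponding LLM (OpenAI or Claude), and returns the response.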

def test_openai_summary():
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    prompt, result = loop.run_until_complete(
        get_summary(LanguageModelType.GPT_35)
    )
    answer_relevancy_metric = AnswerRelevancyMetric(
        threshold=0.5, model="gpt-3.5-turbo", include_reason=True
    )
    summary_metric = SummarizationMetric(
        threshold=0.5,
        model="gpt-4",
        assessment_questions=[
            "Is the coverage score based on a percentage of 'yes' answers?",
            "Does the score ensure the summary's accuracy with the source?",
            "Does a higher score mean a more comprehensive summary?"
        ]
    )

    latency_metric = LatencyMetric(max_latency=7.0)

    test_case = LLMTestCase(
        input=mock_data["input"],
        actual_output=str(result),
        expected_output=mock_data["expected_output"],
        latency=6.0
    )
    assert_test(test_case, metrics=[answer_relevancy_metric, summary_metric, latency_metric])
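
The assessment_questions passed to SummarizationMetric relate to the coverage score described in the sample text: the share of questions that both the original document and the summary answer with "yes". A minimal sketch of that arithmetic follows. It implements the description in the sample text, not necessarily DeepEval's internal computation, and the yes/no answers are hand-written placeholders (in practice an LLM judge would produce them).

```python
def coverage_score(source_answers: list[bool], summary_answers: list[bool]) -> float:
    """Fraction of questions answered 'yes' by both the source and the summary."""
    if len(source_answers) != len(summary_answers):
        raise ValueError("both answer lists must cover the same questions")
    if not source_answers:
        return 0.0
    both_yes = sum(1 for src, summ in zip(source_answers, summary_answers) if src and summ)
    return both_yes / len(source_answers)


# Placeholder answers for three assessment questions (True means "yes").
source_answers = [True, True, True]
summary_answers = [True, False, True]

score = coverage_score(source_answers, summary_answers)
print(f"{score:.2f}")  # prints 0.67
```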

The next block runs the same summarization test against the Claude LLM.

def test_claude_summary():
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    prompt, result = loop.run_until_complete(
        get_summary(LanguageModelType.CLAUDE)
    )
    answer_relevancy_metric = AnswerRelevancyMetric(
        threshold=0.5, model="gpt-4", include_reason=True
    )

    summary_metric = SummarizationMetric(
        threshold=0.5,
        model="gpt-4",
        assessment_questions=[
            "Is the coverage score based on a percentage of 'yes' answers?",
            "Does the score ensure the summary's accuracy with the source?",
            "Does a higher score mean a more comprehensive summary?"
        ]
    )

    latency_metric = LatencyMetric(max_latency=7.0)

    test_case = LLMTestCase(
        input=str(prompt),
        actual_output=str(result),
        expected_output=mock_data["expected_output"],
        latency=6.0
    )
    assert_test(test_case, metrics=[answer_relevancy_metric, summary_metric, latency_metric])
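
Both tests pass a fixed latency=6.0 to the test case, so the latency check never reflects the real call. One way to measure it instead is to time the call with time.perf_counter, sketched here. The fake_llm_call coroutine is a hypothetical stand-in for a real model call, not part of the original project.

```python
import asyncio
import time


async def fake_llm_call(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; sleeps to simulate work.
    await asyncio.sleep(0.05)
    return "summary of: " + prompt


async def timed_call(prompt: str):
    """Return (response, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    response = await fake_llm_call(prompt)
    elapsed = time.perf_counter() - start
    return response, elapsed


response, latency = asyncio.run(timed_call("coverage score text"))
print(f"latency: {latency:.3f}s")
```

The measured value could then be passed as the latency argument of the test case in place of the constant.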

To run the tests, use the following command in your terminal.

deepeval test run test_sample.py

Remember to start the command with deepeval test run rather than python.

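In this example, we use the LLMTestCase class to create a test case, specifying the input prompt and the actual output generated by the LLM application. We then define three metrics that evaluate the output's relevancy, summarization quality, and latency. Finally, we use the assert_test function to run the evaluation and make sure the output meets the specified criteria.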
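
Conceptually, a test like this passes only if every metric clears its threshold. The toy sketch below illustrates that idea in plain Python. It is not DeepEval's implementation of assert_test; the MetricResult class, the assert_all helper, and the scores are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Toy record of one metric's score and the threshold it must meet."""
    name: str
    score: float
    threshold: float
    higher_is_better: bool = True

    def passed(self) -> bool:
        # Quality scores must reach the threshold; latency must stay under it.
        if self.higher_is_better:
            return self.score >= self.threshold
        return self.score <= self.threshold


def assert_all(results: list[MetricResult]) -> None:
    """Raise AssertionError naming every metric that missed its threshold."""
    failures = [r.name for r in results if not r.passed()]
    assert not failures, f"metrics failed: {failures}"


# Placeholder scores; the thresholds mirror the values used in the tests.
results = [
    MetricResult("answer_relevancy", score=0.8, threshold=0.5),
    MetricResult("summarization", score=0.6, threshold=0.5),
    MetricResult("latency", score=6.0, threshold=7.0, higher_is_better=False),
]
assert_all(results)
print("all metrics passed")
```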

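DeepEval has been a game changer for how our team ensures the quality of our LLM applications. By providing a comprehensive testing framework with a wide range of metrics and synthetic dataset generation, DeepEval has streamlined our testing process and given us the confidence to deploy new versions of our LLM applications with ease.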

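For application developers and engineers who are new to AI and LLMs, DeepEval offers a simple, accessible way to evaluate the performance of your LLM application. With DeepEval you can focus on building high-quality API integrations while making sure your LLM prompts stay effective and reliable.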

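I strongly recommend exploring DeepEval and adding it to your LLM testing workflow. With its broad feature set and user-friendly interface, DeepEval is a valuable tool for anyone working with LLMs, whatever their level of expertise in AI and data science.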


Written by KevinLuo