Evaluating RAG Applications with RAGAs
RAGAs (Retrieval-Augmented Generation Assessment) is a framework (GitHub, Docs) that provides you with the necessary ingredients to help you evaluate your RAG pipeline on a component level.
Evaluation Data
What’s interesting about RAGAs is that it started out as a framework for “reference-free” evaluation [1]. That means, instead of having to rely on human-annotated ground truth labels in the evaluation dataset, RAGAs leverages LLMs under the hood to conduct the evaluations.
To evaluate the RAG pipeline, RAGAs expects the following information:
- `question`: The user query that is the input of the RAG pipeline. The input.
- `answer`: The generated answer from the RAG pipeline. The output.
- `contexts`: The contexts retrieved from the external knowledge source used to answer the `question`.
- `ground_truths`: The ground truth answer to the `question`. This is the only human-annotated information. This information is only required for the metric `context_recall` (see Evaluation Metrics).
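As a sketch, the four fields above can be assembled into a plain column-oriented dictionary with one entry per evaluation sample. The sample values below are invented for illustration; only the field names and shapes follow the expectations described above.

```python
# Column-oriented evaluation data in the shape RAGAs expects.
# All sample values are invented for illustration.
eval_data = {
    "question": [
        "What did the president say about inflation?",
    ],
    "answer": [
        "The president said inflation is the top economic priority.",
    ],
    # One *list* of retrieved context chunks per question.
    "contexts": [
        [
            "Inflation remains our top economic priority ...",
            "The administration announced new measures ...",
        ],
    ],
    # Human-annotated; only needed for the context_recall metric.
    "ground_truths": [
        ["The president called inflation the top economic priority."],
    ],
}

# Sanity checks: every column must have one entry per evaluation sample,
# and `contexts` must hold a list of chunks for each sample.
n_samples = len(eval_data["question"])
assert all(len(col) == n_samples for col in eval_data.values())
assert all(isinstance(ctx, list) for ctx in eval_data["contexts"])
```

With the Hugging Face `datasets` library installed, such a dictionary can be converted via `Dataset.from_dict(eval_data)` into the dataset object that RAGAs-style evaluation tooling typically consumes.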
Leveraging LLMs for reference-free evaluation is an active research topic. While using as little human-annotated data as possible makes it a cheaper and faster evaluation method, there is still some discussion about its shortcomings, such as bias [3]. However, some papers have already shown promising results [4]. For detailed information, see the “Related Work” section of the RAGAs [1] paper.
Note that the framework has expanded to provide metrics and paradigms that require ground truth labels (e.g., `context_recall` and ...