Using LLMs for Synthetic Data Generation: The Definitive Guide

Filtering Synthetic Data

Before you begin evolving your newly generated datasets, run thorough quality checks so that you don't refine inputs that are inherently flawed. This step keeps you from wasting resources and ensures that your final dataset contains only high-quality goldens.

Filtering occurs at two stages of synthetic data generation: first during context generation, and then during the generation of synthetic inputs from those contexts.
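The two stages can be pictured as a small pipeline. The sketch below is illustrative only: `two_stage_filter`, `context_ok`, `make_inputs`, and `input_ok` are hypothetical names that do not come from the guide, and the quality checks are passed in as plain callables so that any judge can be plugged in.

```python
from typing import Callable, Dict, Iterable, List


def two_stage_filter(
    chunks: Iterable[str],
    context_ok: Callable[[str], bool],
    make_inputs: Callable[[str], List[str]],
    input_ok: Callable[[str, str], bool],
) -> List[Dict[str, str]]:
    """Filter once on contexts, then again on the inputs generated from them.

    All three callables are placeholders (assumptions) for whatever context
    judge, input generator, and input judge you actually use.
    """
    kept: List[Dict[str, str]] = []
    for context in chunks:
        # Stage 1: drop low-quality contexts before generating anything from them.
        if not context_ok(context):
            continue
        for synthetic_input in make_inputs(context):
            # Stage 2: drop low-quality inputs generated from a surviving context.
            if input_ok(context, synthetic_input):
                kept.append({"context": context, "input": synthetic_input})
    return kept
```

Filtering contexts first means no generation calls are spent on chunks that would be discarded anyway.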

Context Filtering

During context generation, you may randomly select a low-quality chunk. Knowledge bases often contain complex structures or excess whitespace that become unintelligible once the document is broken into chunks. Using LLMs as judges is a robust way to identify and eliminate these low-quality contexts.

Context Filtering Example

You can customize the criteria for evaluating and filtering out these contexts, but here are some foundational guidelines to consider:

  • Clarity: Evaluate how clear and understandable the information is.
  • Depth: Assess the level of detailed analysis and the presence of original insights.
  • Structure: Review the organization and logical progression of the content.
  • Relevance: Determine how pertinent the content is to the main topic.
  • Precision: Gauge the accuracy and attention to detail.
  • Novelty: Assess the uniqueness and originality of the content.
  • Conciseness: Evaluate the brevity and efficiency of the communication.
  • Impact: Judge the potential effect of the content on the audience.
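As a rough illustration of how these criteria might drive an LLM judge, the sketch below scores each context on every criterion and keeps only those whose average clears a threshold. Several details are assumptions rather than anything the guide specifies: the 0 to 1 scale, the averaging rule, the 0.5 threshold, the JSON reply format, and the `judge` callable, which stands in for whatever LLM client you use (it takes a prompt string and returns the model's text reply).

```python
import json
from typing import Callable, Dict, Iterable, List

# Criteria taken from the list above.
CRITERIA = [
    "clarity", "depth", "structure", "relevance",
    "precision", "novelty", "conciseness", "impact",
]


def build_prompt(context: str) -> str:
    """Ask the judge for one score per criterion, returned as JSON."""
    return (
        "Rate the following context on each criterion from 0 to 1.\n"
        f"Criteria: {', '.join(CRITERIA)}\n"
        "Reply with a JSON object mapping each criterion to its score.\n\n"
        f"Context:\n{context}"
    )


def score_context(context: str, judge: Callable[[str], str]) -> Dict[str, float]:
    """Return the numeric scores found in the judge's reply ({} if unusable)."""
    try:
        parsed = json.loads(judge(build_prompt(context)))
    except json.JSONDecodeError:
        return {}
    if not isinstance(parsed, dict):
        return {}
    scores: Dict[str, float] = {}
    for name in CRITERIA:
        value = parsed.get(name)
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            scores[name] = float(value)
    return scores


def filter_contexts(
    contexts: Iterable[str],
    judge: Callable[[str], str],
    threshold: float = 0.5,  # assumed cutoff; tune for your data
) -> List[str]:
    """Keep contexts whose average score across all criteria meets the threshold."""
    kept: List[str] = []
    for context in contexts:
        scores = score_context(context, judge)
        # Reject the context when the reply is unusable or a criterion is missing.
        if len(scores) != len(CRITERIA):
            continue
        if sum(scores.values()) / len(CRITERIA) >= threshold:
            kept.append(context)
    return kept
```

Treating an unparseable or incomplete reply as a rejection is a conservative choice: a context is kept only when the judge has actually vouched for it on every criterion.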

You’ll also need to ensure that the remainin...
