Training and Finetuning Embedding Models with Sentence Transformers v3
Published May 28, 2024
Sentence Transformers is a Python library for using and training embedding models for a wide range of applications, such as retrieval augmented generation, semantic search, semantic textual similarity, paraphrase mining, and more. Its v3.0 update is the largest since the project's inception, introducing a new training approach. In this blog post, I'll show you how to use it to finetune Sentence Transformer models to improve their performance on specific tasks. You can also use this method to train new Sentence Transformer models from scratch.
Finetuning Sentence Transformers now involves several components, including datasets, loss functions, training arguments, evaluators, and the new trainer itself. I'll go through each of these components in detail and provide examples of how to use them to train effective models.
Table of Contents
Why Finetune?
Finetuning Sentence Transformer models can significantly enhance their performance on specific tasks. This is because each task requires a unique notion of similarity. Let's consider a couple of news article headlines as an example:
- "Apple launches the new iPad"
- "NVIDIA is gearing up for the next GPU generation"
Depending on the use case, we might want similar or dissimilar embeddings for these texts. For instance, a classification model for news articles could treat these texts as similar since they both belong to the Technology category. On the other hand, a semantic textual similarity or retrieval model should consider them dissimilar due to their distinct meanings.