Late Chunking in Long-Context Embedding Models
[New! Part II: a deep dive into boundary cues and misconceptions.](https://jina.ai/news/what-late-chunking-really-is-and-what-its-not-part-ii)
About a year ago, in October 2023, we released the world's first open-source embedding model with an 8K context length, jina-embeddings-v2-base-en. Since then, there has been considerable debate about the usefulness of long context in embedding models. For many applications, encoding a document thousands of words long into a single embedding representation is not ideal. Many use cases require retrieving smaller portions of the text, and dense vector-based retrieval systems often perform better with smaller text segments, as the semantics are less likely to be "over-compressed" in the embedding vectors.
Retrieval-Augmented Generation (RAG) is one of the most well-known applications that requires splitting documents into smaller text chunks (say within 512 tokens). These chunks are usually stored in a vector database, with vector representations generated by a text embedding model. During runtime, the same embedding model encodes a query into a vector representation, which is then used to identify relevant stored text chunks. These chunks are subsequently passed to a large language model (LLM), which synthesizes a response to the query based on the retrieved texts.
A typical RAG pipeline of chunking-embedding-retrieving-generating.
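To make the pipeline above concrete, here is a minimal sketch of the chunking, embedding, and retrieval steps in Python, using jina-embeddings-v2-base-en via sentence-transformers. The fixed-size token chunker, the in-memory index (a plain NumPy array standing in for a real vector database), and the top-3 cutoff are illustrative assumptions, not a prescribed implementation:

```python
# Minimal sketch of the chunking-embedding-retrieving steps of a RAG pipeline.
# The fixed-size chunker and in-memory index are illustrative choices.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-en", trust_remote_code=True
)

def chunk_by_tokens(text: str, max_tokens: int = 512) -> list[str]:
    """Naively split a document into chunks of at most `max_tokens` tokens."""
    tokens = model.tokenizer.tokenize(text)
    return [
        model.tokenizer.convert_tokens_to_string(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

document = "..."  # placeholder for a long document, thousands of words
chunks = chunk_by_tokens(document)

# Indexing: one embedding per chunk (stored in a vector database in practice).
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

# Query time: embed the query with the same model, rank chunks by cosine
# similarity (a dot product, since the vectors are normalized).
query_vector = model.encode("What is late chunking?", normalize_embeddings=True)
scores = chunk_vectors @ query_vector
top_chunks = [chunks[i] for i in np.argsort(-scores)[:3]]
# `top_chunks` would then be passed to an LLM to synthesize the answer.
```

In production, the chunker would typically respect sentence or paragraph boundaries rather than cutting at a fixed token count, and the chunk vectors would live in a vector database rather than an in-memory array.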
In short, embedding smaller chunks seems preferable, partly due to the limited input sizes of downstream LLMs, but also because there's a concern that important context...