Late Chunking in Long-Context Embedding Models
[New! Part II: a deep dive into boundary cues and misconceptions.](https://jina.ai/news/what-late-chunking-really-is-and-what-its-not-part-ii)
About a year ago, in October 2023, we released the world's first open-source embedding model with an 8K context length, jina-embeddings-v2-base-en. Since then, there has been considerable debate about the usefulness of long context in embedding models. For many applications, encoding a document thousands of words long into a single embedding representation is not ideal. Many use cases require retrieving smaller portions of the text, and dense vector-based retrieval systems often perform better with smaller text segments, as the semantics are less likely to be "over-compressed" in the embedding vectors.
Retrieval-Augmented Generation (RAG) is one of the most well-known applications that requires splitting documents into smaller text chunks (say within 512 tokens). These chunks are usually stored in a vector database, with vector representations generated by a text embedding model. During runtime, the same embedding model encodes a query into a vector representation, which is then used to identify relevant stored text chunks. These chunks are subsequently passed to a large language model (LLM), which synthesizes a response to the query based on the retrieved texts.
A typical RAG pipeline of chunking-embedding-retrieving-generating.
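To make the pipeline above concrete, here is a minimal sketch of the chunking, embedding, and retrieval steps in Python, using jina-embeddings-v2-base-en via sentence-transformers. The fixed-size token chunker, the in-memory index (a plain NumPy array standing in for a real vector database), and the top-3 cutoff are illustrative assumptions, not a prescribed implementation:

```python
# Minimal sketch of the chunking-embedding-retrieving steps of a RAG pipeline.
# The fixed-size chunker and in-memory index are illustrative choices.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-en", trust_remote_code=True
)

def chunk_by_tokens(text: str, max_tokens: int = 512) -> list[str]:
    """Naively split a document into chunks of at most `max_tokens` tokens."""
    tokens = model.tokenizer.tokenize(text)
    return [
        model.tokenizer.convert_tokens_to_string(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

document = "..."  # placeholder for a long document, thousands of words
chunks = chunk_by_tokens(document)

# Indexing: one embedding per chunk (stored in a vector database in practice).
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

# Query time: embed the query with the same model, rank chunks by cosine
# similarity (a dot product, since the vectors are normalized).
query_vector = model.encode("What is late chunking?", normalize_embeddings=True)
scores = chunk_vectors @ query_vector
top_chunks = [chunks[i] for i in np.argsort(-scores)[:3]]
# `top_chunks` would then be passed to an LLM to synthesize the answer.
```

In production, the chunker would typically respect sentence or paragraph boundaries rather than cutting at a fixed token count, and the chunk vectors would live in a vector database rather than an in-memory array.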
In short, embedding smaller chunks seems preferable, partly due to the limited input sizes of downstream LLMs, but also because there's a concern that important context...