Late Chunking: Balancing Precision and Cost in Long Context Retrieval

Large-scale RAG applications that require long context retrieval deal with a unique set of challenges. The volume of data is often huge, while the precision of the retrieval system is critical. However, ensuring high-quality retrieval in such systems is often at odds with cost and performance. For users this can present a difficult balancing act. But there may be a new solution that can even the scales.

Two weeks ago, JinaAI announced a new methodology to aid in long-context retrieval called late chunking. This article explores why late chunking may just be the happy medium between naive (but inexpensive) solutions and more sophisticated (but costly) solutions like ColBERT for users looking to build high-quality retrieval systems on long documents.

What is Late Chunking?

Chunking Strategies

Late chunking is a novel approach that aims to preserve contextual information across large documents by inverting the traditional order of embedding and chunking. The key distinction lies in when the chunking occurs:

Traditional approach: Chunk first, then embed each chunk separately.
Late chunking: Embed the entire document first, then chunk the embeddings.
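
To make the difference in ordering concrete, here is a minimal sketch in plain Python. The encoder is a hypothetical toy stand-in, not a real embedding model: it gives each token a 2-d vector whose second component depends on everything encoded in the same pass, which loosely imitates how a contextual model lets tokens influence one another. All function names and the sample sentence are illustrative assumptions.

```python
from statistics import fmean

def toy_contextual_encode(tokens):
    """Hypothetical stand-in for an embedding model: each token's vector is
    (its own length, mean token length of the whole input), so the second
    component reflects everything seen in the same encoding pass."""
    context = fmean(len(t) for t in tokens)
    return [[float(len(t)), context] for t in tokens]

def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    return [fmean(column) for column in zip(*vectors)]

def split(items, boundaries):
    """Cut a list into pieces at the given (start, end) index pairs."""
    return [items[start:end] for start, end in boundaries]

def traditional_chunking(tokens, boundaries):
    # Chunk first, then embed each chunk in isolation.
    return [mean_pool(toy_contextual_encode(chunk))
            for chunk in split(tokens, boundaries)]

def late_chunking(tokens, boundaries):
    # Embed the whole document once, then chunk the token embeddings.
    token_embeddings = toy_contextual_encode(tokens)
    return [mean_pool(chunk) for chunk in split(token_embeddings, boundaries)]

tokens = "Berlin is the capital of Germany . The city has many museums".split()
boundaries = [(0, 7), (7, 12)]  # two chunks, split after the first sentence

traditional = traditional_chunking(tokens, boundaries)
late = late_chunking(tokens, boundaries)

for i, (t, l) in enumerate(zip(traditional, late)):
    print(f"chunk {i}: traditional={t} late={l}")
```

In this toy setup the first component of each chunk vector is the same under both strategies, because it depends only on the chunk's own tokens. The second component differs: with the traditional order it only reflects the chunk itself, while with late chunking both chunks carry the same document-wide signal.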

This method utilizes a long context embedding model to create token embeddings for every token in a document. These token-level embeddings are then broken up and pooled into multiple embeddings representing each chunk in the text.
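
The "broken up and pooled" step can be sketched as follows, assuming mean pooling and assuming chunk boundaries are given as character ranges that must first be mapped onto token indices. The token offsets and embedding values below are made up for illustration; in practice they would come from the tokenizer and the long-context embedding model. The helper names chunk_token_spans and pool_chunks are hypothetical.

```python
def chunk_token_spans(token_offsets, chunk_char_spans):
    """For each chunk given as (char_start, char_end), return the
    (first_token, last_token + 1) range of tokens lying fully inside it."""
    spans = []
    for c_start, c_end in chunk_char_spans:
        inside = [i for i, (t_start, t_end) in enumerate(token_offsets)
                  if t_start >= c_start and t_end <= c_end]
        if not inside:
            raise ValueError(f"chunk {(c_start, c_end)} contains no tokens")
        spans.append((inside[0], inside[-1] + 1))
    return spans

def pool_chunks(token_embeddings, token_spans):
    """Mean-pool the token vectors inside each span into one chunk vector."""
    pooled = []
    for start, end in token_spans:
        rows = token_embeddings[start:end]
        pooled.append([sum(column) / len(rows) for column in zip(*rows)])
    return pooled

text = "Late chunking. Pool later."
# Character offsets of each token in `text` (illustrative, word-level).
token_offsets = [(0, 4), (5, 13), (13, 14), (15, 19), (20, 25), (25, 26)]
# One made-up 2-d embedding per token, as if produced in a single pass
# over the whole document.
token_embeddings = [
    [1.0, 0.0], [3.0, 2.0], [2.0, 4.0],
    [0.0, 1.0], [4.0, 1.0], [2.0, 1.0],
]
# Chunk boundaries in characters: one chunk per sentence.
chunk_char_spans = [(0, 14), (15, 26)]

spans = chunk_token_spans(token_offsets, chunk_char_spans)
chunk_vectors = pool_chunks(token_embeddings, spans)
print(spans)          # [(0, 3), (3, 6)]
print(chunk_vectors)  # [[2.0, 2.0], [2.0, 1.0]]
```

The document yields one vector per chunk rather than a single vector overall, and every chunk vector is built from token embeddings computed with the full document in view.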

In typical setups, all token embeddings would be pooled (mean, cls, etc.) into a single vector representation for the entire document. However, with the rise of RAG applications, there's growing concern that a single vector for long documents do...
