Advanced RAG — Using Gemini and long context to index rich documents (PDF, HTML…)

A very common question I get when presenting and talking about advanced RAG (Retrieval Augmented Generation) techniques is how to best index and search rich documents like PDFs or web pages, which contain both text and rich elements like pictures or diagrams.


Another very frequent question people ask me is about RAG versus long context windows. Indeed, models with long context windows usually have a more global understanding of a document, and can interpret each excerpt in its overall context. But of course, you can't feed all the documents of your users or customers into one single augmented prompt. RAG also has other advantages, like much lower latency and generally lower cost.


However, the answer I usually give is that you can take the best of both worlds, with a hybrid approach:


  • You can use a RAG approach (or a mix of keyword search, graph RAG, and vector-based RAG) to find the documents relevant to the user's query.
  • Then feed only those key documents into the context window of a model like Gemini that accepts 1+ million tokens.
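The two steps above can be sketched as follows. This is a minimal, hypothetical illustration: the keyword scorer stands in for whatever retrieval stack you use (keyword, graph, or vector based), and `build_long_context_prompt` simply packs the retrieved documents, whole, into a single prompt for a long-context model. All function and corpus names are made up for the example.

```python
# Toy hybrid RAG sketch: retrieve whole documents first, then build one
# long-context prompt containing only those documents (names are illustrative).

def score(query: str, document: str) -> int:
    """Count how many query keywords appear in the document (toy keyword search)."""
    return sum(1 for word in query.lower().split() if word in document.lower())

def retrieve(query: str, corpus: dict[str, str], top_k: int = 2) -> list[str]:
    """Return the names of the top_k most relevant documents."""
    ranked = sorted(corpus, key=lambda name: score(query, corpus[name]), reverse=True)
    return ranked[:top_k]

def build_long_context_prompt(query: str, corpus: dict[str, str], top_k: int = 2) -> str:
    """Feed only the retrieved documents, in full, to a long-context model."""
    docs = retrieve(query, corpus, top_k)
    context = "\n\n".join(f"## {name}\n{corpus[name]}" for name in docs)
    return f"Answer using the documents below.\n\n{context}\n\nQuestion: {query}"

corpus = {
    "invoices.pdf": "Quarterly invoices and billing details for customers.",
    "manual.pdf": "User manual describing installation and configuration steps.",
    "report.pdf": "Annual report with revenue figures and billing summaries.",
}
prompt = build_long_context_prompt("customer billing invoices", corpus)
```

In a real system, `prompt` would then be sent to a long-context model such as Gemini; the point is that the model sees a handful of complete documents rather than thousands of disconnected chunks.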

That way, the model can focus on whole documents, with a finer understanding of each one, without being overwhelmed by too many documents when searching for the needle in the haystack.


The currently trending topic of context engineering is exactly about this: it's crucial to give the best contextual information to your favorite LLM!


Using Gemini to finely index a rich PDF document for RAG search


Before feeding the relevant document into the context window of the LLM, we first need to index it.
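One way to structure such an index is sketched below. This is not the article's own pipeline (the original continues beyond this point); it is a hypothetical illustration, with the PDF parser and the multimodal call both stubbed out. In a real pipeline you would extract pages with a PDF library and ask a multimodal model like Gemini to describe each picture or diagram, then store every chunk with its document and page so a hit can map back to the whole document.

```python
# Illustrative indexing pass for a rich document (all names are hypothetical).
# Text and image descriptions become chunks tagged with their source document
# and page, so retrieval results can point back to the full document.

from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str   # which document the chunk came from
    page: int     # page number, so a hit can map back to the whole document
    kind: str     # "text" or "image_description"
    content: str  # extracted text, or a model-generated image description

def describe_image_stub(image_ref: str) -> str:
    """Stand-in for a multimodal model call that captions a picture/diagram."""
    return f"Diagram described from {image_ref}"

def index_document(doc_id: str, pages: list[dict]) -> list[Chunk]:
    """Turn each page's text and images into retrievable chunks."""
    chunks = []
    for page_no, page in enumerate(pages, start=1):
        if page.get("text"):
            chunks.append(Chunk(doc_id, page_no, "text", page["text"]))
        for image_ref in page.get("images", []):
            chunks.append(Chunk(doc_id, page_no, "image_description",
                                describe_image_stub(image_ref)))
    return chunks

pages = [
    {"text": "Architecture overview.", "images": ["fig1.png"]},
    {"text": "Deployment steps.", "images": []},
]
index = index_document("manual.pdf", pages)
```

Because every chunk carries `doc_id`, a match on either a text chunk or an image description can trigger loading the entire source document into the long-context prompt, as described earlier.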

