RAFT: Adapting Language Model to Domain Specific RAG
Tianjun Zhang ∗
Department of Computer Science
UC Berkeley
Berkeley, CA 94720, USA
{tianjunz}@berkeley.edu
Shishir G. Patil, Naman Jain, Sheng Shen
Department of Computer Science
UC Berkeley
Berkeley, CA 94720, USA
{shishirpatil,naman_jain,sheng.s}@berkeley.edu
Matei Zaharia, Ion Stoica, Joseph E. Gonzalez
Department of Computer Science
UC Berkeley
Berkeley, CA 94720, USA
{matei,istoica,jegonzal}@berkeley.edu
Abstract
Pretraining Large Language Models (LLMs) on large corpora of textual data is now a standard
paradigm. When using these LLMs for many downstream applications, it is common to
additionally incorporate new information into the pretrained model either through
RAG-based prompting or finetuning. However, the best methodology to incorporate information
remains an open question. In this paper, we present Retrieval Augmented
Fine Tuning (RAFT), a training recipe which improves the model’s ability
to answer questions in "open-book" in-domain settings. In training RAFT,
given a question and a set of retrieved documents, we train the model to
ignore those documents that don’t help in answering the question, which
we call distractor documents. RAFT accomplishes this by citing verbatim
the right sequence from the relevant document to help answer the question.
This, coupled with RAFT’s chain-of-thought-style responses, helps improve
the model’s ability to reason. In domain-specific RAG, RAFT consistently
improves the model’s performance across the PubMed, HotpotQA, and Gorilla
datasets, presenting a post-training recipe to improve pre-trained LLMs for
in-domain RAG.
1 Introduction
Trained on vast quantities of public data, Large Language Models (LLMs) have achieved
significant advances in a wide range of general-knowledge reasoning tasks Brown et al.
(2020); Wei et al. (2022). However, LLMs are increasingly being employed in specialized
domains to support tasks ranging from code completion for specific software frameworks
to question answering on specific document collections (e.g., legal or medical documents).
In these settings, general-knowledge reasoning is less critical; instead, the primary goal
is to maximize accuracy based on a given set of documents. Indeed, adapting LLMs to
specialized domains (e.g., recent news, enterprise private documents, or program resources
constructed after the training cutoff) is essential to many emerging applications (Vu et al.,
2023; Lazaridou et al., 2022) and is the focus of this work.
This paper studies the following question – How do we adapt pre-trained LLMs for Retrieval
Augmented Generation (RAG) in specialized domains?
When it comes to adapting LLMs to specialized domains, we consider the following two
candidates: in-context learning through Retrieval-Augmented Generation (RAG) and
supervised fine-tuning. RAG based methods allow the LLM to reference the documents when
answering questions. However, RAG based in-context learning methods fail to leverage
the learning opportunity afforded by the fixed domain setting and early access to the test
documents. Alternatively, supervised fine-tuning offers the opportunity to learn more
general patterns in the documents and better align to end tasks and user preferences Zhou
et al. (2023). However, existing fine-tuning based approaches either fail to leverage the
documents at test time (i.e., they do not incorporate RAG) or fail to account for the
imperfections in the retrieval process during training.

Figure 1: How best to prepare for an exam? (a) Fine-tuning based approaches implement
"studying" by either directly "memorizing" the input documents or answering practice
QA without referencing the documents. (b) Alternatively, in-context retrieval methods fail
to leverage the learning opportunity afforded by the fixed domain and are equivalent to
taking an open-book exam without studying. In contrast, our approach (c) RAFT leverages
fine-tuning with question-answer pairs while referencing the documents in a simulated
imperfect retrieval setting — thereby effectively preparing for the open-book exam setting.

∗ Corresponding author, personal website: tianjunz.github.io
We can draw an analogy to an open-book exam. Existing in-context retrieval methods are
equivalent to taking an open-book exam without studying. Alternatively, existing
fine-tuning based approaches implement “studying” by either directly “memorizing” Xiong
et al. (2023) the input documents or answering practice questions Wang et al. (2022) without
referencing the documents. While these approaches leverage in-domain learning, they fail to
prepare for the open-book nature of the test setting.
In this paper, we study how to combine instruction fine-tuning (IFT) with retrieval
augmented generation (RAG). We propose a novel adaptation strategy – Retrieval-Augmented
Fine Tuning (RAFT). RAFT specifically addresses the challenge of fine-tuning LLMs to both
incorporate domain knowledge and improve in-domain RAG performance. RAFT
aims not only to enable models to learn domain-specific knowledge through fine-tuning,
but also to ensure robustness against distracting retrieved information. This is achieved
by training the models to understand the dynamics between the question (prompt), the
domain-specific documents retrieved, and the right answer. Going back to our open-book
exam analogy, our approach is analogous to studying for an open-book exam by learning to
recognize relevant and irrelevant retrieved documents.
In RAFT, we train the model to answer the question (Q) from Document(s) (D*) to generate
an answer (A*), where A* includes chain-of-thought reasoning Wei et al. (2022); Anthropic
(2023), in the presence of distractor documents (D_k). We explain the methodology in
Section 3 and analyze the sensitivity to the number of distractor documents (k) at train and
test time in Section 5. RAFT consistently outperforms supervised fine-tuning, both with
and without RAG, across PubMed Dernoncourt & Lee (2017), HotPot QA Yang et al. (2018),
and the HuggingFace Hub, Torch Hub, and TensorFlow Hub Gorilla datasets Patil et al. (2023),
presenting a novel yet simple technique to improve pre-trained LLMs for in-domain RAG.
Our code is available at https://github.com/ShishirPatil/gorilla.
2 LLMs for Open-Book Exam
To understand our goal better, we expand on our analogy between training an LLM and
the real-world setting of preparing for an exam.
Closed-Book Exam A closed book exam often refers to the scenario where the LLMs do
not have access to any additional documents or references to answer the questions during
the exam. For LLMs, this is equivalent to the scenario, for example, in which the LLM is
used as a chatbot. In this scenario, the LLM draws from the knowledge baked in during
pre-training and supervised finetuning to respond to the user’s prompt.

Figure 2: Overview of our RAFT method. The top-left figure depicts our approach of
adapting LLMs to read solutions from a set of positive and distractor documents, in
contrast to the standard RAG setup where models are trained based on the retriever outputs,
which is a mixture of both memorization and reading. At test time, all methods follow the
standard RAG setting, with the top-k retrieved documents provided in the context.
Open Book Exam In contrast, we liken the open-book exam setting to the scenario in
which the LLM can refer to external sources of information (e.g., a website or a book chapter).
In such scenarios, the LLM is typically paired with a retriever which retrieves ‘k’ documents
(or specific segments of the documents), which are appended to the user’s prompt. It is
only through these retrieved documents that the LLM gains access to “domain-specific
information”. As a result, we argue that the LLM’s performance in these settings, where it
is trained as a general-purpose LLM, is largely dependent on the quality of the retriever and
how accurately the retriever can identify the most relevant pieces of information.
Domain-Specific Open-Book Exam In this paper, we focus on a narrower but increasingly
popular setting than the general open-book exam, which we call the domain-specific
open-book exam. Here, we know a priori the domain in which the LLM will be tested. The
LLM can respond to the user’s prompt using any and all information from this specific
domain, which it has been fine-tuned on. Examples of specific domains include
enterprise documents, code repositories belonging to an organization, etc. In all these
scenarios, the LLM will be used to respond to questions whose answers can be found
within a collection of documents. The retrieval technique itself has little to no impact on the
mechanism (though it may impact the accuracy). This paper studies the domain-specific
open-book setting and how to adapt a pretrained LLM to this specific domain, including
how to make it more robust to a varying number of retrieved documents and distractors.
3 RAFT
In this section, we present RAFT, a novel way of training LLMs for domain-specific open-
book exams. We first introduce the classical technique of supervised fine-tuning, followed
by the key takeaways from our experiments. Then, we introduce RAFT, a modified
version of general instruction tuning. Lastly, we provide an overview of the experiments to
expect in the later sections.
Supervised Finetuning
Consider the supervised fine-tuning (SFT) setting for a Question-Answer dataset. The
formulation consists of the Dataset (D) from which a set of Question (Q) and corresponding
Answer (A) pairs are derived or already available. In the classical SFT setting, the model is
trained to improve its ability to answer the questions based on its knowledge, obtained
either during pre-training or during the SFT training phase. The model so trained can also
be used at test time in the Retrieval Augmented Generation (RAG) setting, where additional
documents can be introduced in the prompt to help the model answer the question. This
can be represented as follows:
{Train: Q → A}, {0-shot Inference: Q → A}, {RAG Inference: Q + D → A}
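As a minimal sketch (with illustrative prompt formats and helper names, not the paper's released training code), the three settings above can be written as:

# Sketch of the SFT baseline formats: training without documents, then either
# 0-shot or RAG-style inference. Formatting details are illustrative assumptions.
def sft_train_example(question, answer):
    # Train: Q -> A (no documents in the context).
    return {"prompt": f"Question: {question}\nAnswer:", "target": f" {answer}"}

def zero_shot_prompt(question):
    # 0-shot Inference: Q -> A.
    return f"Question: {question}\nAnswer:"

def rag_inference_prompt(question, documents):
    # RAG Inference: Q + D -> A (retrieved documents added only at test time).
    context = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(documents))
    return f"{context}\n\nQuestion: {question}\nAnswer:"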
RAFT: Retrieval Augmented Fine-Tuning (RAFT) presents a novel recipe to prepare fine-
tuning data that tailors the model for the domain-specific open-book setting, equivalent to
in-domain RAG. In RAFT, we prepare the training data such that each data point contains a
question (Q), a set of documents (D_k), and a corresponding Chain-of-Thought-style answer
(A*) generated from one of the documents (D*). We differentiate between two types of
documents: ‘golden’ documents (D*), i.e. the documents from which the answer to the
question can be deduced, and ‘distractor’ documents (D_i) that do not contain answer-
relevant information. As an implementation detail, the ‘golden’ document doesn’t need to
be a single document, but can be more than one document, as is the case in HotpotQA Yang
et al. (2018). Then, for a fraction P of the questions (q_i) in the dataset, we retain the golden
document (d*_i) along with (k − 1) distractor documents. For the remaining (1 − P) fraction
of the questions (q_i), we include no golden document and only include distractor documents.
We then fine-tune the language model using the standard supervised training (SFT)
technique, training it to generate answers from the provided documents and question. Fig. 2
illustrates the high-level design principle of RAFT.
We demonstrate that our RAFT approach trains the model to perform better RAG on the set
of documents it is trained on, i.e., in-domain. By removing the golden documents in some
instances, we are compelling the model to memorize answers instead of deriving them from
the context. The training data for RAFT is as follows, and an example training data point can
be seen in Fig. 3:

P% of data: Q + D* + D_1 + D_2 + . . . + D_k → A*
(1 − P)% of data: Q + D_1 + D_2 + . . . + D_k → A*
Subsequently, for the test scenario, the model is provided with the Q and top-k documents
retrieved by the RAG pipeline. Note that RAFT is independent of the retriever used.
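The construction above can be sketched as follows, assuming a list of (question, golden document, CoT answer) triples and a corpus of document chunks to sample distractors from; build_raft_dataset and its arguments are illustrative names, not the released implementation:

import random

def build_raft_dataset(qa_triples, corpus, p=0.8, num_distractors=4, seed=0):
    # qa_triples: list of (question, golden_doc, cot_answer) tuples.
    # corpus: list of all document chunks from which distractors are sampled.
    rng = random.Random(seed)
    dataset = []
    for question, golden_doc, cot_answer in qa_triples:
        distractor_pool = [d for d in corpus if d != golden_doc]
        if rng.random() < p:
            # P fraction: golden document plus sampled distractors.
            context_docs = [golden_doc] + rng.sample(distractor_pool, num_distractors)
        else:
            # (1 - P) fraction: distractors only, so the model must rely on
            # knowledge it has internalized rather than the context. Keeping the
            # total document count equal here is an illustrative choice.
            context_docs = rng.sample(distractor_pool, num_distractors + 1)
        rng.shuffle(context_docs)  # avoid the golden document always appearing first
        dataset.append({"question": question,
                        "context": context_docs,
                        "answer": cot_answer})  # the target A* is the CoT answer either way
    return dataset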
A key factor in enhancing training quality is the generation of a reasoning process, such
as Chain-of-Thought, to explain the provided answers. The RAFT approach is similar: we
demonstrate that creating a full reasoning chain and, in addition, clearly citing sources
enhances the model’s accuracy in answering questions. In Fig. 3, we illustrate this set-up.
Generating the training data in this fashion involves presenting the model with a
question, context, and verified answers, and then requesting it to form a reasoning chain
that appropriately references the original context.
For all the datasets in our experiments, we generate the answers using the technique
described above. Note that the Gorilla APIBench dataset already includes reasoning
in the answers. We provide an example of the generation step in Fig. 3: the detailed
reasoning answer includes a citation from the original context inside ##begin_quote## and
##end_quote##, as well as a detailed explanation of how to reach the conclusion based on
the citations. We demonstrate that adding detailed reasoning paragraphs can help boost the
model’s performance in our experiment section.
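As an illustration, the generation prompt in Fig. 3 can be assembled roughly as follows; generate stands in for whichever LLM completion API is used (Section 4.2 mentions GPT-4-1106), and the quoting instruction is paraphrased from the text rather than copied from the exact prompt:

def make_cot_generation_prompt(question, context, answer):
    # Builds a Fig. 3-style prompt asking for a reasoning chain that quotes the
    # context between ##begin_quote## and ##end_quote## markers.
    return (
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"Answer: {answer}\n\n"
        "Instruction: Given the question, context and answer above, provide a "
        "logical reasoning for that answer. Quote the relevant parts of the "
        "context verbatim between ##begin_quote## and ##end_quote##. "
        "Please use the format of: ##Reason: {reason} ##Answer: {answer}."
    )

def generate_cot_answer(question, context, answer, generate):
    # `generate` maps a prompt string to the generator model's completion.
    return generate(make_cot_generation_prompt(question, context, answer))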
4 Evaluation
We design our experiments to study how well RAFT performs compared to various
baselines. We find that the RAFT-7B model (a finetuned version of LLaMA2-7B) is better at
reading and extracting information from in-domain documents than a domain-specific
finetuned model and a general-purpose model with RAG. As an ablation, we also demonstrate
how important it is for the model to learn with Chain-of-Thought responses. In this section,
we first introduce all the datasets we used in the experiments, then all the baseline
models and fine-tuning techniques that we benchmark against.
Question: The Oberoi family is part of a hotel company that has a head office
in what city?
context: [The Oberoi family is an Indian family that is famous for its
involvement in hotels, namely through The Oberoi Group]...[It is located in
city center of Jakarta, near Mega Kuningan, adjacent to the sister JW Marriott
Hotel. It is operated by The Ritz-Carlton Hotel Company. The complex has two
towers that comprises a hotel and the Airlangga Apartment respectively]...[The
Oberoi Group is a hotel company with its head office in Delhi.]
Instruction: Given the question, context and answer above, provide a logical
reasoning for that answer. Please use the format of: ##Reason: {reason}
##Answer: {answer}.
--------------------------------------------------------------------------------------
CoT Answer: ##Reason: The document ##begin_quote## The Oberoi family is an
Indian family that is famous for its involvement in hotels, namely through The
Oberoi Group. ##end_quote## establishes that the Oberoi family is involved in
the Oberoi group, and the document ##begin_quote## The Oberoi Group is a hotel
company with its head office in Delhi. ##end_quote## establishes the head
office of The Oberoi Group. Therefore, the Oberoi family is part of a hotel
company whose head office is in Delhi. ##Answer: Delhi
Figure 3: RAFT prompt to help LLM evaluate its own generated reasoning and answers,
contrasting them with the correct reasoning and answers. The LLM is prompted to identify
errors in its reasoning and extract key insights for improvement. This figure specifically
represents the ‘GenerateExplanation‘ step in the RAFT algorithm (Section 3).
Table 1: RAFT improves RAG performance for all specialized domains: Across PubMed,
HotPot QA, HuggingFace, Torch Hub, and TensorFlow Hub, we see that domain-specific
fine-tuning significantly improves the performance of the base model, and RAFT consistently
outperforms the existing domain-specific finetuning method with or without RAG. This
suggests the need to train the model with context. We compare our model with LLaMA
finetuning recipes, and provide GPT-3.5 for reference.
                   PubMed   HotPot   HuggingFace   Torch Hub   TensorFlow
GPT-3.5 + RAG       71.60    41.5       29.08        60.21       65.59
LLaMA2-7B           56.5      0.54       0.22         0           0
LLaMA2-7B + RAG     58.8      0.03      26.43        08.60       43.06
DSF                 59.7      6.38      61.06        84.94       86.56
DSF + RAG           71.6      4.41      42.59        82.80       60.29
RAFT (LLaMA2-7B)    73.30    35.28      74.00        84.95       86.86
Datasets In our experiments, we use the following datasets to evaluate our model and
all baselines. We selected these datasets to represent both popular and diverse domains,
including Wikipedia, coding/API documents, and question answering on medical documents.
Natural Questions (NQ) Kwiatkowski et al. (2019), Trivia QA Joshi et al. (2017) and
HotpotQA Yang et al. (2018) are open-domain question-answering datasets based on Wikipedia,
mainly focused on common knowledge (e.g., movies, sports, etc.). HuggingFace, Torch Hub,
and TensorFlow Hub are from APIBench Patil et al. (2023), proposed in the Gorilla paper.
These benchmarks measure how well a model generates correct, functional, and executable API
calls based on the documentation. PubMed QA Jin et al. (2019) is a question-answering
dataset tailored only for biomedical-research question answering. It mainly focuses on
answering medical and biology questions based on a given set of documents. We would
like to highlight that NQ, Trivia QA, and HotpotQA are relatively general-domain, whereas
the coding/API and biomedical datasets are based on domain-specific documents.
Baselines We consider the following baselines for our experiments:
• LLaMA2-7B-chat model with 0-shot prompting: this is the commonly used
instruction-finetuned model for QA tasks, where we provide clearly written instructions,
but no reference documentation.
• LLaMA2-7B-chat model with RAG (Llama2 + RAG): similar to the previous setting,
except here we include reference documents. This is a popular technique when
dealing with domain-specific QA tasks.
• Domain-Specific Finetuning with 0-shot prompting (DSF): standard supervised
finetuning, without documents in context. We find that it is mostly useful for aligning
the answering style of the model as well as familiarizing it with the domain context.
• Domain-Specific Finetuning with RAG (DSF + RAG): equip a domain-specific
finetuned model with external knowledge using RAG. So, for the “knowledge” the
model does not know, it can still refer to the context.
4.1 Results
Using the above datasets and baselines, we evaluate our model RAFT and demonstrate
the effectiveness of RAFT in Tab. 1. We see that RAFT consistently and significantly
outperforms the baselines. Compared with the base LLaMA2 instruction-tuned model,
RAFT with RAG does much better in terms of extracting information as well as being
robust towards distractors. The gain can be as big as 35.25% on Hotpot QA and 76.35% on
the Torch Hub evaluation. Compared with DSF on the specific dataset, our model does better at
relying on the provided context to solve the problem. RAFT does much better on tasks
like the Hotpot and HuggingFace datasets (30.87% on Hotpot and 31.41% on HuggingFace).
Note that for PubMed QA, since it consists of binary yes/no questions, we don’t observe
significant gains when we compare our model with DSF + RAG. Even compared with a much
larger and better model, GPT-3.5, RAFT demonstrates significant advantages.
Overall, the LLaMA2-7B model, both with and without RAG, performs poorly due to its
answering style not aligning with the ground truth. By applying domain-specific tuning,
we significantly enhance its performance. This process enables the model to learn and adopt
the appropriate style of answering. However, introducing RAG to a domain-specifically
fine-tuned (DSF) model doesn’t invariably lead to better outcomes. This might indicate that
the model lacks training in processing context and extracting useful information from it. By
incorporating our method, RAFT, we train the model not only to match its answering style
with that required but also to improve its document-processing capabilities. Consequently,
our approach outperforms all others.
4.2 Effect of CoT
We also conduct an analysis to evaluate the effectiveness of the Chain-of-Thought approach
in enhancing the model’s performance. As indicated in Table 2, simply providing the answer
to a question may not always be adequate. This approach can lead to a rapid decrease
in loss, resulting in the model beginning to overfit. Incorporating a reasoning chain that
not only guides the model to the answer but also enriches the model’s understanding can
improve the overall accuracy and prevent overfitting to concise answers. In our experiments,
integrating the Chain-of-Thought significantly enhances training robustness. We employ
GPT-4-1106 to generate our Chain-of-Thought prompts and include an example of the
prompt we used in Figure 3.
Table 2: Ablation on Chain-of-Thought: results for RAFT and RAFT without CoT. Results
on various datasets show that adding CoT can significantly improve the performance of the
finetuned model, with gains of 9.66% and 14.93% on the HotpotQA and HuggingFace
datasets respectively.

                 PubMed   HotpotQA   HuggingFace   Torch Hub   TensorFlow
RAFT w.o. CoT     68.30     25.62       59.07        86.56       83.21
RAFT              73.30     35.28       74.00        84.95       86.86

4.3 Qualitative Analysis

To illustrate the potential advantages of RAFT over the domain-specifically fine-tuned
(DSF) approach, we present a comparative example in Figure 4. This example qualitatively
demonstrates a scenario where the DSF model becomes confused by a question asking for
the identity of a screenwriter. Instead of providing the correct name, it mistakenly cites one
of the films written by the screenwriter. In contrast, the RAFT model accurately answers the
question. This discrepancy suggests that training a model solely with question-answer pairs
may impair its ability to derive relevant context from provided documents. The comparison
underscores the importance of incorporating both standard instructional tuning and context
comprehension into the training dataset to preserve and enhance the model’s ability to
process text effectively.
HotPot QA
Question: What screenwriter with credits for “Evolution” co-wrote a film starring
Nicolas Cage and Téa Leoni?
Documents: . . . David Weissman is a screenwriter and director. His film credits
include “The Family Man” (2000), “Evolution” (2001), and “When in Rome” (2010).
...
The Family Man is a 2000 American romantic comedy-drama film directed by Brett
Ratner, written by David Diamond and David Weissman, and starring Nicolas Cage
and Téa Leoni. . . .
--------------------------------------------------------------------------------------
RAFT: ##Reason: The screenwriter with credits for the film “Evolution,” starring
Nicolas Cage and Téa Leoni, is David Weissman. This information is provided in the
reference documents where it mentions David Weissman as a screenwriter with film
credits including “The Family Man” (2000), “Evolution” (2001), and “When in Rome”
(2010). Therefore, the screenwriter for “Evolution” is David Weissman. ##Answer:
David Weissman ✓
DSF: The Family Man ✗
Figure 4: Comparison of RAFT and DSF: On the HotPot QA dataset, we can see that the DSF
model extracts the wrong information from the context: when the question asks who the
screenwriter is, it answers with a film name. RAFT manages to produce the accurate result.
4.4 Should we train the LLM always with the golden context for RAG?
In our exploration of whether large language models (LLMs) should always be trained with
the golden context for Retrieval-Augmented Generation (RAG), we address a key question:
what proportion (P%) of the training data should include golden documents? Intuitively,
one might assume that for effective training in reading and extracting information from
context (e.g., RAG tasks), the golden document should always be included during training
(P = 100%). However, our findings challenge this assumption: incorporating a portion of
the training data without the golden document in the context (P = 80%) appears to enhance
the model’s performance on RAG tasks.
[Figure 5 plots: accuracy vs. P% golden retrieved context at training, for the test domains NQ, TQA, and HotpotQA.]
Figure 5: How many golden documents to involve? We study the hyperparameter P%,
which indicates what fraction of the training data includes the golden document. Results
on NQ, TQA and HotpotQA suggest that mixing in some data whose context does not
contain the golden document is helpful for in-domain RAG.
Figure 5 presents our investigation into the hyperparameter P%, which represents the
percentage of training instances that include golden documents. We find that the
optimal proportion varies across datasets, with the best P% being 40%, 60%, and 100%. This
indicates that training your LLM without the correct corresponding context at times can be
beneficial for the downstream task of answering questions related to the documents. In our
training setup, we include four distractor documents alongside the golden document, and at
test time, we maintain this format by providing the golden document with four distractors.
Our findings suggest that, for domain-specific RAG tasks, including a certain percentage of
training data without the golden documents in the context proves to be advantageous.
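A sweep over P can be sketched by reusing the build_raft_dataset sketch from Section 3; finetune and evaluate_rag below are hypothetical stand-ins for the fine-tuning and top-k RAG evaluation steps, not functions from the released code:

def sweep_golden_fraction(qa_triples, corpus, finetune, evaluate_rag,
                          p_values=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    # For each candidate P, build a RAFT-style dataset, fine-tune, and measure
    # accuracy in the standard RAG setting (golden document plus 4 distractors here).
    results = {}
    for p in p_values:
        train_data = build_raft_dataset(qa_triples, corpus, p=p, num_distractors=4)
        model = finetune(train_data)
        results[p] = evaluate_rag(model)
    return results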
5 RAFT Generalizes to Top-K RAG
We now study another important problem: how does the number of distractor documents
in RAFT affect the model’s performance when augmented with top-k RAG results during
evaluation? Previous research has highlighted the vulnerability of LLMs to irrelevant text
(see studies (Shi et al., 2023a; Weston & Sukhbaatar, 2023; Liu et al., 2023)). This issue is
particularly critical for LLMs + RAG, since top-k RAG is frequently employed at test time to
ensure high recall. Such a scenario necessitates that the model be able to discern and
disregard irrelevant content, focusing solely on pertinent information.
5.1 Making Model Robust to top-K RAG
In tackling the challenge of enhancing large language models’ (LLMs) ability to sift through
irrelevant text within the retrieval pipeline, our analysis revealed that training solely with
golden (highly relevant) documents can inadvertently diminish the model’s ability to
discern and disregard irrelevant information. To address this, our algorithm, RAFT, adopts
a strategy that integrates golden documents with a mix of irrelevant ones. This methodology
prompts us to investigate the ideal fraction of distractor (irrelevant) documents to
incorporate throughout the training process and to assess how well this training approach
adapts to different volumes of documents encountered by the Retrieval-Augmented
Generation (RAG) pipeline during the test phase. Our aim is to refine the balance between
relevant and irrelevant information to strengthen the model’s efficiency in identifying and
utilizing pertinent content. Notice that Sec. 4.4 studied what fraction (P%) of the training data
should include the golden document, while in this section, we study test-time scenarios.
Training with Distractor Documents To enhance the robustness of LLMs against irrelevant
text in retrieved documents, we adopted a finetuning approach that incorporates both
golden (highly relevant) documents and distractor (irrelevant) documents. The model was
trained with varying numbers of distractor documents, but consistently evaluated using
the top-3 documents obtained from the retriever - not to be confused with P. Our findings,
detailed in Fig. 6, reveal that finetuning with only the golden document frequently results in
inferior performance compared to configurations that include a greater number of distractor
documents. As we can see in the figure, the best-performing configuration for Natural Questions is
training with D* + 3D, while for Hotpot QA it is D* + 1D. This insight has been
particularly beneficial for our algorithm, RAFT. In our experiments, we consistently employ
a training setup consisting of one golden document alongside four distractor documents.

Figure 6: Varying test-time documents: To analyze how robust RAFT is to a varying number
of test-time documents, we study three domains – NQ, Trivia QA and HotPot QA. In NQ,
we find that training with 4 documents leads to optimal performance, and this changes to 3
and 2 for Trivia QA and HotPot QA respectively. However, we see that training with
only golden documents leads to poor performance. (Plots: accuracy vs. number of test
documents (top-k) for Natural Questions and Hotpot QA, with curves for training on D*,
D* + 1D, D* + 2D, and D* + 3D.)
Generalization to a variable number of test-time documents. We extended our research
to examine the impact of different quantities of test-time documents on the model’s
performance. Specifically, our experiments focused on assessing how models trained with
varying numbers of distractor documents respond to changes in the number of documents
presented at test time. The results, illustrated in Fig. 6, confirm that the inclusion of distractor
documents during training indeed makes the model more resilient to fluctuations in the
number of documents encountered during testing. This ability to maintain consistent
performance despite variations in test-time document numbers further validates the robustness of
our approach, RAFT. This finding underscores the importance of a well-calibrated training
environment to prepare the model for the range of scenarios it may encounter in the real world.
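The test-time study can be sketched as a simple evaluation loop; retrieve_top_k, model_answer, and is_correct are hypothetical helpers (retriever call, inference with the retrieved context, and answer matching), not part of the released code:

def eval_vs_top_k(eval_set, retrieve_top_k, model_answer, is_correct,
                  k_values=(2, 4, 6, 8, 10)):
    # eval_set: list of (question, reference_answer) pairs.
    accuracy_by_k = {}
    for k in k_values:
        correct = 0
        for question, reference in eval_set:
            docs = retrieve_top_k(question, k)      # top-k retrieved documents
            prediction = model_answer(question, docs)
            correct += int(is_correct(prediction, reference))
        accuracy_by_k[k] = correct / len(eval_set)
    return accuracy_by_k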
6 Related Works
Retrieval-Augmented Language Models Retrieval-Augmented Language Models (RALMs)
enhance LLMs by integrating a retrieval module that sources relevant information from
external knowledge bases, significantly improving performance across various NLP tasks,
including language modeling (Guu et al., 2020; Borgeaud et al., 2022; Khandelwal et al.,
2019; Shi et al., 2023d; Lin et al., 2023b; Shi et al., 2023c; Asai et al., 2023; Xu et al., 2023;
Wang et al., 2023) and open-domain question answering (Izacard et al., 2023; Lewis et al.,
2020). For instance, Atlas (Izacard et al., 2023) fine-tunes T5 models with the retriever,
treating documents as latent variables, while RETRO (Borgeaud et al., 2022) modifies the
decoder-only architecture to include retrieved texts and conducts pre-training from scratch.
kNN-LM (Khandelwal et al., 2019) interpolates between the LM’s next token distribution
and distributions computed from retrieved tokens at inference. (Shi et al., 2023d; Ram
et al., 2023) assume black-box access to an LLM, combining it with either an off-the-shelf or a
fine-tuned retriever.
Memorization A key question around large neural language models is whether they truly
“understand” text (Feldman, 2020; Power et al., 2022) or simply rely on surface pattern
memorization (Carlini et al., 2019; Tänzer et al., 2022). (Feldman, 2020; Carlini et al., 2019;
2022) develop methodologies to quantify the extent of memorization in neural models.
(Brown et al., 2020; Power et al., 2022; Liu et al., 2022) further explored how memorization
impacts the models’ generalization capabilities. (Carlini et al., 2021; Shi et al., 2023b)
demonstrated the ability of language models to memorize and regurgitate training data,
raising significant privacy concerns (Kandpal et al., 2022; Pan et al., 2020).
Finetuning for RAG More recently, several papers have explored the idea of fine-tuning a
pretrained LLM to be better at RAG tasks (Lin et al., 2023a; Wang et al., 2023; Xu
et al., 2023; Liu et al., 2024). These works focus on constructing finetuning datasets for
RAG and training a model to perform well on these tasks. In particular, in their settings,
the domain or documents at test time can be different from those at training time,
whereas our paper studies the somewhat opposite scenario where we only care about
testing the LLM on the same set of documents.
7 Conclusion
RAFT is a training strategy designed to enhance the model’s performance in answering
questions within a specific domain, in "open-book" settings. We highlight several crucial
design decisions, such as training the model alongside distractor documents, organizing the
dataset so that a portion lacks golden documents in the context, and formulating answers in a
chain-of-thought manner with direct quotations from the relevant text. Our evaluations on
PubMed, HotpotQA, and Gorilla APIBench underline RAFT’s significant potential.
References
Anthropic. Prompt engineering for Claude’s long context window. 2023.
Asai, A., Wu, Z., Wang, Y., Sil, A., and Hajishirzi, H. Self-rag: Learning to retrieve, generate,
and critique through self-reflection. arXiv preprint arXiv:2310.11511, 2023.
Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., Van Den Driessche,
G. B., Lespiau, J.-B., Damoc, B., Clark, A., et al. Improving language models by
retrieving from trillions of tokens. In International conference on machine learning, pp.
2206–2240. PMLR, 2022.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A.,
Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances
in neural information processing systems, 33:1877–1901, 2020.
Carlini, N., Liu, C., Erlingsson, Ú., Kos, J., and Song, D. The secret sharer: Evaluating and
testing unintended memorization in neural networks. In 28th USENIX Security Symposium
(USENIX Security 19), pp. 267–284, 2019.
Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A.,
Brown, T., Song, D., Erlingsson, U., et al. Extracting training data from large language
models. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2633–2650, 2021.
Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., and Zhang, C. Quantifying
memorization across neural language models. In The Eleventh International Conference on
Learning Representations, 2022.
Dernoncourt, F. and Lee, J. Y. Pubmed 200k rct: a dataset for sequential sentence classification
in medical abstracts. arXiv preprint arXiv:1710.06071, 2017.
Feldman, V. Does learning require memorization? a short tale about a long tail. In Proceedings
of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 954–959, 2020.
Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M. Retrieval augmented language model
pre-training. In International conference on machine learning, pp. 3929–3938. PMLR, 2020.
Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J.,
Joulin, A., Riedel, S., and Grave, E. Atlas: Few-shot learning with retrieval augmented
language models. Journal of Machine Learning Research, 24(251):1–43, 2023. URL http:
//jmlr.org/papers/v24/23-0037.html.
Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W., and Lu, X. Pubmedqa: A dataset for biomedical
research question answering. arXiv preprint arXiv:1909.06146, 2019.
Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. Triviaqa: A large scale distantly
supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551,
2017.
Kandpal, N., Wallace, E., and Raffel, C. Deduplicating training data mitigates privacy risks
in language models. In International Conference on Machine Learning, pp. 10697–10707.
PMLR, 2022.
Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., and Lewis, M. Generalization
through memorization: Nearest neighbor language models. arXiv preprint
arXiv:1911.00172, 2019.
Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D.,
Polosukhin, I., Devlin, J., Lee, K., et al. Natural questions: a benchmark for question
answering research. Transactions of the Association for Computational Linguistics, 7:453–466,
2019.
Lazaridou, A., Gribovskaya, E., Stokowiec, W., and Grigorev, N. Internet-augmented
language models through few-shot prompting for open-domain question answering.
arXiv preprint arXiv:2203.05115, 2022.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M.,
Yih, W.-t., Rocktäschel, T., et al. Retrieval-augmented generation for knowledge-intensive
nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
Lin, X. V., Chen, X., Chen, M., Shi, W., Lomeli, M., James, R., Rodriguez, P., Kahn, J., Szilvasy,
G., Lewis, M., et al. Ra-dit: Retrieval-augmented dual instruction tuning. arXiv preprint
arXiv:2310.01352, 2023a.
Lin, X. V., Chen, X., Chen, M., Shi, W., Lomeli, M., James, R., Rodriguez, P., Kahn, J., Szilvasy,
G., Lewis, M., et al. Ra-dit: Retrieval-augmented dual instruction tuning. arXiv preprint
arXiv:2310.01352, 2023b.
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. Lost
in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172,
2023.
Liu, Z., Kitouni, O., Nolte, N. S., Michaud, E., Tegmark, M., and Williams, M. Towards
understanding grokking: An effective theory of representation learning. Advances in
Neural Information Processing Systems, 35:34651–34663, 2022.
Liu, Z., Ping, W., Roy, R., Xu, P., Shoeybi, M., and Catanzaro, B. Chatqa: Building gpt-4 level
conversational qa models. arXiv preprint arXiv:2401.10225, 2024.
Pan, X., Zhang, M., Ji, S., and Yang, M. Privacy risks of general-purpose language models.
In 2020 IEEE Symposium on Security and Privacy (SP), pp. 1314–1331. IEEE, 2020.
Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. Gorilla: Large language model connected
with massive apis. arXiv preprint arXiv:2305.15334, 2023.
Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. Grokking: Generalization
beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022.
Ram, O., Levine, Y., Dalmedigos, I., Muhlgay, D., Shashua, A., Leyton-Brown, K.,
and Shoham, Y. In-context retrieval-augmented language models. arXiv preprint
arXiv:2302.00083, 2023.
Shi, F., Chen, X., Misra, K., Scales, N., Dohan, D., Chi, E. H., Schärli, N., and Zhou, D. Large
language models can be easily distracted by irrelevant context. In International Conference
on Machine Learning, pp. 31210–31227. PMLR, 2023a.
Shi, W., Ajith, A., Xia, M., Huang, Y., Liu, D., Blevins, T., Chen, D., and Zettlemoyer, L.
Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789,
2023b.
Shi, W., Min, S., Lomeli, M., Zhou, C., Li, M., Lin, V., Smith, N. A., Zettlemoyer, L., Yih, S.,
and Lewis, M. In-context pretraining: Language modeling beyond document boundaries.
arXiv preprint arXiv:2310.10638, 2023c.
Shi, W., Min, S., Yasunaga, M., Seo, M., James, R., Lewis, M., Zettlemoyer, L., and Yih, W.-t.
Replug: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652,
2023d.
Tänzer, M., Ruder, S., and Rei, M. Memorisation versus generalisation in pre-trained
language models. In Proceedings of the 60th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pp. 7564–7578, 2022.
Vu, T., Iyyer, M., Wang, X., Constant, N., Wei, J., Wei, J., Tar, C., Sung, Y.-H., Zhou, D., Le,
Q., et al. Freshllms: Refreshing large language models with search engine augmentation.
arXiv preprint arXiv:2310.03214, 2023.
Wang, B., Ping, W., McAfee, L., Xu, P., Li, B., Shoeybi, M., and Catanzaro, B. Instructretro:
Instruction tuning post retrieval-augmented pretraining. arXiv preprint arXiv:2310.07713,
2023.
Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H.
Self-instruct: Aligning language models with self-generated instructions. arXiv preprint
arXiv:2212.10560, 2022.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al.
Chain-of-thought prompting elicits reasoning in large language models. Advances in
Neural Information Processing Systems, 35:24824–24837, 2022.
Weston, J. and Sukhbaatar, S. System 2 attention (is something you might need too). arXiv
preprint arXiv:2311.11829, 2023.
Xiong, W., Liu, J., Molybog, I., Zhang, H., Bhargava, P., Hou, R., Martin, L., Rungta, R.,
Sankararaman, K. A., Oguz, B., et al. Effective long-context scaling of foundation models.
arXiv preprint arXiv:2309.16039, 2023.
Xu, P., Ping, W., Wu, X., McAfee, L., Zhu, C., Liu, Z., Subramanian, S., Bakhturina, E.,
Shoeybi, M., and Catanzaro, B. Retrieval meets long context large language models. arXiv
preprint arXiv:2310.03025, 2023.
Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D.
Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint
arXiv:1809.09600, 2018.
Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., et al. Lima:
Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023.