Assisting in Writing Wikipedia-like Articles From Scratch
with Large Language Models
Yijia Shao
Yucheng Jiang Theodore A. Kanell Peter Xu
Omar Khattab Monica S. Lam
Stanford University
{shaoyj, yuchengj, tkanell, peterxu, okhattab}@stanford.edu
lam@cs.stanford.edu
Abstract
We study how to apply large language models
to write grounded and organized long-form ar-
ticles from scratch, with comparable breadth
and depth to Wikipedia pages. This underex-
plored problem poses new challenges at the
pre-writing stage, including how to research
the topic and prepare an outline prior to writ-
ing. We propose STORM, a writing system
for the Synthesis of Topic Outlines through
Retrieval and Multi-perspective Question Ask-
ing. STORM models the pre-writing stage by
(1) discovering diverse perspectives in research-
ing the given topic, (2) simulating conversa-
tions where writers carrying different perspec-
tives pose questions to a topic expert grounded
on trusted Internet sources, and (3) curating the
collected information to create an outline.
For evaluation, we curate FreshWiki, a dataset
of recent high-quality Wikipedia articles, and
formulate outline assessments to evaluate the
pre-writing stage. We further gather feedback
from experienced Wikipedia editors. Com-
pared to articles generated by an outline-
driven retrieval-augmented baseline, more of
STORM’s articles are deemed to be organized
(by a 25% absolute increase) and broad in cov-
erage (by 10%). The expert feedback also
helps identify new challenges for generating
grounded long articles, such as source bias
transfer and over-association of unrelated facts.
1 Introduction
Large language models (LLMs) have demonstrated
impressive writing capabilities (Yang et al., 2023;
Pavlik, 2023; Wenzlaff and Spaeth, 2022; Fitria,
2023), but it is unclear how we can use them to
write grounded, long-form articles, like full-length
Wikipedia pages. Such expository writing, which
seeks to inform the reader on a topic in an or-
ganized manner (Weaver III and Kintsch, 1991;
Balepur et al., 2023), requires thorough research
and planning in the pre-writing stage (Rohman,
[Figure 1 graphic: the pipeline from a topic (e.g., "2022 Winter Olympics Opening Ceremony") through the pre-writing stage (references and outline) to the full-length article, with example questions for (A) Direct Prompting (basic "when", "where", and "how many" questions), (B) Perspective-Guided Question Asking (an event planner asking about transportation arrangements and the budget), and (C) Conversational Question Asking (follow-up questions about how the order of participating countries is determined).]
Figure 1: We explore writing Wikipedia-like articles
from scratch, which demands a pre-writing stage before
producing the article. In this stage, simpler approaches
like Direct Prompting have limited planning capacity. In
contrast, STORM researches the topic via perspective-
guided question asking in simulated conversations.
1965), even before the actual writing process can
start. However, prior work on generating Wikipedia
articles (Banerjee and Mitra, 2015; Minguillón
et al., 2017; Liu et al., 2018; Fan and Gardent,
2022) has generally bypassed the pre-writing stage:
for instance, Liu et al. (2018) presume reference
documents are provided in advance, while Fan and
Gardent (2022) assume an article outline is avail-
able and focus on expanding each section. These
assumptions do not hold in general, as collecting
references and crafting outlines demand advanced
information literacy skills (Doyle, 1994) to iden-
tify, evaluate, and organize external sources - a task
that is challenging even for experienced writers.
Automating this process can help individuals begin
in-depth learning about a topic and greatly reduce
the expert hours needed for expository writing.
We explore these challenges by focusing on how
to generate Wikipedia-like articles from scratch.
We decompose this problem into two tasks. The
first is to conduct research to generate an outline,
i.e., a list of multi-level sections, and collect a set of
reference documents. The second uses the outline
and the references to produce the full-length arti-
cle. Such a task decomposition mirrors the human
writing process which usually includes phases of
pre-writing, drafting, and revising (Rohman, 1965;
Munoz-Luna, 2015).
As pre-trained language models inherently pos-
sess a wealth of knowledge, a direct approach is to
rely on their parametric knowledge for generating
outlines or even entire articles (Direct Gen). How-
ever, this approach is limited by a lack of details
and hallucinations (Xu et al., 2023), particularly in
addressing long-tail topics (Kandpal et al., 2023).
This underscores the importance of leveraging ex-
ternal sources, and current strategies often involve
retrieval-augmented generation (RAG), which cir-
cles back to the problem of researching the topic in
the pre-writing stage, as much information cannot
be surfaced through simple topic searches.
Human learning theories (Tawfik et al., 2020;
Booth et al., 2003) highlight the role of asking
effective questions in information acquisition. Although
instruction-tuned models (Ouyang et al., 2022) can
be prompted directly to generate questions, we find
that they typically produce basic “What”, “When”,
and “Where” questions (Figure 1 (A)) which often
only address surface-level facts about the topic. To
endow LLMs with the capacity to conduct better
research, we propose the STORM paradigm for
the Synthesis of Topic Outlines through Retrieval
and Multi-perspective Question Asking.
The design of STORM is based on two hypothe-
ses: (1) diverse perspectives lead to varied ques-
tions; (2) formulating in-depth questions requires
iterative research. Building upon these hypotheses,
STORM employs a novel multi-stage approach. It
first discovers diverse perspectives by retrieving
and analyzing Wikipedia articles from similar top-
ics and then personifies the LLM with specific per-
spectives for question asking (Figure 1 (B)). Next,
to elicit follow-up questions for iterative research
(Figure 1 (C)), STORM simulates multi-turn con-
versations where the answers to the generated ques-
tions are grounded on the Internet. Finally, based
on the LLM’s internal knowledge and the collected
information, STORM creates an outline that can
be expanded section by section to develop a full-
length Wikipedia-like article.
We evaluate STORM using our FreshWiki
dataset (§2.1) which curates recent, high-quality
Wikipedia articles to avoid data leakage during pre-
training. 1 To facilitate the study of the pre-writing
stage, we define metrics for evaluating the outline
quality against human-written articles.
We further invited a group of experienced
Wikipedia editors for expert evaluation. The ed-
itors found STORM outperforms an outline-driven
RAG baseline, especially regarding the breadth and
organization of the articles. They also identified
challenges for future research, including address-
ing cases where: (1) the bias on the Internet affects
the generated articles; (2) LLMs fabricate connec-
tions between unrelated facts. These challenges
present new frontiers to grounded writing systems.
Our main contributions include:
• To evaluate the capacity of LLM systems at
generating long-form grounded articles from
scratch, and the pre-writing challenge in par-
ticular, we curate the FreshWiki dataset and
establish evaluation criteria for both outline
and final article quality.
• We propose STORM, a novel system that au-
tomates the pre-writing stage. STORM re-
searches the topic and creates an outline by
using LLMs to ask incisive questions and re-
trieving trusted information from the Internet.
• Both automatic and human evaluation demon-
strate the effectiveness of our approach. Ex-
pert feedback further reveals new challenges
in generating grounded long-form articles.
2 FreshWiki
We study generating Wikipedia-like articles from
scratch, placing emphasis on the pre-writing
stage (Rohman, 1965), which involves the demand-
ing sub-tasks of gathering and curating relevant
information (“research”). This models the human
1 Our resources and code are released at https://github.com/stanford-oval/storm.
                            Domain   Scope          Given Outline?   Given Refs?
Balepur et al. (2023)       One      One para.      /                Yes
Qian et al. (2023)          All      One para.      /                No
Fan and Gardent (2022)      One      Full article   Yes              No
Liu et al. (2018)           All      One para.      /                Yes
Sauper and Barzilay (2009)  Two      Full article   No               No
Ours                        All      Full article   No               No

Table 1: Comparison of different Wikipedia generation setups in existing literature. Generating one paragraph does not need an article outline.
writing approach which has prompted some educa-
tors to view Wikipedia article writing as an educa-
tional exercise for academic training (Tardy, 2010).
Table 1 compares our work against prior bench-
marks for Wikipedia generation. Existing work
has generally focused on evaluating the generation
of shorter snippets (e.g., one paragraph), within a
narrower scope (e.g., a specific domain or two), or
when an explicit outline or reference documents
are supplied. A notable example is WikiSum (Liu
et al., 2018), which treats generating Wikipedia ar-
ticles as a multi-document summarization problem,
with respect to the reference documents.
Our setup emphasizes the capability of long-
form grounded writing systems to research and
curate content. Specifically, given a topic t, the
task is to find a set of references R and generate
a full-length article S = s_1 s_2 ... s_n, where each
sentence s_i cites a list of documents in R. 2
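To make this setup concrete, the following is a minimal sketch of the task's input/output structure; the class and field names are illustrative assumptions rather than part of STORM's released code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Reference:
    """One trusted source in the reference set R."""
    url: str
    text: str = ""

@dataclass
class CitedSentence:
    """One sentence s_i of the article, citing documents in R by index."""
    text: str
    citations: List[int] = field(default_factory=list)

@dataclass
class ArticleTask:
    """The task interface: a topic t, the collected references R, and the article S."""
    topic: str
    references: List[Reference] = field(default_factory=list)
    sentences: List[CitedSentence] = field(default_factory=list)
```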
2.1 The FreshWiki Dataset
Creating a new Wikipedia-like article demands not
only fluent writing but also good research skills. As
modern LLMs are generally trained on Wikipedia
text, we mitigate data leakage by explicitly seeking
out recent Wikipedia articles that were created (or
very heavily edited) after the training cutoff of the
LLMs we test. Our process can be repeated at
future dates when new LLMs emerge.
To apply our date criteria, we focus on the top
100 most-edited pages, based on edit counts, for
each month from February 2022 to September
2023 3 . To ensure high-quality references, we filter
these articles to keep only those having B-class
quality or above assessed by ORES 4 . We also ex-
2 In practice, S also includes organizational elements such as section and subsection titles, which do not require citations.
3 Obtained from https://wikimedia.org/api/rest_v1/metrics/edited-pages/top-by-edits/en.wikipedia/all-editor-types/content/{year}/{month}/all-days
4 https://www.mediawiki.org/wiki/ORES
clude list articles 5 and articles that have no sub-
sections. While high-quality Wikipedia articles
usually contain structured data (e.g., tables) and are
multi-modal, we only consider the plain text com-
ponent in constructing the dataset to simplify our
task. More details of the dataset are in Appendix A.
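As a concrete illustration, the sketch below queries the Wikimedia endpoint from footnote 3 for a month's most-edited pages; the response field names (e.g., page_title, edit_count) are assumptions about the API payload rather than details given in the paper.

```python
import requests

def top_edited_pages(year: int, month: int):
    """Fetch the monthly ranking of most-edited English Wikipedia pages."""
    url = ("https://wikimedia.org/api/rest_v1/metrics/edited-pages/top-by-edits/"
           f"en.wikipedia/all-editor-types/content/{year}/{month:02d}/all-days")
    data = requests.get(url, headers={"User-Agent": "freshwiki-sketch"}, timeout=30).json()
    # Navigate defensively: the ranking is assumed to be nested under items -> results -> top.
    top = data.get("items", [{}])[0].get("results", [{}])[0].get("top", [])
    return [(entry.get("page_title"), entry.get("edit_count")) for entry in top]

# Example: the most-edited pages of September 2023 (the last month in our range).
print(top_edited_pages(2023, 9)[:5])
```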
2.2 Outline Creation and Evaluation
A full-length article is hard to generate or evalu-
ate (Xu et al., 2023; Krishna et al., 2023). When
human educators teach students academic writing,
they sometimes supervise students at the outline
stage (Eriksson and Mäkitalo, 2015) because an
extensive outline indicates a comprehensive under-
standing of the topic and provides a solid founda-
tion for writing the full-length article (Dietz and
Foley, 2019). Inspired by this, we decompose the
generation of S into two stages. In the pre-writing
stage, we require the system to create an outline
O, which is defined as a list of multi-level section
headings 6 . In the writing stage, the system uses
the topic t, the references R, and an outline O to
produce the full-length article S.
To evaluate the outline coverage, we introduce
two metrics: heading soft recall and heading en-
tity recall. These metrics compare the multi-level
section headings of the human-written article, con-
sidered as ground truth, and those in O. Recog-
nizing that an exact match between elements in
these two sets of headings is unnecessary, we cal-
culate the heading soft recall (Fränti and Mariescu-
Istodor, 2023) using cosine similarity derived from
Sentence-BERT (Reimers and Gurevych, 2019) em-
beddings of the headings (details in Appendix C.1).
We also compute the heading entity recall which
is quantified as the percentage of named entities in
human-written article headings covered by O. We
extract entities with FLAIR named entity recogni-
tion (NER) (Akbik et al., 2019).
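A minimal sketch of the heading entity recall computation, assuming the FLAIR "ner" tagger named above; the soft-recall aggregation itself is specified in Appendix C.1.

```python
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("ner")  # FLAIR English 4-class NER model

def heading_entities(headings):
    """Collect the named-entity surface forms appearing in a list of headings."""
    entities = set()
    for heading in headings:
        sentence = Sentence(heading)
        tagger.predict(sentence)
        entities.update(span.text.lower() for span in sentence.get_spans("ner"))
    return entities

def heading_entity_recall(human_headings, outline_headings):
    """Percentage of entities in the human-written headings covered by the outline O."""
    gold = heading_entities(human_headings)
    pred = heading_entities(outline_headings)
    return 100.0 if not gold else 100.0 * len(gold & pred) / len(gold)
```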
3 Method
We present STORM to automate the pre-writing
stage by researching a given topic via effective
question asking (§3.1, §3.2) and creating an out-
line (§3.3). The outline will be extended to a full-
length article grounded on the collected references
5 https://en.wikipedia.org/wiki/Wikipedia:Stand-alone_lists
6 Since language models process and produce sequences, we can linearize O by adding "#" to indicate section titles, "##" to indicate subsection titles, etc.
[Figure 2 graphic: given a topic t, STORM (1) surveys related articles and (2) identifies a set of perspectives P; a perspective-guided Wikipedia-writer role and an expert role then (3) read and ask questions, (4) split each question q into search queries, (5) search and sift trusted sources, and (6) synthesize an answer a, producing conversations {C_1, ..., C_N} and references R; finally the LLM (7) directly generates a draft outline O_D and (8) refines it into the outline O.]

Figure 2: The overview of STORM that automates the pre-writing stage. Starting with a given topic, STORM identifies various perspectives on covering the topic by surveying related Wikipedia articles (steps 1-2). It then simulates conversations between a Wikipedia writer who asks questions guided by the given perspective and an expert grounded on trustworthy online sources (steps 3-6). The final outline is curated based on the LLM's intrinsic knowledge and the gathered conversations from different perspectives (steps 7-8).
(§3.4). Figure 2 gives an overview of STORM and
we include the pseudo code in Appendix B.
3.1 Perspective-Guided Question Asking
Rohman (1965) defines pre-writing as the stage
of discovery in the writing process. In parallel
with stakeholder theory in business (Freeman et al.,
2010), where diverse stakeholders prioritize vary-
ing facets of a company, individuals with distinct
perspectives may concentrate on different aspects
when researching the same topic and discover mul-
tifaceted information. Further, the specific perspec-
tives can serve as prior knowledge, guiding individ-
uals to ask more in-depth questions. For example,
an event planner might ask about the “transporta-
tion arrangements” and “budget” for “the 2022
Winter Olympics opening ceremony”, whereas a
layperson might ask more general questions about
the event’s basic information (Figure 1 (A)).
Given the input topic t, STORM discovers differ-
ent perspectives by surveying existing articles from
similar topics and uses these perspectives to control
the question asking process. Specifically, STORM
prompts an LLM to generate a list of related top-
ics and subsequently extracts the tables of contents
from their corresponding Wikipedia articles, if such
articles can be obtained through the Wikipedia API 7
(Figure 2, step 1). These tables of contents are con-
catenated to create a context to prompt the LLM
to identify N perspectives P = {p_1, ..., p_N} that
7 https://pypi.org/project/Wikipedia-API/
can collectively contribute to a comprehensive ar-
ticle on t (Figure 2, step 2). To ensure that the basic
information about t is also covered, we add p_0 as
“basic fact writer focusing on broadly covering the
basic facts about the topic” into P. Each perspec-
tive p ∈ P will be utilized to guide the LLM in the
process of question asking in parallel.
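The following sketch illustrates this perspective-discovery step using the Wikipedia-API package from footnote 7; the ask_llm helper and the prompt wording are placeholders, not STORM's exact prompts.

```python
import wikipediaapi

def table_of_contents(title, wiki):
    """Return the nested section headings of a Wikipedia page, if it exists."""
    page = wiki.page(title)
    if not page.exists():
        return None

    def collect(sections, depth=1):
        lines = []
        for sec in sections:
            lines.append("#" * depth + " " + sec.title)
            lines.extend(collect(sec.sections, depth + 1))
        return lines

    return "\n".join(collect(page.sections))

def discover_perspectives(topic, ask_llm, n_perspectives=5):
    # Recent versions of the wikipediaapi package require a user-agent string.
    wiki = wikipediaapi.Wikipedia(user_agent="storm-sketch", language="en")
    related = ask_llm(f"List Wikipedia topics closely related to: {topic}").splitlines()
    tocs = [toc for t in related if (toc := table_of_contents(t.strip(), wiki))]
    context = "\n\n".join(tocs)
    perspectives = ask_llm(
        f"Given these tables of contents:\n{context}\n"
        f"Identify {n_perspectives} perspectives that together could cover '{topic}'."
    ).splitlines()
    # p_0: always include a basic-fact writer (Section 3.1).
    return ["basic fact writer focusing on broadly covering the basic facts "
            "about the topic"] + perspectives
```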
3.2 Simulating Conversations
The theory of questions and question asking (Ram,
1991) highlights that while answers to existing
questions contribute to a more comprehensive
understanding of a topic, they often simultane-
ously give rise to new questions. To kick off this
dynamic process, STORM simulates a conversa-
tion between a Wikipedia writer and a topic ex-
pert. In the i-th round of the conversation, the
LLM-powered Wikipedia writer generates a sin-
gle question q_i based on the topic t, its assigned
perspective p ∈ P, and the conversation history
{q_1, a_1, ..., q_{i-1}, a_{i-1}}, where a_j denotes the sim-
ulated expert’s answer. The conversation history
enables the LLM to update its understanding of the
topic and ask follow-up questions. In practice, we
limit the conversation to at most M rounds.
To ensure that the conversation history provides
factual information, we use trusted sources from
the Internet to ground the answer a_i to each query
q_i. Since q_i can be complicated, we first prompt
the LLM to break down q_i into a set of search
queries (Figure 2, step 4), and the search results will
be evaluated using a rule-based filter according to
the Wikipedia guideline 8 to exclude untrustworthy
sources (Figure 2, step 5). Finally, the LLM synthe-
sizes the trustworthy sources to generate the answer
a_i, and these sources will also be added to R for
full article generation (§3.4).
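A minimal sketch of the simulated conversation loop; ask_question, decompose_to_queries, search_and_filter, and synthesize_answer stand in for STORM's LLM prompts, search calls, and rule-based source filter.

```python
def simulate_conversation(topic, perspective, ask_question, decompose_to_queries,
                          search_and_filter, synthesize_answer, max_rounds=5):
    """Run one Wikipedia-writer / expert conversation for a single perspective."""
    history = []      # [(q_1, a_1), ..., (q_{i-1}, a_{i-1})]
    references = []   # trusted sources added to R for full-article generation
    for _ in range(max_rounds):
        question = ask_question(topic, perspective, history)
        if not question:
            break
        queries = decompose_to_queries(question)        # Figure 2, step 4
        sources = search_and_filter(queries)            # step 5: drop untrusted URLs
        answer = synthesize_answer(question, sources)   # step 6
        references.extend(sources)
        history.append((question, answer))
    return history, references
```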
3.3 Creating the Article Outline
After thoroughly researching the topic through
N + 1 simulated conversations, denoted as
{C_0, C_1, ..., C_N}, STORM creates an outline before
the actual writing starts. To fully leverage the inter-
nal knowledge of LLMs, we first prompt the model
to generate a draft outline O_D given only the topic
t (Figure 2, step 7). O_D typically provides a general
but organized framework. Subsequently, the LLM
is prompted with the topic t, the draft outline O_D,
and the simulated conversations {C_0, C_1, ..., C_N}
to refine the outline (Figure 2, step 8). This results
in an improved outline O which will be used for
producing the full-length article.
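A minimal sketch of this draft-then-refine procedure; ask_llm is a placeholder LLM call and the prompt wording is illustrative only.

```python
def create_outline(topic, conversations, ask_llm):
    # Step 7: draft outline from the LLM's parametric knowledge alone.
    draft = ask_llm(f"Write an outline (use '#' for sections, '##' for subsections) "
                    f"for a Wikipedia page about: {topic}")
    # Step 8: refine the draft using the gathered conversations {C_0, ..., C_N}.
    dialogue_text = "\n\n".join(
        "\n".join(f"Q: {q}\nA: {a}" for q, a in conv) for conv in conversations
    )
    refined = ask_llm(f"Topic: {topic}\nDraft outline:\n{draft}\n\n"
                      f"Information gathered from research:\n{dialogue_text}\n\n"
                      f"Improve the outline so it covers the gathered information.")
    return refined
```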
3.4 Writing the Full-Length Article
Building upon the references R collected and the
outline O developed during the pre-writing stage,
the full-length article can be composed section by
section. Since it is usually impossible to fit the
entire R within the context window of the LLM,
we use the section title and the headings of all its
subsections to retrieve relevant documents from
R based on semantic similarity calculated from
Sentence-BERT embeddings. With the relevant in-
formation at hand, the LLM is then prompted to
generate the section with citations. Once all sec-
tions are generated, they are concatenated to form
the full-length article. Since the sections are gen-
erated in parallel, we prompt the LLM with the
concatenated article to delete repeated information
to improve coherence. Furthermore, in alignment
with Wikipedia’s stylistic norms, the LLM is also
utilized to synthesize a summary of the entire arti-
cle, forming the lead section at the beginning.
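A minimal sketch of the per-section retrieval and writing loop, assuming Sentence-BERT embeddings via the sentence-transformers library; write_section stands in for the prompted LLM call that generates a cited section.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-MiniLM-L6-v2")

def retrieve_for_section(section_headings, reference_texts, top_k=10):
    """Rank documents in R by cosine similarity to the section's headings."""
    query = " ; ".join(section_headings)   # section title plus all sub-headings
    query_emb = encoder.encode(query, convert_to_tensor=True)
    doc_embs = encoder.encode(reference_texts, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_embs)[0]
    ranked = scores.argsort(descending=True)[:top_k]
    return [reference_texts[int(i)] for i in ranked]

def write_article(outline_sections, reference_texts, write_section):
    """Generate each section with its retrieved documents, then concatenate."""
    sections = []
    for headings in outline_sections:      # each item: [title, sub-heading, ...]
        docs = retrieve_for_section(headings, reference_texts)
        sections.append(write_section(headings, docs))   # generate with citations
    return "\n\n".join(sections)
```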
4 Experiments

4.1 Article Selection

STORM is capable of researching complicated top-
ics and writing long articles from detailed outlines.
However, in this controlled experiment, we limit
the final output to at most 4000 tokens (roughly
3000 words). For a meaningful comparison, we
randomly select 100 samples from the FreshWiki
dataset (see §2.1) that have human-written articles
not exceeding 3000 words.
4.2 Automatic Metrics
As discussed in §2.2, we evaluate the outline qual-
ity to assess the pre-writing stage by calculating
the heading soft recall and heading entity recall. A
higher recall score signifies a more comprehensive
outline relative to the human-written article.
To assess the full-length article quality, we adopt
ROUGE scores (Lin, 2004) and compute the entity
recall at the article level based on FLAIR NER
results. Moreover, based on Wikipedia criteria 9 ,
we evaluate the article from the aspects of (1) In-
terest Level, (2) Coherence and Organization, (3)
Relevance and Focus, (4) Coverage, and (5) Verifia-
bility. For aspects (1)-(4), we use Prometheus (Kim
et al., 2023), a 13B evaluator LLM, to score the arti-
cle based on a 5-point rubric collaboratively devel-
oped with two experienced Wikipedia editors (see
Appendix C.2). For verifiability, we calculate the
citation recall and citation precision based on the
definition in Gao et al. (2023). We use Mistral 7B-
Instruct (Jiang et al., 2023a) to examine whether
the cited passages entail the generated sentence.
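A simplified sketch of this citation check; it counts a citation as precise only if it alone entails the sentence, whereas Gao et al. (2023) give the full definitions, and entails() stands in for the Mistral 7B-Instruct judgment.

```python
def citation_quality(article_sentences, entails):
    """article_sentences: list of (sentence, [cited_passage, ...]) pairs;
    entails(premise, claim) -> bool is the NLI-style judge (here, Mistral 7B-Instruct)."""
    cited = [(s, ps) for s, ps in article_sentences if ps]
    # Recall: the concatenation of all cited passages should entail the sentence.
    recall_hits = sum(1 for s, ps in cited if entails(" ".join(ps), s))
    # Precision (simplified): each individual citation should entail the sentence.
    precision_hits, total = 0, 0
    for s, ps in cited:
        for p in ps:
            total += 1
            precision_hits += int(entails(p, s))
    recall = 100.0 * recall_hits / len(cited) if cited else 0.0
    precision = 100.0 * precision_hits / total if total else 0.0
    return recall, precision
```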
4.3 Baselines
As prior works use different setups and do not use
LLMs, they are hard to compare directly. Instead,
we use the following three LLM-based baselines.
1. Direct Gen, a baseline that directly prompts
the LLM to generate an outline, which is then
used to generate the full-length article.
2. RAG, a retrieval-augmented generation base-
line that searches with the topic and uses the
searched results together with the topic t to
generate an outline or the entire article.
3. Outline-driven RAG (oRAG), which is iden-
tical to RAG in outline creation, but further
searches additional information with section
titles to generate the article section by section.
4.4 STORM Implementation

We build STORM with zero-shot prompting us-
ing the DSPy framework (Khattab et al., 2023).
Appendix B includes the pseudo code and corre-
sponding prompts. The hyperparameters N and M
8 https://en.wikipedia.org/wiki/Wikipedia:Reliable_sources
9 https://en.wikipedia.org/wiki/Wikipedia:Good_article_criteria
                       Comparison with Human-written Articles      Rubric Grading
                       ROUGE-1   ROUGE-L   Entity Recall   Interest Level   Organization   Relevance   Coverage
Direct Gen             25.62     12.63     5.08            2.87             4.60           3.10        4.16
RAG                    28.52     13.18     7.57            3.14             4.22           3.05        4.08
oRAG                   44.26     16.51     12.57           3.90             4.79           4.09        4.70
STORM                  45.82     16.70     14.10†          3.99†            4.82           4.45†       4.88†
  w/o Outline Stage    26.77     12.77     7.39            3.33             4.87           3.35        4.37

Table 2: Results of automatic article quality evaluation. † denotes significant differences (p < 0.05) from a paired
t-test between STORM and the best baseline, i.e., oRAG. The rubric grading uses a 1-5 scale.
                               Heading Soft Recall   Heading Entity Recall
GPT-3.5   Direct Gen           80.23                 32.39
          RAG/oRAG             73.59                 33.85
          RAG-expand           74.40                 33.85
          STORM                86.26†                40.52†
            w/o Perspective    84.49                 40.12
            w/o Conversation   77.97                 31.98
GPT-4     Direct Gen           87.66                 34.78
          RAG/oRAG             89.55                 42.38
          RAG-expand           91.36                 43.53
          STORM                92.73†                45.91
            w/o Perspective    92.39                 42.70
            w/o Conversation   88.75                 39.30

Table 3: Results of outline quality evaluation (%). † denotes significant differences (p < 0.05) from a paired
t-test between STORM and baselines.
in STORM are both set as 5. We use the chat
model gpt-3.5-turbo for question asking and
use gpt-3.5-turbo-instruct for other parts of
STORM. We also experiment with using gpt-4 for
drafting and refining the outline (Figure 2, steps 7-8).
For reported results, the simulated topic expert in
STORM is grounded on the You.com search API 10 ,
although the proposed pipeline is compatible with
other search engines. The ground truth Wikipedia
article is excluded from the search results.
For final article generation, we only report the
results using gpt-4 as gpt-3.5 is not faithful to
sources when generating text with citations (Gao
et al., 2023). We set temperature as 1.0 and top_p
as 0.9 for all experiments.
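As an illustration of the zero-shot DSPy setup, the sketch below defines one signature and runs it with dspy.Predict; the field names and wording are illustrative, while the actual prompts are listed in Appendix B.

```python
import dspy

# Configure the LM backend. STORM uses gpt-3.5-turbo for question asking and
# gpt-3.5-turbo-instruct elsewhere; this client configuration is an assumption
# for illustration, not the released setup.
lm = dspy.OpenAI(model="gpt-3.5-turbo", model_type="chat",
                 max_tokens=500, temperature=1.0, top_p=0.9)
dspy.settings.configure(lm=lm)

class AskQuestion(dspy.Signature):
    """Ask one informative question about the topic from the assigned perspective."""
    topic = dspy.InputField()
    perspective = dspy.InputField()
    conversation = dspy.InputField(desc="the dialogue history so far")
    question = dspy.OutputField()

ask = dspy.Predict(AskQuestion)
pred = ask(topic="2022 Winter Olympics opening ceremony",
           perspective="event planner focused on ceremony logistics",
           conversation="(no prior turns)")
print(pred.question)
```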
5 Results and Analysis

5.1 Main Results
We use outline coverage as a proxy to assess the pre-
writing stage (see §2.2). Table 3 shows the heading
soft recall and entity recall. Outlines directly gen-
erated by LLMs (Direct Gen) already demonstrate
10 https://documentation.you.com/api-reference/search
high heading soft recall, indicating LLMs’ ability
to grasp high-level aspects of a topic through their
rich parametric knowledge. However, STORM, by
asking effective questions to research the topic, can
create higher recall outlines that cover more topic-
specific aspects. Notably, although RAG leverages
additional information, presenting unorganized in-
formation in the context window makes outline
generation more challenging for the weaker model,
i.e., GPT-3.5, leading to worse performance. To test
the limit of the RAG baseline, we further expand
the retrieved sources by starting with the outline
produced by RAG, using its section titles as search
queries to collect more sources, and inputting the
newly collected sources together with the initial
outline to the LLM to generate a polished outline. This
modified approach is referred to as “RAG-expand”
in Table 3. The experiment results indicate that
even though having an additional round of search
and refinement can improve the outline produced
by RAG, our proposed STORM still surpasses its
performance.
We further evaluate the full-length article quality.
As shown in Table 2, oRAG significantly outper-
forms RAG, highlighting the effectiveness of using
outlines for structuring full-length article genera-
tion. Despite this method’s advantages in leverag-
ing retrieval and outlining, our approach still out-
performs it. The effective question asking mecha-
nism enhances the articles with greater entity recall.
The evaluator LLM also rates these articles with sig-
nificantly higher scores in the aspects of “Interest
Level”, “Relevance and Focus”, and “Coverage”.
Nonetheless, we acknowledge the possibility of
the evaluator LLM overrating machine-generated
text. Our careful human evaluation (§6) reveals
that STORM still has much room for improvement.
Although this work primarily focuses on the pre-
writing stage and does not optimize generating text
with citations, we still examine the citation quality
of articles produced by our approach. As reported
         Citation Recall   Citation Precision
STORM    84.83             85.18

Table 4: Citation quality judged by Mistral 7B-Instruct.

       STORM   w/o Perspective   w/o Conversation
|R|    99.83   54.36             39.56

Table 5: Average number of unique references (|R|) collected using different methods.
in Table 4, Mistral 7B-Instruct judges that 84.83% of
the sentences are supported by their citations. Ap-
pendix C.3 investigates the unsupported sentences
and reveals that the primary issues stem from draw-
ing improper inferences and inaccurate paraphras-
ing, rather than hallucinating non-existent contents.
5.2 Ablation Studies
As introduced in §3, STORM prompts LLMs to
ask effective questions by discovering specific
perspectives and simulating multi-turn conversa-
tions. We conduct the ablation study on outline
creation by comparing STORM with two variants:
(1) “STORM w/o Perspective”, which omits per-
spective in the question generation prompt; (2)
“STORM w/o Conversation”, which prompts LLMs
to generate a set number of questions altogether. To
ensure a fair comparison, we keep the total number
of generated questions equal across all variants.
Table 3 shows the ablation results; the full STORM
pipeline produces outlines with the highest recall.
Also, “STORM w/o Conversation” gives much
worse results, indicating that reading relevant informa-
tion is crucial to generating effective questions. We
further examine how many unique sources are col-
lected in R via different variants. As shown in Ta-
ble 5, the full pipeline discovers more distinct
sources, and the trend accords with the automatic
metrics for outline quality.
We also verify whether having an outline stage
is necessary with STORM. In Table 2, “STORM
w/o Outline Stage” denotes the results of generat-
ing the entire article given the topic and the sim-
ulated conversations. Removing the outline stage
significantly deteriorates the performance across
all metrics.
6 Human Evaluation
To better understand the strengths and weaknesses
of STORM, we conduct human evaluation by col-
laborating with 10 experienced Wikipedia editors
                   oRAG                  STORM
                   Avg.   ≥ 4 Rates      Avg.   ≥ 4 Rates    p-value
Interest Level     3.63   57.5%          4.03   70.0%        0.077
Organization       3.25   45.0%          4.00   70.0%        0.005
Relevance          3.93   62.5%          4.15   65.0%        0.347
Coverage           3.58   57.5%          4.00   67.5%        0.084
Verifiability      3.85   67.5%          3.80   67.5%        0.843
#Preferred         14                    26

Table 6: Human evaluation results on 20 pairs of articles generated by STORM and oRAG. Each pair of articles
is evaluated by two Wikipedia editors. The ratings are given on a scale between 1 and 7, with values ≥ 4
indicating good quality (see Table 10). We conduct paired t-test and report the p-value.
who have made at least 500 edits on Wikipedia and
have more than 1 year of experience. We randomly
sample 20 topics from our dataset and evaluate the
articles generated by our method and oRAG, the
best baseline according to the automatic evaluation.
Each pair of articles is assigned to 2 editors.
We request editors to judge each article from the
same five aspects defined in §4.2, but using a 1 to
7 scale for more fine-grained evaluation. While
our automatic evaluation uses citation quality as
a proxy to evaluate Verifiability, we stick to the
Wikipedia standard of “verifiable with no original
research” in human evaluation. Besides rating the
articles, editors are asked to provide open-ended
feedback and pairwise preference. After the evalua-
tion finishes, they are further requested to compare
an article produced by our method, which they have
just reviewed, with its human-written counterpart,
and report their perceived usefulness of STORM
using a 1-5 Likert scale. More human evaluation de-
tails are included in Appendix D. Table 6 presents
the rating and pairwise comparison results. 11
Articles produced by STORM exhibit greater
breadth and depth than oRAG outputs. In ac-
cord with the finding in §5.1, editors judge articles
produced by STORM as more interesting, orga-
nized, and having broader coverage compared to
oRAG outputs. Specifically, 25% more articles pro-
duced by STORM are considered organized (Orga-
nization rating ≥ 4), and 10% more are deemed to
have good coverage (Coverage rating ≥ 4). Even
in comparison with human-written articles, one
editor praises our result as providing “a bit more
11 For the 1-7 scale rating results on each criterion, we calculate Krippendorff's alpha to measure the inter-annotator agreement (IAA), and the results are as follows: Interest Level (0.349), Organization (0.221), Relevance (0.256), Coverage (0.346), Verifiability (0.388).
[Figure 3 graphic: stacked bars of responses (Strongly Disagree, Somewhat Disagree, Neutral, Somewhat Agree, Strongly Agree) for three statements: "I think it can be specifically helpful for my pre-writing stage.", "I think it will help me edit a Wikipedia article for a new topic.", and "I think it can be a potentially useful tool for the Wikipedia community."]

Figure 3: Survey results of the perceived usefulness of STORM (n = 10).
background information” and another notes that “I
found that the AI articles had more depth compared
to the Wikipedia articles”. STORM also outper-
forms the best baseline in pairwise comparison.
More information in |R| poses challenges be-
yond factual hallucination. We examine 14 pair-
wise comparison responses where editors prefer
oRAG outputs over STORM. Excluding 3 cases
where pairwise preferences do not align with their
ratings, editors assign lower Verifiability scores to
articles from our approach in over 50% of the cases.
Through analyzing the articles and editors’ free-
form feedback, we discover that low Verifiability
scores stem from red herring fallacy or overspec-
ulation issues. These arise when the generated
articles introduce unverifiable connections between
different pieces of information in |R| or between
the information and the topic (examples included
in Table 11). Compared to the widely discussed
factual hallucination (Shuster et al., 2021; Huang
et al., 2023), addressing such verifiability issues is
more nuanced, surpassing basic fact-checking (Min
et al., 2023).
Generated articles trail behind well-revised hu-
man works. While STORM outperforms the
oRAG baseline, editors comment that the generated
articles are less informative than actual Wikipedia
pages. Another major issue identified is the trans-
fer of bias and tone from Internet sources to the
generated article, with 7 out of 10 editors men-
tioning that the STORM-generated articles sound
“emotional” or “unneutral”. More analysis is dis-
cussed in Appendix E. This feedback suggests that
reducing the retrieval bias in the pre-writing stage
is a worthwhile direction for future work.
Generated articles are a good starting point. As
shown in Figure 3, editors are unanimous in agree-
ing that STORM can aid them in their pre-writing
stage. It is gratifying to know that the tool is help-
ful to experienced editors. 80% of the editors think
that STORM can help them edit a Wikipedia article
for a new topic. More reservations are expressed about
the usefulness of STORM for the Wikipedia com-
munity at large; nonetheless, 70% of the editors
think it is useful, with only 10% disagreeing.
7 Related Works
Retrieval-Augmented Generation (RAG) Aug-
menting language models (LMs) with retrieval at
inference time is a typical way to leverage exter-
nal knowledge stores (Ram et al., 2023; Izacard
et al., 2023). While some works use retrieval
to construct demonstrations for in-context learn-
ing (Li et al., 2023; Liu et al., 2022; Agrawal et al.,
2023; Poesia et al., 2022; Shi et al., 2022; Khattab
et al., 2022), another line of work uses retrieval to
provide additional information for LMs to ground
on. Lewis et al. (2020) study RAG on knowledge-
intensive NLP tasks and find it improves diver-
sity and factuality. Semnani et al. (2023) design
a RAG-based chatbot grounded on English
Wikipedia to stop LLM-based chatbots from hal-
lucinating. Besides, RAG can be used to generate
text with citations (Menick et al., 2022; Gao et al.,
2023) and build attributed question answering sys-
tems (Bohnet et al., 2023). While RAG is widely
studied in question answering, how to use it for
long-form article generation is less investigated.
As a general framework, RAG is flexible in both
the retrieval source and time. The retrieval sources
can vary from domain databases (Zakka et al.,
2023), code documentation (Zhou et al., 2023),
to the whole Internet (Nakano et al., 2022; Komeili
et al., 2022). Regarding the time, besides a one-
time retrieval before generation, the system can be
designed to self-decide when to retrieve across the
course of the generation (Jiang et al., 2023b; Parisi
et al., 2022; Shuster et al., 2022; Yao et al., 2023).
Automatic Expository Writing Different from
other types of long-form generation (Yang et al.,
2022; Feng et al., 2018), automatic expository writ-
ing requires grounding on external documents and
leveraging the interplay between reading and writ-
ing. Balepur et al. (2023) propose the Imitate-
Retrieve-Paraphrase framework for expository writ-
ing at the paragraph level to address the challenges
in synthesizing information from multiple sources.
Beyond summarizing sources, Shen et al. (2023)
highlight that expository writing requires the au-
thor’s sensemaking process over source documents
and good outline planning. We tackle these chal-
lenges by focusing on the pre-writing stage.
Question Asking in NLP Question asking capa-
bilities in NLP systems have expanded across sev-
eral fronts, including generating clarification ques-
tions to understand user intents (Aliannejadi et al.,
2019; Rahmani et al., 2023), and breaking large
questions into smaller ones to improve composi-
tional reasoning (Press et al., 2023). While humans
usually ask questions to learn new knowledge (Taw-
fik et al., 2020; Booth et al., 2003), how to opti-
mize question informativeness and specificity in
information-seeking conversations remains less ex-
plored. The closest work is Qi et al. (2020), which
defines question informativeness using a unigram
precision function and uses reinforcement learning
to increase it.
8 Conclusion
We propose STORM, an LLM-based writing sys-
tem that automates the pre-writing stage for creat-
ing Wikipedia-like articles from scratch. We cu-
rate the FreshWiki dataset and establish evaluation
criteria to study the generation of grounded long-
form articles. Experimental results demonstrate
that the question asking mechanism in STORM
improves both the outline and article quality. With
the improved breadth and depth, STORM helps
surface new challenges for grounded writing sys-
tems through expert evaluation. The experienced
Wikipedia editors in our study unanimously agree
that STORM is helpful for their pre-writing stage.
Limitations
In this work, we explore generating Wikipedia-
like articles from scratch as a way to push the
frontier of automatic expository writing and long-
form article generation. While our approach sig-
nificantly outperforms baseline methods in both
automatic and human evaluations, the quality of
machine-written articles still lags behind well-
revised human-authored articles, specifically in
aspects of neutrality and verifiability. Although
STORM discovers different perspectives in re-
searching the given topic, the collected information
may still be biased towards dominant sources on
the Internet and may contain promotional content.
Moreover, the verifiability issues identified in this
work go beyond factual hallucination, which high-
lights new challenges to grounded writing systems.
Another limitation of this work is that although
we focus on the task of generating Wikipedia-like
articles from scratch, our task setup is still simpli-
fied to only consider the generation of free-form
text. Human-authored high-quality Wikipedia ar-
ticles usually contain structured data and multi-
modal information. We leave the exploration of
generating multi-modal grounded articles for fu-
ture work.
Acknowledgements
We thank You.com for generously providing the
search API that supported our experiments. We
also thank Sina J. Semnani, Shicheng Liu, Eric Ze-
likman for providing helpful feedback and the ACL
ARR reviewers for their valuable comments. This
work is supported in part by the Verdant Founda-
tion and Microsoft Azure AI credits. Yijia Shao
is supported by a Stanford School of Engineering
Fellowship.
Ethics Statement
Different from creative generation, grounded ar-
ticle generation may impact how people learn about
topics or consume source information. All the stud-
ies and the evaluation in this work are designed
to prevent the dissemination of misinformation by
not publishing generated content online and im-
plementing strict accuracy checks. We avoid any
disruption to Wikipedia or related communities, as
our system does not interact with live pages. Also,
although we try to generate grounded articles, we
believe there is no privacy issue related to this work
as we only use information publicly available on
the Internet.
The primary risk of our work is that the
Wikipedia articles written by our system are
grounded on information on the Internet which
contains some biased or discriminatory content on
its own. Currently, our system relies on the search
engine to retrieve information but does not include
any post-processing module. We believe improv-
ing the retrieval module to have good coverage of
different viewpoints and adding a content sifting
module to the current system will be a critical next
step to achieve better neutrality and balance in the
generated articles.
Another limitation we see from an ethical point
of view is that we only consider writing English
Wikipedia articles in this work. Extending the cur-
rent system to a multilingual setup is a meaningful
direction for future work, as many topics do not have
Wikipedia pages in non-English languages.
References
Sweta Agrawal, Chunting Zhou, Mike Lewis, Luke
Zettlemoyer, and Marjan Ghazvininejad. 2023. In-
context examples selection for machine translation.
In Findings of the Association for Computational
Linguistics: ACL 2023, pages 8857–8873, Toronto,
Canada. Association for Computational Linguistics.
Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif
Rasul, Stefan Schweter, and Roland Vollgraf. 2019.
FLAIR: An easy-to-use framework for state-of-the-
art NLP. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for
Computational Linguistics (Demonstrations), pages
54–59, Minneapolis, Minnesota. Association for
Computational Linguistics.
Mohammad Aliannejadi, Hamed Zamani, Fabio
Crestani, and W Bruce Croft. 2019. Asking clari-
fying questions in open-domain information-seeking
conversations. In Proceedings of the 42nd interna-
tional acm sigir conference on research and develop-
ment in information retrieval, pages 475–484.
Nishant Balepur, Jie Huang, and Kevin Chang. 2023.
Expository text generation: Imitate, retrieve, para-
phrase. In Proceedings of the 2023 Conference on
Empirical Methods in Natural Language Process-
ing, pages 11896–11919, Singapore. Association for
Computational Linguistics.
Siddhartha Banerjee and Prasenjit Mitra. 2015.
WikiKreator: Improving Wikipedia stubs automat-
ically. In Proceedings of the 53rd Annual Meet-
ing of the Association for Computational Linguis-
tics and the 7th International Joint Conference on
Natural Language Processing (Volume 1: Long Pa-
pers), pages 867–877, Beijing, China. Association
for Computational Linguistics.
Bernd Bohnet, Vinh Q. Tran, Pat Verga, Roee Aha-
roni, Daniel Andor, Livio Baldini Soares, Massimil-
iano Ciaramita, Jacob Eisenstein, Kuzman Ganchev,
Jonathan Herzig, Kai Hui, Tom Kwiatkowski, Ji Ma,
Jianmo Ni, Lierni Sestorain Saralegui, Tal Schus-
ter, William W. Cohen, Michael Collins, Dipanjan
Das, Donald Metzler, Slav Petrov, and Kellie Webster.
2023. Attributed question answering: Evaluation and
modeling for attributed large language models.
Wayne C Booth, Gregory G Colomb, and Joseph M
Williams. 2003. The craft of research. University of
Chicago press.
Laura Dietz and John Foley. 2019. Trec car y3: Com-
plex answer retrieval overview. In Proceedings of
Text REtrieval Conference (TREC).
Christina S Doyle. 1994. Information literacy in an
information society: A concept for the information
age. Diane Publishing.
Ann-Marie Eriksson and Åsa Mäkitalo. 2015. Supervi-
sion at the outline stage: Introducing and encounter-
ing issues of sustainable development through aca-
demic writing assignments. Text & Talk, 35(2):123–
153.
Angela Fan and Claire Gardent. 2022. Generating bi-
ographies on Wikipedia: The impact of gender bias
on the retrieval-based generation of women biogra-
phies. In Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics (Vol-
ume 1: Long Papers), pages 8561–8576, Dublin,
Ireland. Association for Computational Linguistics.
Xiaocheng Feng, Ming Liu, Jiahao Liu, Bing Qin, Yibo
Sun, and Ting Liu. 2018. Topic-to-essay generation
with neural networks. In IJCAI, pages 4078–4084.
Tira Nur Fitria. 2023. Artificial intelligence (ai) tech-
nology in openai chatgpt application: A review of
chatgpt in writing english essay. In ELT Forum: Jour-
nal of English Language Teaching, volume 12, pages
44–58.
Pasi Fränti and Radu Mariescu-Istodor. 2023. Soft preci-
sion and recall. Pattern Recognition Letters, 167:115–
121.
R Edward Freeman, Jeffrey S Harrison, Andrew C
Wicks, Bidhan L Parmar, and Simone De Colle. 2010.
Stakeholder theory: The state of the art.
Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen.
2023. Enabling large language models to generate
text with citations. In Proceedings of the 2023 Con-
ference on Empirical Methods in Natural Language
Processing, pages 6465–6488, Singapore. Associa-
tion for Computational Linguistics.
Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong,
Zhangyin Feng, Haotian Wang, Qianglong Chen,
Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting
Liu. 2023. A survey on hallucination in large lan-
guage models: Principles, taxonomy, challenges, and
open questions.
Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas
Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-
Yu, Armand Joulin, Sebastian Riedel, and Edouard
Grave. 2023. Atlas: Few-shot learning with retrieval
augmented language models. Journal of Machine
Learning Research, 24(251):1–43.
Albert Q Jiang, Alexandre Sablayrolles, Arthur Men-
sch, Chris Bamford, Devendra Singh Chaplot, Diego
de las Casas, Florian Bressand, Gianna Lengyel, Guil-
laume Lample, Lucile Saulnier, et al. 2023a. Mistral
7b. arXiv preprint arXiv:2310.06825.
Zhengbao Jiang, Frank Xu, Luyu Gao, Zhiqing Sun,
Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie
Callan, and Graham Neubig. 2023b. Active retrieval
augmented generation. In Proceedings of the 2023
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 7969–7992, Singapore. As-
sociation for Computational Linguistics.
Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric
Wallace, and Colin Raffel. 2023. Large language
models struggle to learn long-tail knowledge. In In-
ternational Conference on Machine Learning, pages
15696–15707. PMLR.
Omar Khattab, Keshav Santhanam, Xiang Lisa
Li, David Hall, Percy Liang, Christopher Potts,
and Matei Zaharia. 2022. Demonstrate-search-
predict: Composing retrieval and language mod-
els for knowledge-intensive NLP. arXiv preprint
arXiv:2212.14024.
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari,
Zhiyuan Zhang, Keshav Santhanam, Sri Vard-
hamanan, Saiful Haq, Ashutosh Sharma, Thomas T.
Joshi, Hanna Moazam, Heather Miller, Matei Za-
haria, and Christopher Potts. 2023. Dspy: Compiling
declarative language model calls into self-improving
pipelines. arXiv preprint arXiv:2310.03714.
Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang,
Shayne Longpre, Hwaran Lee, Sangdoo Yun,
Seongjin Shin, Sungdong Kim, James Thorne, et al.
2023. Prometheus: Inducing fine-grained evalua-
tion capability in language models. arXiv preprint
arXiv:2310.08491.
Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2022.
Internet-augmented dialogue generation. In Proceed-
ings of the 60th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Pa-
pers), pages 8460–8478, Dublin, Ireland. Association
for Computational Linguistics.
Kalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit
Iyyer, Pradeep Dasigi, Arman Cohan, and Kyle Lo.
2023. LongEval: Guidelines for human evaluation of
faithfulness in long-form summarization. In Proceed-
ings of the 17th Conference of the European Chap-
ter of the Association for Computational Linguistics,
pages 1650–1669, Dubrovnik, Croatia. Association
for Computational Linguistics.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio
Petroni, Vladimir Karpukhin, Naman Goyal, Hein-
rich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock-
täschel, et al. 2020. Retrieval-augmented generation
for knowledge-intensive nlp tasks. Advances in Neu-
ral Information Processing Systems, 33:9459–9474.
Xiaonan Li, Kai Lv, Hang Yan, Tianyang Lin, Wei Zhu,
Yuan Ni, Guotong Xie, Xiaoling Wang, and Xipeng
Qiu. 2023. Unified demonstration retriever for in-
context learning. In Proceedings of the 61st Annual
Meeting of the Association for Computational Lin-
guistics (Volume 1: Long Papers), pages 4644–4668,
Toronto, Canada. Association for Computational Lin-
guistics.
Chin-Yew Lin. 2004. ROUGE: A package for auto-
matic evaluation of summaries. In Text Summariza-
tion Branches Out, pages 74–81, Barcelona, Spain.
Association for Computational Linguistics.
Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan,
Lawrence Carin, and Weizhu Chen. 2022. What
makes good in-context examples for GPT-3? In
Proceedings of Deep Learning Inside Out (DeeLIO
2022): The 3rd Workshop on Knowledge Extrac-
tion and Integration for Deep Learning Architectures,
pages 100–114, Dublin, Ireland and Online. Associa-
tion for Computational Linguistics.
Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben
Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam
Shazeer. 2018. Generating wikipedia by summariz-
ing long sequences. In International Conference on
Learning Representations.
Jacob Menick, Maja Trebacz, Vladimir Mikulik,
John Aslanides, Francis Song, Martin Chadwick,
Mia Glaese, Susannah Young, Lucy Campbell-
Gillingham, Geoffrey Irving, and Nat McAleese.
2022. Teaching language models to support answers
with verified quotes.
Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis,
Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettle-
moyer, and Hannaneh Hajishirzi. 2023. FActScore:
Fine-grained atomic evaluation of factual precision
in long form text generation. In Proceedings of the
2023 Conference on Empirical Methods in Natural
Language Processing, pages 12076–12100, Singa-
pore. Association for Computational Linguistics.
Julià Minguillón, Maura Lerga, Eduard Aibar, Josep
Lladós-Masllorens, and Antoni Meseguer-Artola.
2017. Semi-automatic generation of a corpus of
wikipedia articles on science and technology. Profe-
sional de la Información, 26(5):995–1005.
Rosa Munoz-Luna. 2015. Main ingredients for suc-
cess in l2 academic writing: Outlining, drafting and
proofreading. PloS one, 10(6):e0128309.
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu,
Long Ouyang, Christina Kim, Christopher Hesse,
Shantanu Jain, Vineet Kosaraju, William Saunders,
Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen
Krueger, Kevin Button, Matthew Knight, Benjamin
Chess, and John Schulman. 2022. Webgpt: Browser-
assisted question-answering with human feedback.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida,
Carroll Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, et al.
2022. Training language models to follow instruc-
tions with human feedback. Advances in Neural
Information Processing Systems, 35:27730–27744.
Aaron Parisi, Yao Zhao, and Noah Fiedel. 2022. Talm:
Tool augmented language models.
John V Pavlik. 2023. Collaborating with chatgpt: Con-
sidering the implications of generative artificial intel-
ligence for journalism and media education. Journal-
ism & Mass Communication Educator, 78(1):84–93.
Gabriel Poesia, Alex Polozov, Vu Le, Ashish Tiwari,
Gustavo Soares, Christopher Meek, and Sumit Gul-
wani. 2022. Synchromesh: Reliable code generation
from pre-trained language models. In International
Conference on Learning Representations.
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt,
Noah Smith, and Mike Lewis. 2023. Measuring and
narrowing the compositionality gap in language mod-
els. In Findings of the Association for Computational
Linguistics: EMNLP 2023, pages 5687–5711, Singa-
pore. Association for Computational Linguistics.
Peng Qi, Yuhao Zhang, and Christopher D. Manning.
2020. Stay hungry, stay focused: Generating infor-
mative and specific questions in information-seeking
conversations. In Findings of the Association for
Computational Linguistics: EMNLP 2020, pages 25–
40, Online. Association for Computational Linguis-
tics.
Hongjing Qian, Yutao Zhu, Zhicheng Dou, Haoqi Gu,
Xinyu Zhang, Zheng Liu, Ruofei Lai, Zhao Cao,
Jian-Yun Nie, and Ji-Rong Wen. 2023. Webbrain:
Learning to generate factually correct articles for
queries by grounding on large web corpus.
Hossein A. Rahmani, Xi Wang, Yue Feng, Qiang Zhang,
Emine Yilmaz, and Aldo Lipani. 2023. A survey on
asking clarification questions datasets in conversa-
tional systems. In Proceedings of the 61st Annual
Meeting of the Association for Computational Lin-
guistics (Volume 1: Long Papers), pages 2698–2716,
Toronto, Canada. Association for Computational Lin-
guistics.
Ashwin Ram. 1991. A theory of questions and question
asking. Journal of the Learning Sciences, 1(3-4):273–
318.
Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay,
Amnon Shashua, Kevin Leyton-Brown, and Yoav
Shoham. 2023. In-context retrieval-augmented lan-
guage models. Transactions of the Association for
Computational Linguistics.
Nils Reimers and Iryna Gurevych. 2019. Sentence-
BERT: Sentence embeddings using Siamese BERT-
networks. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP), pages
3982–3992, Hong Kong, China. Association for Com-
putational Linguistics.
D Gordon Rohman. 1965. Pre-writing the stage of dis-
covery in the writing process. College composition
and communication, 16(2):106–112.
Christina Sauper and Regina Barzilay. 2009. Auto-
matically generating Wikipedia articles: A structure-
aware approach. In Proceedings of the Joint Con-
ference of the 47th Annual Meeting of the ACL and
the 4th International Joint Conference on Natural
Language Processing of the AFNLP, pages 208–216,
Suntec, Singapore. Association for Computational
Linguistics.
Sina Semnani, Violet Yao, Heidi Zhang, and Monica
Lam. 2023. WikiChat: Stopping the hallucination of
large language model chatbots by few-shot ground-
ing on Wikipedia. In Findings of the Association
for Computational Linguistics: EMNLP 2023, pages
2387–2413, Singapore. Association for Computa-
tional Linguistics.
Zejiang Shen, Tal August, Pao Siangliulue, Kyle Lo,
Jonathan Bragg, Jeff Hammerbacher, Doug Downey,
Joseph Chee Chang, and David Sontag. 2023. Be-
yond summarization: Designing ai support for real-
world expository writing tasks.
Weijia Shi, Julian Michael, Suchin Gururangan, and
Luke Zettlemoyer. 2022. Nearest neighbor zero-shot
inference. In Proceedings of the 2022 Conference on
Empirical Methods in Natural Language Processing,
pages 3254–3265, Abu Dhabi, United Arab Emirates.
Association for Computational Linguistics.
Kurt Shuster, Mojtaba Komeili, Leonard Adolphs,
Stephen Roller, Arthur Szlam, and Jason Weston.
2022. Language models that seek for knowledge:
Modular search & generation for dialogue and
prompt completion. In Findings of the Association
for Computational Linguistics: EMNLP 2022, pages
373–393, Abu Dhabi, United Arab Emirates. Associ-
ation for Computational Linguistics.
Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela,
and Jason Weston. 2021. Retrieval augmentation
reduces hallucination in conversation. In Findings
of the Association for Computational Linguistics:
EMNLP 2021, pages 3784–3803, Punta Cana, Do-
minican Republic. Association for Computational
Linguistics.
Christine M Tardy. 2010. Writing for the world:
Wikipedia as an introduction to academic writing. In
English teaching forum, volume 48, page 12. ERIC.
Andrew A Tawfik, Arthur Graesser, Jessica Gatewood,
and Jaclyn Gishbaugher. 2020. Role of questions in
inquiry-based instruction: towards a design taxon-
omy for question-asking and implications for design.
Educational Technology Research and Development,
68:653–678.
Charles A Weaver III and Walter Kintsch. 1991. Expos-
itory text.
Karsten Wenzlaff and Sebastian Spaeth. 2022. Smarter
than humans? validating how openai’s chatgpt model
explains crowdfunding, alternative finance and com-
munity finance. Validating how OpenAI’s ChatGPT
model explains Crowdfunding, Alternative Finance
and Community Finance.(December 22, 2022).
Fangyuan Xu, Yixiao Song, Mohit Iyyer, and Eunsol
Choi. 2023. A critical evaluation of evaluations for
long-form question answering. In Proceedings of the
61st Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers), pages
3225–3245, Toronto, Canada. Association for Com-
putational Linguistics.
Kevin Yang, Dan Klein, Nanyun Peng, and Yuandong
Tian. 2023. DOC: Improving long story coherence
with detailed outline control. In Proceedings of the
61st Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers), pages
3378–3465, Toronto, Canada. Association for Com-
putational Linguistics.
Kevin Yang, Yuandong Tian, Nanyun Peng, and Dan
Klein. 2022. Re3: Generating longer stories with
recursive reprompting and revision. In Proceedings
of the 2022 Conference on Empirical Methods in Nat-
ural Language Processing, pages 4393–4479, Abu
Dhabi, United Arab Emirates. Association for Com-
putational Linguistics.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak
Shafran, Karthik R Narasimhan, and Yuan Cao. 2023.
React: Synergizing reasoning and acting in language
models. In The Eleventh International Conference
on Learning Representations.
Cyril Zakka, Akash Chaurasia, Rohan Shad, Alex R
Dalal, Jennifer L Kim, Michael Moor, Kevin Alexan-
der, Euan Ashley, Jack Boyd, Kathleen Boyd, et al.
2023. Almanac: Retrieval-augmented language mod-
els for clinical medicine. Research Square.
Shuyan Zhou, Uri Alon, Frank F. Xu, Zhengbao Jiang,
and Graham Neubig. 2023. Docprompting: Gener-
ating code by retrieving the docs. In The Eleventh
International Conference on Learning Representa-
tions.
Table 7: Statistics of the dataset used in our experiments.
    Average Number of Sections: 8.4
    Average Number of All-level Headings: 15.8
    Average Length of a Section: 327.8
    Average Length of Total Article: 2159.1
    Average Number of References: 90.1

Figure 4: Evolution of reference count in the Wikipedia article editing process.

Figure 5: Distribution of edit counts for Wikipedia articles in our experiments (n = 100).

A    Pseudo Code of STORM
In §3, we introduce STORM, a framework that au-
tomates the pre-writing stage by discovering differ-
ent perspectives, simulating information-seeking
conversations, and creating a comprehensive out-
line. Algorithm 1 displays the skeleton of STORM.
We implement STORM with zero-shot prompt-
ing using the DSPy framework (Khattab et al.,
2023). Listing 1 and 2 show the prompts used
in our implementation. We highlight that STORM
offers a general framework designed to assist the
creation of grounded, long-form articles, without
depending extensively on prompt engineering for a
single domain.
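For illustration, here is a minimal sketch of how one of the signature classes from Listing 1 might be invoked through DSPy's Predict module. The language model configuration and the example inputs are placeholders, not the exact setup used in our experiments.

import dspy

# Placeholder LM configuration; substitute whichever backend you actually use.
lm = dspy.OpenAI(model="gpt-3.5-turbo", max_tokens=500)
dspy.settings.configure(lm=lm)

# GenQueriesPrompt is the zero-shot signature shown in Listing 1; dspy.Predict
# turns it into a callable module whose output fields become attributes of the
# returned prediction.
gen_queries = dspy.Predict(GenQueriesPrompt)
pred = gen_queries(
    topic="Taylor Hawkins",
    question="Which bands did Taylor Hawkins play in before joining the Foo Fighters?",
)
print(pred.queries)  # One search query per line, per the signature's instructions.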
B    Dataset Details

As discussed in §2.1, we curate the FreshWiki dataset by collecting recent and high-quality English Wikipedia articles. We select the most-edited pages over a specific period rather than using creation dates as a cutoff, because most Wikipedia articles are "stubs" or are of low quality when first created. For quality, we consider articles predicted to be of B-class quality or above. According to Wikipedia statistics (https://en.wikipedia.org/wiki/Wikipedia:Content_assessment), only around 3% of existing Wikipedia pages meet this quality standard.
As LLMs can generate reasonably good outputs,
we think it is important to use high-quality human-
written articles as references for further research.
For experiments in this work, we randomly select 100 samples whose human-written articles are under 3000 words to allow a meaningful comparison.
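As a rough illustration of this selection step, the sketch below filters candidate pages by predicted quality class and article length and keeps the most-edited ones. The metadata fields and the candidates list are hypothetical stand-ins for however the page statistics are gathered.

# Hypothetical per-page metadata (edit counts over the target period, predicted
# quality class, article length in words), gathered elsewhere.
candidates = [
    {"title": "Example Topic", "edit_count": 412, "quality": "B", "word_count": 2450},
    # ...
]

GOOD_CLASSES = {"FA", "A", "GA", "B"}  # B-class quality or above

freshwiki_sample = [
    page
    for page in sorted(candidates, key=lambda p: p["edit_count"], reverse=True)
    if page["quality"] in GOOD_CLASSES and page["word_count"] < 3000
][:100]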
Table 7 gives the data statistics. Notably, human-
authored articles have a large number of references
but they require numerous edits to achieve this. Fig-
ure 4 illustrates the evolution of the reference count
in the article edit process and Figure 5 gives the dis-
tribution of edit counts for human-authored articles
used in our experiments.

C    Automatic Evaluation Details

C.1    Soft Heading Recall

We calculate the soft heading recall between the multi-level headings in the generated outline, considered as the prediction P, and those in the human-written article, considered as the ground truth G. The calculation is based on the soft recall definition in Fränti and Mariescu-Istodor (2023). Given a set A = {A_i}_{i=1}^{K}, the soft count of an item is defined as the inverse of the sum of its similarity to the other items in the set:

    count(A_i) = 1 / \sum_{j=1}^{K} Sim(A_i, A_j),    (1)
    Sim(A_i, A_j) = cos(embed(A_i), embed(A_j)),

where embed(·) in Equation (1) is parameterized by paraphrase-MiniLM-L6-v2 provided in the Sentence-Transformers library (https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2).
class GenRelatedTopicsPrompt(dspy.Signature):
    """
    I'm writing a Wikipedia page for a topic mentioned below. Please identify and
    recommend some Wikipedia pages on closely related subjects. I'm looking for
    examples that provide insights into interesting aspects commonly associated
    with this topic, or examples that help me understand the typical content and
    structure included in Wikipedia pages for similar topics.
    Please list the urls in separate lines.
    """

    topic = dspy.InputField(prefix="Topic of interest:", format=str)
    related_topics = dspy.OutputField()


class GenPerspectivesPrompt(dspy.Signature):
    """
    You need to select a group of Wikipedia editors who will work together to create
    a comprehensive article on the topic. Each of them represents a different
    perspective, role, or affiliation related to this topic. You can use other
    Wikipedia pages of related topics for inspiration. For each editor, add
    description of what they will focus on.
    Give your answer in the following format: 1. short summary of editor 1:
    description\n2. short summary of editor 2: description\n...
    """

    topic = dspy.InputField(prefix='Topic of interest: ', format=str)
    examples = dspy.InputField(
        prefix='Wiki page outlines of related topics for inspiration:\n', format=str)
    perspectives = dspy.OutputField()


class GenQnPrompt(dspy.Signature):
    """
    You are an experienced Wikipedia writer and want to edit a specific page.
    Besides your identity as a Wikipedia writer, you have a specific focus when
    researching the topic.
    Now, you are chatting with an expert to get information. Ask good questions to
    get more useful information.
    When you have no more question to ask, say "Thank you so much for your help!" to
    end the conversation.
    Please only ask one question at a time and don't ask what you have asked before.
    Your questions should be related to the topic you want to write.
    """

    topic = dspy.InputField(prefix='Topic you want to write: ', format=str)
    persona = dspy.InputField(prefix='Your specific perspective: ', format=str)
    conv = dspy.InputField(prefix='Conversation history:\n', format=str)
    question = dspy.OutputField()


class GenQueriesPrompt(dspy.Signature):
    """
    You want to answer the question using Google search. What do you type in the
    search box?
    Write the queries you will use in the following format: - query 1\n- query 2\n...
    """

    topic = dspy.InputField(prefix='Topic you are discussing about: ', format=str)
    question = dspy.InputField(prefix='Question you want to answer: ', format=str)
    queries = dspy.OutputField()

Listing 1: Prompts used in STORM, corresponding to Line 4, 11, 19, 22 in Algorithm 1.
class GenAnswerPrompt(dspy.Signature):
    """
    You are an expert who can use information effectively. You are chatting with a
    Wikipedia writer who wants to write a Wikipedia page on topic you know. You
    have gathered the related information and will now use the information to
    form a response.
    Make your response as informative as possible and make sure every sentence is
    supported by the gathered information.
    """

    topic = dspy.InputField(prefix='Topic you are discussing about: ', format=str)
    conv = dspy.InputField(prefix='Question:\n', format=str)
    info = dspy.InputField(prefix='Gathered information:\n', format=str)
    answer = dspy.OutputField(prefix='Now give your response:\n')


class DirectGenOutlinePrompt(dspy.Signature):
    """
    Write an outline for a Wikipedia page.
    Here is the format of your writing:
    1. Use "# Title" to indicate section title, "## Title" to indicate subsection
       title, "### Title" to indicate subsubsection title, and so on.
    2. Do not include other information.
    """

    topic = dspy.InputField(prefix="Topic you want to write: ", format=str)
    outline = dspy.OutputField(prefix="Write the Wikipedia page outline:\n")


class RefineOutlinePrompt(dspy.Signature):
    """
    Improve an outline for a Wikipedia page. You already have a draft outline that
    covers the general information. Now you want to improve it based on the
    information learned from an information-seeking conversation to make it more
    comprehensive.
    Here is the format of your writing:
    1. Use "# Title" to indicate section title, "## Title" to indicate subsection
       title, "### Title" to indicate subsubsection title, and so on.
    2. Do not include other information.
    """

    topic = dspy.InputField(prefix="Topic you want to write: ", format=str)
    conv = dspy.InputField(prefix="Conversation history:\n", format=str)
    old_outline = dspy.OutputField(prefix="Current outline:\n", format=str)
    outline = dspy.OutputField(prefix='Write the Wikipedia page outline:\n')

Listing 2: Prompts used in STORM (continued), corresponding to Line 24, 31, 32 in Algorithm 1.
Algorithm 1: STORM
    Input: Topic t, maximum perspective N, maximum conversation round M
    Output: Outline O, references R
     1  P0 = "basic fact writer ..."            // Constant.
     2  R ← [ ]
     3  // Discover perspectives P.
     4  related_topics ← gen_related_topics(t)
     5  tocs ← [ ]
     6  foreach related_t in related_topics do
     7      article ← get_wiki_article(related_t)
     8      if article then
     9          tocs.append(extract_toc(article))
            end
    10  end
    11  P ← gen_perspectives(t, tocs)
    12  P ← [P0] + P[:N]
    13  // Simulate conversations.
    14  convos ← [ ]
    15  foreach p in P do
    16      convo_history ← [ ]
    17      for i = 1 to M do
    18          // Question asking.
    19          q ← gen_qn(t, p, convo_history)
    20          convo_history.append(q)
    21          // Question answering.
    22          queries ← gen_queries(t, q)
    23          sources ← search_and_sift(queries)
    24          a ← gen_ans(t, q, sources)
    25          convo_history.append(a)
    26          R.append(sources)
    27      end
    28      convos.append(convo_history)
    29  end
    30  // Create the outline.
    31  O_D ← direct_gen_outline(t)
    32  O ← refine_outline(t, O_D, convos)
    33  return O, R

The cardinality of A is the sum of the soft counts of its individual items:

    card(A) = \sum_{i=1}^{K} count(A_i).    (2)

The soft heading recall is then calculated as

    soft heading recall = card(G ∩ P) / card(G),    (3)

where the cardinality of the intersection is defined via the union as follows:

    card(G ∩ P) = card(G) + card(P) − card(G ∪ P).    (4)
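For concreteness, a minimal Python sketch of this metric under the definitions above; the helper names are our own, and treating G ∪ P as the concatenation of the two heading lists is one natural reading of Equation (4).

import torch
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-MiniLM-L6-v2")

def soft_card(embeddings: torch.Tensor) -> float:
    """card(A) per Equations (1)-(2): sum over items of 1 / sum_j Sim(A_i, A_j)."""
    sims = util.cos_sim(embeddings, embeddings)      # K x K cosine similarities
    return float((1.0 / sims.sum(dim=1)).sum())

def soft_heading_recall(pred_headings: list[str], gold_headings: list[str]) -> float:
    p = encoder.encode(pred_headings, convert_to_tensor=True)
    g = encoder.encode(gold_headings, convert_to_tensor=True)
    card_p, card_g = soft_card(p), soft_card(g)
    card_union = soft_card(torch.cat([g, p], dim=0))  # G ∪ P as concatenation (assumption)
    card_intersection = card_g + card_p - card_union  # Equation (4)
    return card_intersection / card_g                 # Equation (3)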
C.2    LLM Evaluator
We use Prometheus (Kim et al., 2023; https://huggingface.co/kaist-ai/prometheus-13b-v1.0), a 13B open-source evaluator LLM that can assess long-form text based on a customized 1-5 scale rubric, to grade the articles on the aspects of Interest Level, Coherence and Organization, Relevance and Focus, and Coverage. Table 8 gives our grading rubric. While Prometheus is best used with a score-5 reference answer, we find that adding the reference exceeds the context length limit of the model. Since Kim et al. (2023) show that Prometheus ratings without a reference also correlate well with human preferences, we omit the reference and trim the input article to within 2000 words by iteratively removing content from the shortest section, so that the input fits into the model's context window.
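As an illustration, here is a minimal sketch of this length-trimming step; the section parsing and the function and variable names are our own, while the 2000-word budget and shortest-section heuristic follow the description above.

def trim_to_budget(sections, max_words=2000):
    """Drop the shortest remaining section until the article fits the word budget.

    `sections` is assumed to be a list of (title, text) pairs; this mirrors the
    description above rather than the exact implementation.
    """
    sections = list(sections)
    word_count = lambda text: len(text.split())
    total = sum(word_count(text) for _, text in sections)
    while total > max_words and len(sections) > 1:
        # Find and remove the shortest remaining section.
        idx = min(range(len(sections)), key=lambda i: word_count(sections[i][1]))
        total -= word_count(sections[idx][1])
        sections.pop(idx)
    return sections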
C.3    More Discussion of the Citation Quality

Figure 6: Error analysis of unsupported sentences in 10 sampled articles (Lack Citation: 47%, False Negative: 15%, Improper Inferential Linking: 14%, Incorrectly Split: 12%, Inaccurate Paraphrasing: 7%, Irrelevant Source: 4%, Others: 1%).
Interest Level: How engaging and thought-provoking is the article?
Score 1: Not engaging at all; no attempt to capture the reader's attention.
Score 2: Fairly engaging with a basic narrative but lacking depth.
Score 3: Moderately engaging with several interesting points.
Score 4: Quite engaging with a well-structured narrative and noteworthy points that frequently capture and retain attention.
Score 5: Exceptionally engaging throughout, with a compelling narrative that consistently stimulates interest.

Coherence and Organization: Is the article well-organized and logically structured?
Score 1: Disorganized; lacks logical structure and coherence.
Score 2: Fairly organized; a basic structure is present but not consistently followed.
Score 3: Organized; a clear structure is mostly followed with some lapses in coherence.
Score 4: Good organization; a clear structure with minor lapses in coherence.
Score 5: Excellently organized; the article is logically structured with seamless transitions and a clear argument.

Relevance and Focus: Does the article stay on topic and maintain a clear focus?
Score 1: Off-topic; the content does not align with the headline or core subject.
Score 2: Somewhat on topic but with several digressions; the core subject is evident but not consistently adhered to.
Score 3: Generally on topic, despite a few unrelated details.
Score 4: Mostly on topic and focused; the narrative has a consistent relevance to the core subject with infrequent digressions.
Score 5: Exceptionally focused and entirely on topic; the article is tightly centered on the subject, with every piece of information contributing to a comprehensive understanding of the topic.

Broad Coverage: Does the article provide an in-depth exploration of the topic and have good coverage?
Score 1: Severely lacking; offers little to no coverage of the topic's primary aspects, resulting in a very narrow perspective.
Score 2: Partial coverage; includes some of the topic's main aspects but misses others, resulting in an incomplete portrayal.
Score 3: Acceptable breadth; covers most main aspects, though it may stray into minor unnecessary details or overlook some relevant points.
Score 4: Good coverage; achieves broad coverage of the topic, hitting on all major points with minimal extraneous information.
Score 5: Exemplary in breadth; delivers outstanding coverage, thoroughly detailing all crucial aspects of the topic without including irrelevant information.

Table 8: Scoring rubrics on a 1-5 scale for the evaluator LLM.
Error Type: Improper Inferential Linking
Topic: Lahaina, Hawaii
Unsupported Sentence: "Throughout its history, religion has remained the paramount aspect of Hawaiian life in Lahaina, permeating every daily activity and significant event[5]."
Source: [5] “Religion, Beliefs & Spirituality” (The source discusses religion as part of Hawaiian life but does not mention Lahaina.)

Error Type: Inaccurate Paraphrasing
Topic: 2022 Crimean Bridge explosion
Unsupported Sentence: "Completed in June 2020, the bridge serves as a major supply route for Russian forces in the region and is significant to Russia's claim over the disputed territory[2][11]."
Source: [2] “Crimean Bridge - Wikipedia” (The source says “The first scheduled passenger train crossed the bridge on 25 December 2019, while the bridge was opened for freight trains on 30 June 2020”.)

Error Type: Citing Irrelevant Sources
Topic: LK-99
Unsupported Sentence: "For example, comparisons have been drawn between the performance of LK-9 and the dynamic resolution capabilities of video games such as Battlefield 2042[22]."
Source: [22] “Battlefield 2042 PC performance guide: The best settings for a high frame rate” (The source is irrelevant to LK-99.)

Table 9: Examples of different error types of unsupported sentences.
We use Mistral 7B-Instruct (Jiang et al., 2023a; https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) to examine whether the cited passages entail the generated sentence. Table 4 reports the citation quality of articles produced by our approach, showing that around 15% of sentences in the generated articles are unsupported by citations. We further investigate the failure cases by randomly sampling 10 articles; one author manually examined all the unsupported sentences in these articles. Following Gao et al. (2023), we check citation quality at the sentence level and split articles into sentences using NLTK sent_tokenize, which sometimes fails to split sentences correctly when the article contains special strings such as “No.12847” or “Bhatia et al.”. Besides sentences that are incorrectly split, lack citations, or are deemed supported by the author's judgment, our analysis identifies three main error categories (examples are given in Table 9): improper inferential linking, inaccurate paraphrasing, and citing irrelevant sources.
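For concreteness, a sketch of this sentence-level check; the prompt wording and the `get_cited_passages` and `generate` hooks are illustrative placeholders rather than the exact setup we used.

from nltk.tokenize import sent_tokenize

def find_unsupported(article_text, get_cited_passages, generate):
    """Sentence-level citation check, sketched after the procedure above.

    `get_cited_passages(sentence)` returns the source passages cited by a sentence,
    and `generate(prompt)` is a placeholder wrapper around Mistral 7B-Instruct (or
    any evaluator LLM) that returns the model's text output.
    """
    unsupported = []
    for sentence in sent_tokenize(article_text):
        passages = get_cited_passages(sentence)
        if not passages:
            unsupported.append((sentence, "lacks citation"))
            continue
        prompt = (
            "Passages:\n" + "\n\n".join(passages)
            + f"\n\nClaim: {sentence}\n\n"
            + "Do the passages fully support the claim? Answer Yes or No."
        )
        if not generate(prompt).strip().lower().startswith("yes"):
            unsupported.append((sentence, "unsupported"))
    return unsupported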
We show the error distribution in Figure 6. No-
tably, the most common errors stem from the ten-
dency of LLMs to form improper inferential links
between different pieces of information presented
in the context window. Our analysis of citation
quality suggests that, in addition to avoiding hallu-
cinations, future research in grounded text gener-
ation should also focus on preventing LLMs from
making overly inferential leaps based on the pro-
vided information.
D    Human Evaluation Details

We recruited 10 experienced Wikipedia editors to participate in our study by creating a research page on Meta-Wiki (https://meta.wikimedia.org) and reaching out to active editors who have recently approved articles for Wikipedia. Since evaluating Wikipedia-like articles is time-consuming and requires expertise, we paid each participant $50 for the study. Our participant group includes 3 editors with 1-5 years of experience, 4 with 6-10 years, and 3 with over 15 years of contribution. The study was approved by the Institutional Review Board of our institution, and the participants signed the consent form through Qualtrics questionnaires before the study started.
To streamline the evaluation of grounded articles,
we developed a web application, which features a
side-by-side display of the article and its citation
snippets, to gather ratings and open-ended feedback
for each article. Figure 7 shows a screenshot of our web application, and the full article produced by STORM is included in Table 12. For the human evaluation, we use a 1 to 7 scale for more fine-grained assessment; the grading rubric is included in Table 10.
We collected the pairwise preferences and the perceived usefulness of STORM via an online questionnaire. Specifically, for perceived usefulness, we asked editors to rate their agreement with the statements “I think it can be specifically helpful for my pre-writing stage (e.g., collecting relevant sources, outlining, drafting).”, “I think it will help me edit a Wikipedia article for a new topic”, and “I think it can be a potentially useful tool for the Wikipedia community” on a 1-5 Likert scale, corresponding to Strongly disagree, Somewhat disagree, Neither agree nor disagree, Somewhat agree, and Strongly agree.
E    Error Analysis
While articles produced by STORM are preferred under both automatic metrics and human evaluation, experienced editors still identified multiple problems with the machine-generated articles. We analyze their free-form comments and summarize the major issues in Table 11.
The primary issue raised is that the generated articles often contain emotional language and lack neutrality, largely because of the source material: STORM currently retrieves grounding sources from the Internet, which is not neutral and itself contains considerable promotional content.
Addressing this bias in the pre-writing stage repre-
sents a valuable direction for future research. An-
other major issue is the red herring fallacy or the
over-association of unrelated facts. Addressing this
challenge calls for high-level sensemaking rather
than mere fact-level verification.
Interest Level
1: Not engaging at all; no attempt to capture the reader’s attention.
2: Slightly engaging with rare moments that capture attention.
3: Fairly engaging with a basic narrative but lacking depth.
4: Moderately engaging with several interesting points.
5: Quite engaging with a well-structured narrative and noteworthy points that frequently capture and retain attention.
6: Very engaging with a compelling narrative that captures and mostly retains attention.
7: Exceptionally engaging throughout, with a compelling narrative that consistently stimulates interest.
Coherence and Organization
1: Disorganized; lacks logical structure and coherence.
2: Poor organization; some structure is evident but very weak.
3: Fairly organized; a basic structure is present but not consistently followed.
4: Organized; a clear structure is mostly followed with some lapses in coherence.
5: Good organization; a clear structure with minor lapses in coherence.
6: Very well-organized; a logical structure with transitions that effectively guide the reader.
7: Excellently organized; the article is logically structured with seamless transitions and a clear argument.
Relevance and Focus
1: Off-topic; the content does not align with the headline or core subject.
2: Mostly off-topic with some relevant points.
3: Somewhat on topic but with several digressions; the core subject is evident but not consistently adhered to.
4: Generally on topic, despite a few unrelated details.
5: Mostly on topic and focused; the narrative has a consistent relevance to the core subject with infrequent digressions.
6: Highly relevant with a focused narrative and purpose.
7: Exceptionally focused and entirely on topic; the article is tightly centered on the subject, with every piece of information contributing to a
comprehensive understanding of the topic.
Broad Coverage
1: Severely lacking; offers little to no coverage of the topic’s primary aspects, resulting in a very narrow perspective.
2: Minimal coverage; addresses only a small selection of the topic’s main aspects, with significant omissions.
3: Partial coverage; includes some of the topic’s main aspects but misses others, resulting in an incomplete portrayal.
4: Acceptable breadth; covers most main aspects, though it may stray into minor unnecessary details or overlook some relevant points.
5: Good coverage; achieves broad coverage of the topic, hitting on all major points with minimal extraneous information.
6: Comprehensive; provides thorough coverage of all significant aspects of the topic, with a well-balanced focus.
7: Exemplary in breadth; delivers outstanding coverage, thoroughly detailing all crucial aspects of the topic without including irrelevant information.
Verifiability
1: No supporting evidence; claims are unsubstantiated.
2: Rarely supported with evidence; many claims are unsubstantiated.
3: Inconsistently verified; some claims are supported; evidence is occasionally provided.
4: Generally verified; claims are usually supported with evidence; however, there might be a few instances where verification is lacking.
5: Well-supported; claims are very well supported with credible evidence, and instances of unsupported claims are rare.
6: Very well-supported; almost every claim is substantiated with credible evidence, showing a high level of thorough verification.
7: Exemplary verification; each claim is supported by robust, credible evidence from authoritative sources, reflecting strict adherence to the no
original research policy.
Table 10: Scoring rubrics on a 1-7 scale for human evaluation.
Issue: Use of emotional words, unneutral (mentioned 12 times)
Example comments:
- "The word “significant” is used 17 times in this article. Vague and unsupported claims are made about broader political importance and “pivotal role[s]”, and is unencyclopedic." (comment on article Lahaina, Hawaii)
- "[...] but they still have not fixed the issue of neutral point of view. It is also evident in this article that the writer's standpoint is biased towards Taylor Swift. Other than that, it did a good job at summarizing key points and putting depth into this." (comment on article Speak Now (Taylor's Version))
- "“The film was also featured in an art and film festival hosted by The California Endowment, highlighting the power of stories in reshaping narratives about communities.” Yes, technically the source says that, but it's a stretch to say in Wikipedia voice and just sounds like non-neutral, promotional prose." (comment on article Gehraiyaan)

Issue: Red herring fallacy, associating unrelated sources (mentioned 11 times)
Example comments:
- "Polling from America shouldn't be included and links to climate change shouldn't be made unless explicitly connected by the source." (comment on article Typhoon Hinnamnor)
- "Sourcing seems mostly fine, though some aren't directly related (Ex. 39, 40)." (comment on article Gehraiyaan)
- "Here is a lengthy digression about KISS, not necessary because the article on the band should be linked to." (comment on article 2022 AFL Grand Final)

Issue: Missing important information (mentioned 6 times)
Example comments:
- "“One study, conducted by Sinéad Griffin, a physicist at the Lawrence Berkeley National Laboratory, provided some analysis of LK-99's abilities using supercomputer simulations[20].” This is not enough information about the analysis, which would have been very useful in the article." (comment on article LK-99)
- "Although the earthquake's immediate aftermath and response are adequately covered, there could be more about the long-term socioeconomic impact and recovery processes." (comment on article 2022 West Java earthquake)

Issue: Improper handling of time-sensitive information (mentioned 5 times)
Example comments:
- "Words like “now” should be avoided in Wikipedia articles to prevent them from becoming dated and phrases such as “as of December 2023” should be used instead." (comment on article Cyclone Batsirai)
- "“as of December 13” doesn't specify a year, and is old information" (comment on article 2022 West Java earthquake)

Issue: Section organization problem (mentioned 5 times)
Example comments:
- "too many subsections in the “Recovery and Rehabilitation” section" (comment on article 2022 West Java earthquake)
- "I do not like how the article is organized, with too many headers cluttering the article, making it not as readable. Other than that, the AI did great work on the piece." (comment on article 2022 Crimean Bridge explosion)

Table 11: Summary of major issues found in articles produced by STORM.
Figure 7: Screenshot of the web application for evaluating the grounded article.
Taylor Hawkins
Oliver Taylor Hawkins (1972-2022) was an American musician, best known as the drummer for the rock
band Foo Fighters[1]. Born in Fort Walton, Texas, Hawkins’ love for music was ignited at a young age,
particularly after watching a Queen concert in 1982[2][3][5]. He kick-started his professional career
as the drummer for Alanis Morissette’s band during her world tour for the hit album ’Jagged Little
Pill’[8][9]. His talents were recognized by Dave Grohl, who invited him to join the Foo Fighters in
1997, marking the beginning of his impactful tenure with the band[7][8].
Hawkins was celebrated for his versatile drumming style, drawing influence from renowned drummers like
Roger Taylor, Neil Peart, Phil Collins, Alex Van Halen, and Stewart Copeland[14]. His performances,
marked by a unique energy and aggressive style of play, earned him recognition as one of the top rock
drummers of his era[15]. Apart from his role in the Foo Fighters, Hawkins’ passion for music saw
him involved in numerous side projects and collaborations, cementing his place in the world of rock
music[10].
Outside of his professional life, Hawkins was known for his resilience and dedication to his family.
Despite personal struggles, including a near-fatal drug overdose in 2001, Hawkins remained committed to
his musical career[4][9]. His legacy continues to inspire musicians and fans alike, as his contributions
to rock music, coupled with his indomitable spirit, made him an unforgettable icon in the music
industry[13].
Hawkins’ sudden death in 2022 while on tour in Bogotá, Colombia, sent shockwaves through the music
world[34]. Tributes poured in from around the globe, reflecting the respect and admiration Hawkins
had garnered during his lifetime[21][31]. His life and career were honored at a star-studded tribute
concert in Los Angeles, attesting to the enduring impact of his music[22].
# Early Life and Background
Oliver Taylor Hawkins, known as Taylor Hawkins, was born and raised in Fort Walton, Texas[3]. His
family moved to Laguna Beach, California when he was four years old[3]. He has two younger siblings,
a brother named Jason, and a sister named Heather[3]. As a child, Hawkins was particularly influenced
by his paternal grandmother, Josie Hawkins, who had grown up during the Great Depression and lived in
Jackson, Mississippi[1].
During his high school days at Laguna Beach High School, from where he graduated in 1990, he became
friends with Jon Davison, who later became the lead vocalist of the band Yes[2][3]. His interest in
music was nurtured from an early age, particularly after watching a Queen concert in 1982 which inspired
him to learn to play the drums[2][5]. He noted that music was a constant presence in his family home[5].
Despite facing certain hardships during his upbringing, including his mother’s struggles with "demons",
Hawkins pursued his musical ambitions[4]. He credits his older sister Heather for taking care of the
family during difficult times[4].
His first major musical experience came from playing drums for Alanis Morissette’s album, Jagged Little
Pill, and accompanying her on the subsequent tour[3]. This marked the beginning of his professional
career in the music industry.
# Career
Taylor Hawkins began his professional music career playing in Alanis Morissette’s band during her
18-month world tour in support of the hit album ’Jagged Little Pill’ from 1995 to 1997[8][9]. His
performances not only in the tour but also in the music videos for “You Oughta Know”, “All I Really Want”
and “You Learn” introduced him to the world of rock music and ultimately led to his meeting with Dave
Grohl[8]. Throughout this time, Hawkins contributed significantly to the band’s sound and performance,
transforming the songs from their original drum loop format to a rock-band vibe that resonated with
audiences[1][7].
In 1997, Hawkins was asked by Grohl to join the Foo Fighters, an invitation that he readily accepted[7][8].
At the time, Grohl thought it was a long shot to recruit Hawkins given that Morissette was at the height
of her career, but Hawkins’ desire to be a part of a rock band compelled him to make the move[7]. This
marked the beginning of Hawkins’ tenure as the drummer of the Foo Fighters, a role that he would play
until his passing[6][9].
Apart from his work with Morissette and the Foo Fighters, Hawkins had an array of other musical
experiences[10]. He drummed for Sass Jordan before joining Morissette’s touring band[10]. He was part
of an ad hoc drum supergroup called SOS Allstars and filled the void for Coheed and Cambria’s 2007
album after their drummer Josh Eppard left the group[10]. In addition, Hawkins formed his own side
project, the Coattail Riders, in 2005, through which he recorded his own music and took the project on
the road, performing in small clubs despite the Foo Fighters’ arena-status[7]. His son, Shane Hawkins,
has since taken on his father’s legacy, joining the Foo Fighters for a performance during the Boston
Calling Music Festival in 2023[6].
# Musical Style and Influences
Taylor Hawkins was a profound drummer, with his musical style and influences spreading across a wide
array of rock genres[11]. Known for his passionate fandom of groups that came before him, Hawkins
regularly expressed his admiration for bands like Rush, Genesis, and the Police, all of which featured
some of the greatest drummers in rock history like Neil Peart, Phil Collins, and Stewart Copeland[11].
He was heavily influenced by his love for classic rock, as evidenced by his performances, where he
covered songs from bands like Van Halen[11].
Hawkins drew influences from a variety of drumming styles, developing a signature style inspired by
greats like Roger Taylor, Neil Peart, Phil Collins, Alex Van Halen, and Stewart Copeland[14]. This
distinctive style and influence extended to his drum kit, which incorporated elements like rototoms
and concert toms[14].
Beyond his influences, Hawkins had a unique energy that made him stand out as a drummer. His performances
were recognized for their power, and he was known for his enthusiastic and aggressive style of play[15].
This earned him recognition as one of the top rock drummers of his time, with his passion for music
living on through his performances[14].
Through his career, Hawkins left an indelible mark on rock music, through his distinct style, passion,
and contributions to the music industry[13]. His love for music and dedication to his craft made him
an unforgettable icon in the world of rock music[13].
# Personal Life
Taylor Hawkins married Alison Hawkins, an American celebrity and entrepreneur, in 2005[18]. The couple
had three children, Oliver, Annabelle, and Everleigh[19]. Hawkins’ commitment to his family was evident;
in fact, he even wrote a song for his middle child, Annabelle[9].
In his personal life, Hawkins had also struggled with drug use, which nearly claimed his life in a 2001
overdose[9][7][4]. However, he managed to overcome this challenge, and later expressed gratitude for
the experience as a lesson that allowed him to realize the destructive path he was on[7].
Outside of his main role in the Foo Fighters, Hawkins also pursued various side projects including the
Birds of Satan, NHC, and Chevy Metal. His motivation for such ventures was a constant drive to create
and his love for music[7]. Hawkins was also known for his unabashed fanboy nature, often vocalizing
his admiration for fellow musicians and his heroes[7].
# Legacy and Impact
Taylor Hawkins was known for his raw and authentic drumming style, described as "courageous, damaged
and unflinchingly authentic"[20]. His work with the Foo Fighters, as well as his various collaborations
and side projects, made him a celebrated figure in rock ‘n’ roll[10].
Hawkins’ death in 2022 was met with heartfelt tributes from colleagues and fans around the world.
Notable tributes came from rock legends like Roger Taylor of Queen, who considered Hawkins as a kind,
brilliant man and an inspirational mentor, likening his death to "losing a younger favourite brother"[21].
Similarly, Led Zeppelin’s Jimmy Page admired his technique, energy and spirited enthusiasm[21].
An LA tribute concert held in his honor included guest drummers like Lars Ulrich of Metallica, Travis
Barker of blink-182, and Brad Wilk of Rage Against the Machine. Singers like Miley Cyrus and Alanis
Morissette also performed at the concert[22].
Apart from his music, Taylor Hawkins also contributed to charities Music Support and MusiCares, both of
which were chosen by the Hawkins family[23]. He had received numerous accolades throughout his career,
including 27 Grammy nominations, of which he won 14[2]. In 2021, the Foo Fighters were inducted into
the Rock and Roll Hall of Fame[9].
# Discography
Taylor Hawkins also led a notable music career through his own side projects and collaborations[10].
Aside from his work with the Foo Fighters, Hawkins formed and fronted the band Taylor Hawkins & The
Coattail Riders, a project which originated from jamming sessions with his friend Drew Hester[10].
### Taylor Hawkins & The Coattail Riders
Taylor Hawkins & The Coattail Riders, a band formed in 2004, have released three albums and their
music spans genres including Hard Rock, Art Rock, and Alternative Rock[24][25][26]. The band grew from
an initial casual jamming session, gradually evolving into a more formal arrangement that led to the
production of record albums. Notably, these albums featured guest appearances by renowned musicians
such as Dave Grohl, Queen’s Brian May and Roger Taylor, The Cars’ Elliot Easton, Perry Farrell, and
Jon Davison, who is a school friend of Hawkins’[10].
### Red Light Fever
Red Light Fever, released on April 19, 2010, was the band’s first album[29][30]. Prior to its release,
Hawkins revealed in an interview that the album had completed the recording and production stages, but
its title and release date were yet to be determined[29]. Red Light Fever was recorded at the Foo
Fighters’ Studio 606 in California and featured guest musicians such as Brian May and Roger Taylor of
Queen, Dave Grohl of Foo Fighters, and Elliot Easton of The Cars[29][30].
## Get the Money
Get the Money, the third album from Taylor Hawkins & The Coattail Riders, was released on November 8,
2019[29]. The album’s first single, "Crossed the Line", released on October 15, 2019, featured Dave
Grohl and Jon Davison, the frontman of Yes[29]. The music video for the single "I Really Blew It" also
featured appearances from Grohl and Perry Farrell[29].
# Collaborations and Guest Appearances
Throughout his career, Taylor Hawkins collaborated with various prominent artists and bands.
The
Coattail Riders’ albums notably featured appearances from luminaries such as Brian May and Roger Taylor
of Queen, Chrissie Hynde, Nancy Wilson of Heart, Sex Pistol Steve Jones and James Gang’s Joe Walsh[28].
Hawkins also fronted another group, The Birds of Satan, which evolved from his heavy rock covers band,
Chevy Metal[28].
Despite his diverse musical engagements, Hawkins always maintained a close allegiance with the Foo
Fighters, which remained the center of his music life[7][28].
# Tragic Passing
Taylor Hawkins, the esteemed drummer of the alt-rock band Foo Fighters, passed away suddenly on March
25, 2022, while on tour with his band in Bogotá, Colombia[34]. The official cause of death was cardiac
arrest, though inquiries were raised concerning the presence of drugs in his system and their potential
contribution to his death[33][34]. On the night of his passing, paramedics were called to the Four
Seasons hotel in Bogotá due to reports of chest pain from an unnamed guest, later revealed to be
Hawkins[34]. Unfortunately, resuscitation efforts were unsuccessful, and Hawkins was declared dead at
the scene[34].
The news of Hawkins’ sudden demise was announced on the morning of March 25th, 2022, which left the music
world in shock[32]. The band confirmed the news with a short statement, expressing their devastation
at the loss of Hawkins, whose "musical spirit and infectious laughter" would live on forever[32].
As a result of Hawkins’ untimely passing, the band canceled their ongoing South American tour[33]. The
festival stage at the Estéreo Picnic Festival, where the Foo Fighters were scheduled to perform that
night, was transformed into a candlelight vigil in memory of Hawkins[33].
## Tributes and Remembrances
In the wake of Hawkins’ death, tributes from fans and colleagues alike poured in from around the
world[21][31]. Among the many paying their respects were legendary rock and roll musicians like Roger
Taylor, the drummer of Queen, who Hawkins credited with inspiring his own career behind the drum set[21].
In heartfelt social media posts, Taylor described Hawkins as an "inspirational mentor" and a "kind
brilliant man"[21], while Led Zeppelin’s Jimmy Page reminisced about sharing the stage with Hawkins
and praised his "technique, energy and spirited enthusiasm"[21].
There were also numerous onstage tributes to Hawkins. Notably, Miley Cyrus expressed her grief and sent
peaceful wishes to the Foo Fighters and the Hawkins family during a performance at Lollapalooza[31].
Similarly, Liam Gallagher of Oasis dedicated one of the band’s biggest hits to Hawkins during a concert
at the Royal Albert Hall in London[31].
Fans gathered outside the hotel where Hawkins died, lighting candles, leaving flowers, and singing the
band’s songs in his honor[31].
Hawkins’ life and career were celebrated in a star-studded tribute concert in Los Angeles, which saw
performances from over 50 musicians, including his former bands and colleagues from Def Leppard, Queen,
and Foo Fighters[22].
Table 12: STORM’s generated article for “Taylor Hawkins”. “#”, “##” indicate the section title and subsection title
respectively. Numbers in brackets indicate the cited references.