Assisting in Writing Wikipedia-like Articles From Scratch
with Large Language Models
Yijia Shao
Yucheng Jiang Theodore A. Kanell Peter Xu
Omar Khattab Monica S. Lam
Stanford University
{shaoyj, yuchengj, tkanell, peterxu, okhattab}@stanford.edu
lam@cs.stanford.edu
Abstract
We study how to apply large language models
to write grounded and organized long-form ar-
ticles from scratch, with comparable breadth
and depth to Wikipedia pages. This underex-
plored problem poses new challenges at the
pre-writing stage, including how to research
the topic and prepare an outline prior to writ-
ing. We propose STORM, a writing system
for the Synthesis of Topic Outlines through
Retrieval and Multi-perspective Question Ask-
ing. STORM models the pre-writing stage by
(1) discovering diverse perspectives in research-
ing the given topic, (2) simulating conversa-
tions where writers carrying different perspec-
tives pose questions to a topic expert grounded
on trusted Internet sources, and (3) curating the
collected information to create an outline.
For evaluation, we curate FreshWiki, a dataset
of recent high-quality Wikipedia articles, and
formulate outline assessments to evaluate the
pre-writing stage. We further gather feedback
from experienced Wikipedia editors. Com-
pared to articles generated by an outline-
driven retrieval-augmented baseline, more of
STORM’s articles are deemed to be organized
(by a 25% absolute increase) and broad in cov-
erage (by 10%). The expert feedback also
helps identify new challenges for generating
grounded long articles, such as source bias
transfer and over-association of unrelated facts.
1 Introduction
Large language models (LLMs) have demonstrated
impressive writing capabilities (Yang et al., 2023;
Pavlik, 2023; Wenzlaff and Spaeth, 2022; Fitria,
2023), but it is unclear how we can use them to
write grounded, long-form articles, like full-length
Wikipedia pages. Such expository writing, which
seeks to inform the reader on a topic in an or-
ganized manner (Weaver III and Kintsch, 1991;
Balepur et al., 2023), requires thorough research
and planning in the pre-writing stage (Rohman,
[Figure 1 graphic: the pipeline from a topic (e.g., "2022 Winter Olympics Opening Ceremony") through the pre-writing stage (references and outline) to the full-length article, with example questions for (A) Direct Prompting (basic "when", "where", and "how many" questions), (B) Perspective-Guided Question Asking (an event planner asking about transportation arrangements and the budget), and (C) Conversational Question Asking (follow-up questions about how the order of participating countries is determined).]
Figure 1: We explore writing Wikipedia-like articles
from scratch, which demands a pre-writing stage before
producing the article. In this stage, simpler approaches
like Direct Prompting have limited planning capacity. In
contrast, STORM researches the topic via perspective-
guided question asking in simulated conversations.
1965), even before the actual writing process can
start. However, prior work on generating Wikipedia
articles (Banerjee and Mitra, 2015; Minguillón
et al., 2017; Liu et al., 2018; Fan and Gardent,
2022) has generally bypassed the pre-writing stage:
for instance, Liu et al. (2018) presume reference
documents are provided in advance, while Fan and
Gardent (2022) assume an article outline is avail-
able and focus on expanding each section. These
assumptions do not hold in general, as collecting
references and crafting outlines demand advanced
information literacy skills (Doyle, 1994) to iden-
tify, evaluate, and organize external sources - a task
that is challenging even for experienced writers.
Automating this process can help individuals begin
in-depth learning about a topic and greatly reduce
the expert hours needed for expository writing.
We explore these challenges by focusing on how
to generate Wikipedia-like articles from scratch.
We decompose this problem into two tasks. The
first is to conduct research to generate an outline,
i.e., a list of multi-level sections, and collect a set of
reference documents. The second uses the outline
and the references to produce the full-length arti-
cle. Such a task decomposition mirrors the human
writing process which usually includes phases of
pre-writing, drafting, and revising (Rohman, 1965;
Munoz-Luna, 2015).
As pre-trained language models inherently pos-
sess a wealth of knowledge, a direct approach is to
rely on their parametric knowledge for generating
outlines or even entire articles (Direct Gen). How-
ever, this approach is limited by a lack of details
and hallucinations (Xu et al., 2023), particularly in
addressing long-tail topics (Kandpal et al., 2023).
This underscores the importance of leveraging ex-
ternal sources, and current strategies often involve
retrieval-augmented generation (RAG), which cir-
cles back to the problem of researching the topic in
the pre-writing stage, as much information cannot
be surfaced through simple topic searches.
Human learning theories (Tawfik et al., 2020;
Booth et al., 2003) highlight the role of asking
effective questions in information acquisition. Although
instruction-tuned models (Ouyang et al., 2022) can
be prompted directly to generate questions, we find
that they typically produce basic “What”, “When”,
and “Where” questions (Figure 1 (A)) which often
only address surface-level facts about the topic. To
endow LLMs with the capacity to conduct better
research, we propose the STORM paradigm for
the Synthesis of Topic Outlines through Retrieval
and Multi-perspective Question Asking.
The design of STORM is based on two hypothe-
ses: (1) diverse perspectives lead to varied ques-
tions; (2) formulating in-depth questions requires
iterative research. Building upon these hypotheses,
STORM employs a novel multi-stage approach. It
first discovers diverse perspectives by retrieving
and analyzing Wikipedia articles from similar top-
ics and then personifies the LLM with specific per-
spectives for question asking (Figure 1 (B)). Next,
to elicit follow-up questions for iterative research
(Figure 1 (C)), STORM simulates multi-turn con-
versations where the answers to the generated ques-
tions are grounded on the Internet. Finally, based
on the LLM’s internal knowledge and the collected
information, STORM creates an outline that can
be expanded section by section to develop a full-
length Wikipedia-like article.
We evaluate STORM using our FreshWiki
dataset (§2.1) which curates recent, high-quality
Wikipedia articles to avoid data leakage during pre-
training. 1 To facilitate the study of the pre-writing
stage, we define metrics for evaluating the outline
quality against human-written articles.
We further invited a group of experienced
Wikipedia editors for expert evaluation. The ed-
itors found STORM outperforms an outline-driven
RAG baseline, especially regarding the breadth and
organization of the articles. They also identified
challenges for future research, including address-
ing cases where: (1) the bias on the Internet affects
the generated articles; (2) LLMs fabricate connec-
tions between unrelated facts. These challenges
present new frontiers to grounded writing systems.
Our main contributions include:
• To evaluate the capacity of LLM systems at
generating long-form grounded articles from
scratch, and the pre-writing challenge in par-
ticular, we curate the FreshWiki dataset and
establish evaluation criteria for both outline
and final article quality.
• We propose STORM, a novel system that au-
tomates the pre-writing stage. STORM re-
searches the topic and creates an outline by
using LLMs to ask incisive questions and re-
trieving trusted information from the Internet.
• Both automatic and human evaluation demon-
strate the effectiveness of our approach. Ex-
pert feedback further reveals new challenges
in generating grounded long-form articles.
2 FreshWiki
We study generating Wikipedia-like articles from
scratch, placing emphasis on the pre-writing
stage (Rohman, 1965), which involves the demand-
ing sub-tasks of gathering and curating relevant
information (“research”). This models the human
1 Our resources and code are released at https://github.com/stanford-oval/storm.
                            Domain   Scope          Given Outline?   Given Refs?
Balepur et al. (2023)       One      One para.      /                Yes
Qian et al. (2023)          All      One para.      /                No
Fan and Gardent (2022)      One      Full article   Yes              No
Liu et al. (2018)           All      One para.      /                Yes
Sauper and Barzilay (2009)  Two      Full article   No               No
Ours                        All      Full article   No               No

Table 1: Comparison of different Wikipedia generation setups in existing literature. Generating one paragraph does not need an article outline.
writing approach which has prompted some educa-
tors to view Wikipedia article writing as an educa-
tional exercise for academic training (Tardy, 2010).
Table 1 compares our work against prior bench-
marks for Wikipedia generation. Existing work
has generally focused on evaluating the generation
of shorter snippets (e.g., one paragraph), within a
narrower scope (e.g., a specific domain or two), or
when an explicit outline or reference documents
are supplied. A notable example is WikiSum (Liu
et al., 2018), which treats generating Wikipedia ar-
ticles as a multi-document summarization problem,
with respect to the reference documents.
Our setup emphasizes the capability of long-
form grounded writing systems to research and
curate content. Specifically, given a topic t, the
task is to find a set of references R and generate
a full-length article S = s_1 s_2 ... s_n, where each
sentence s_i cites a list of documents in R. 2
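To make this setup concrete, the following is a minimal sketch of the task's input/output structure; the class and field names are illustrative assumptions rather than part of STORM's released code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Reference:
    """One trusted source in the reference set R."""
    url: str
    text: str = ""

@dataclass
class CitedSentence:
    """One sentence s_i of the article, citing documents in R by index."""
    text: str
    citations: List[int] = field(default_factory=list)

@dataclass
class ArticleTask:
    """The task interface: a topic t, the collected references R, and the article S."""
    topic: str
    references: List[Reference] = field(default_factory=list)
    sentences: List[CitedSentence] = field(default_factory=list)
```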
2.1 The FreshWiki Dataset
Creating a new Wikipedia-like article demands not
only fluent writing but also good research skills. As
modern LLMs are generally trained on Wikipedia
text, we mitigate data leakage by explicitly seeking
out recent Wikipedia articles that were created (or
very heavily edited) after the training cutoff of the
LLMs we test. Our process can be repeated at
future dates when new LLMs emerge.
To apply our date criteria, we focus on the top
100 most-edited pages, based on edit counts, for
each month from February 2022 to September
2023 3 . To ensure high-quality references, we filter
these articles to keep only those having B-class
quality or above assessed by ORES 4 . We also ex-
2 In practice, S also includes organizational elements such as section and subsection titles, which do not require citations.
3 Obtained from https://wikimedia.org/api/rest_v1/metrics/edited-pages/top-by-edits/en.wikipedia/all-editor-types/content/{year}/{month}/all-days
4 https://www.mediawiki.org/wiki/ORES
clude list articles 5 and articles that have no sub-
sections. While high-quality Wikipedia articles
usually contain structured data (e.g., tables) and are
multi-modal, we only consider the plain text com-
ponent in constructing the dataset to simplify our
task. More details of the dataset are in Appendix A.
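As a concrete illustration, the sketch below queries the Wikimedia endpoint from footnote 3 for a month's most-edited pages; the response field names (e.g., page_title, edit_count) are assumptions about the API payload rather than details given in the paper.

```python
import requests

def top_edited_pages(year: int, month: int):
    """Fetch the monthly ranking of most-edited English Wikipedia pages."""
    url = ("https://wikimedia.org/api/rest_v1/metrics/edited-pages/top-by-edits/"
           f"en.wikipedia/all-editor-types/content/{year}/{month:02d}/all-days")
    data = requests.get(url, headers={"User-Agent": "freshwiki-sketch"}, timeout=30).json()
    # Navigate defensively: the ranking is assumed to be nested under items -> results -> top.
    top = data.get("items", [{}])[0].get("results", [{}])[0].get("top", [])
    return [(entry.get("page_title"), entry.get("edit_count")) for entry in top]

# Example: the most-edited pages of September 2023 (the last month in our range).
print(top_edited_pages(2023, 9)[:5])
```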
2.2 Outline Creation and Evaluation
A full-length article is hard to generate or evalu-
ate (Xu et al., 2023; Krishna et al., 2023). When
human educators teach students academic writing,
they sometimes supervise students at the outline
stage (Eriksson and Mäkitalo, 2015) because an
extensive outline indicates a comprehensive under-
standing of the topic and provides a solid founda-
tion for writing the full-length article (Dietz and
Foley, 2019). Inspired by this, we decompose the
generation of S into two stages. In the pre-writing
stage, we require the system to create an outline
O, which is defined as a list of multi-level section
headings 6 . In the writing stage, the system uses
the topic t, the references R, and an outline O to
produce the full-length article S.
To evaluate the outline coverage, we introduce
two metrics: heading soft recall and heading en-
tity recall. These metrics compare the multi-level
section headings of the human-written article, con-
sidered as ground truth, and those in O. Recog-
nizing that an exact match between elements in
these two sets of headings is unnecessary, we cal-
culate the heading soft recall (Fränti and Mariescu-
Istodor, 2023) using cosine similarity derived from
Sentence-BERT (Reimers and Gurevych, 2019) em-
beddings of the headings (details in Appendix C.1).
We also compute the heading entity recall which
is quantified as the percentage of named entities in
human-written article headings covered by O. We
extract entities with FLAIR named entity recogni-
tion (NER) (Akbik et al., 2019).
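A minimal sketch of the heading entity recall computation, assuming the FLAIR "ner" tagger named above; the soft-recall aggregation itself is specified in Appendix C.1.

```python
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("ner")  # FLAIR English 4-class NER model

def heading_entities(headings):
    """Collect the named-entity surface forms appearing in a list of headings."""
    entities = set()
    for heading in headings:
        sentence = Sentence(heading)
        tagger.predict(sentence)
        entities.update(span.text.lower() for span in sentence.get_spans("ner"))
    return entities

def heading_entity_recall(human_headings, outline_headings):
    """Percentage of entities in the human-written headings covered by the outline O."""
    gold = heading_entities(human_headings)
    pred = heading_entities(outline_headings)
    return 100.0 if not gold else 100.0 * len(gold & pred) / len(gold)
```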
3 Method
We present STORM to automate the pre-writing
stage by researching a given topic via effective
question asking (§3.1, §3.2) and creating an out-
line (§3.3). The outline will be extended to a full-
length article grounded on the collected references
5 https://en.wikipedia.org/wiki/Wikipedia:Stand-alone_lists
6 Since language models process and produce sequences, we can linearize O by adding "#" to indicate section titles, "##" to indicate subsection titles, etc.
[Figure 2 graphic: given a topic t, STORM (1) surveys related articles and (2) identifies a set of perspectives P; a perspective-guided Wikipedia-writer role and an expert role then (3) read and ask questions, (4) split each question q into search queries, (5) search and sift trusted sources, and (6) synthesize an answer a, producing conversations {C_1, ..., C_N} and references R; finally the LLM (7) directly generates a draft outline O_D and (8) refines it into the outline O.]

Figure 2: The overview of STORM that automates the pre-writing stage. Starting with a given topic, STORM identifies various perspectives on covering the topic by surveying related Wikipedia articles (steps 1-2). It then simulates conversations between a Wikipedia writer who asks questions guided by the given perspective and an expert grounded on trustworthy online sources (steps 3-6). The final outline is curated based on the LLM's intrinsic knowledge and the gathered conversations from different perspectives (steps 7-8).
(§3.4). Figure 2 gives an overview of STORM and
we include the pseudo code in Appendix B.
3.1 Perspective-Guided Question Asking
Rohman (1965) defines pre-writing as the stage
of discovery in the writing process. In parallel
with stakeholder theory in business (Freeman et al.,
2010), where diverse stakeholders prioritize vary-
ing facets of a company, individuals with distinct
perspectives may concentrate on different aspects
when researching the same topic and discover mul-
tifaceted information. Further, the specific perspec-
tives can serve as prior knowledge, guiding individ-
uals to ask more in-depth questions. For example,
an event planner might ask about the “transporta-
tion arrangements” and “budget” for “the 2022
Winter Olympics opening ceremony”, whereas a
layperson might ask more general questions about
the event’s basic information (Figure 1 (A)).
Given the input topic t, STORM discovers differ-
ent perspectives by surveying existing articles from
similar topics and uses these perspectives to control
the question asking process. Specifically, STORM
prompts an LLM to generate a list of related top-
ics and subsequently extracts the tables of contents
from their corresponding Wikipedia articles, if such
articles can be obtained through the Wikipedia API 7
(Figure 2, step 1). These tables of contents are con-
catenated to create a context to prompt the LLM
to identify N perspectives P = {p_1, ..., p_N} that
7 https://pypi.org/project/Wikipedia-API/
can collectively contribute to a comprehensive ar-
ticle on t (Figure 2, step 2). To ensure that the basic
information about t is also covered, we add p_0 as
“basic fact writer focusing on broadly covering the
basic facts about the topic” into P. Each perspec-
tive p ∈ P will be utilized to guide the LLM in the
process of question asking in parallel.
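The following sketch illustrates this perspective-discovery step using the Wikipedia-API package from footnote 7; the ask_llm helper and the prompt wording are placeholders, not STORM's exact prompts.

```python
import wikipediaapi

def table_of_contents(title, wiki):
    """Return the nested section headings of a Wikipedia page, if it exists."""
    page = wiki.page(title)
    if not page.exists():
        return None

    def collect(sections, depth=1):
        lines = []
        for sec in sections:
            lines.append("#" * depth + " " + sec.title)
            lines.extend(collect(sec.sections, depth + 1))
        return lines

    return "\n".join(collect(page.sections))

def discover_perspectives(topic, ask_llm, n_perspectives=5):
    # Recent versions of the wikipediaapi package require a user-agent string.
    wiki = wikipediaapi.Wikipedia(user_agent="storm-sketch", language="en")
    related = ask_llm(f"List Wikipedia topics closely related to: {topic}").splitlines()
    tocs = [toc for t in related if (toc := table_of_contents(t.strip(), wiki))]
    context = "\n\n".join(tocs)
    perspectives = ask_llm(
        f"Given these tables of contents:\n{context}\n"
        f"Identify {n_perspectives} perspectives that together could cover '{topic}'."
    ).splitlines()
    # p_0: always include a basic-fact writer (Section 3.1).
    return ["basic fact writer focusing on broadly covering the basic facts "
            "about the topic"] + perspectives
```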
3.2 Simulating Conversations
The theory of questions and question asking (Ram,
1991) highlights that while answers to existing
questions contribute to a more comprehensive
understanding of a topic, they often simultane-
ously give rise to new questions. To kick off this
dynamic process, STORM simulates a conversa-
tion between a Wikipedia writer and a topic ex-
pert. In the i-th round of the conversation, the
LLM-powered Wikipedia writer generates a sin-
gle question q_i based on the topic t, its assigned
perspective p ∈ P, and the conversation history
{q_1, a_1, ..., q_{i-1}, a_{i-1}}, where a_j denotes the sim-
ulated expert’s answer. The conversation history
enables the LLM to update its understanding of the
topic and ask follow-up questions. In practice, we
limit the conversation to at most M rounds.
To ensure that the conversation history provides
factual information, we use trusted sources from
the Internet to ground the answer a_i to each query
q_i. Since q_i can be complicated, we first prompt
the LLM to break down q_i into a set of search
queries (Figure 2, step 4), and the search results will
be evaluated using a rule-based filter according to
the Wikipedia guideline 8 to exclude untrustworthy
sources (Figure 2, step 5). Finally, the LLM synthe-
sizes the trustworthy sources to generate the answer
a_i, and these sources will also be added to R for
full article generation (§3.4).
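A minimal sketch of the simulated conversation loop; ask_question, decompose_to_queries, search_and_filter, and synthesize_answer stand in for STORM's LLM prompts, search calls, and rule-based source filter.

```python
def simulate_conversation(topic, perspective, ask_question, decompose_to_queries,
                          search_and_filter, synthesize_answer, max_rounds=5):
    """Run one Wikipedia-writer / expert conversation for a single perspective."""
    history = []      # [(q_1, a_1), ..., (q_{i-1}, a_{i-1})]
    references = []   # trusted sources added to R for full-article generation
    for _ in range(max_rounds):
        question = ask_question(topic, perspective, history)
        if not question:
            break
        queries = decompose_to_queries(question)        # Figure 2, step 4
        sources = search_and_filter(queries)            # step 5: drop untrusted URLs
        answer = synthesize_answer(question, sources)   # step 6
        references.extend(sources)
        history.append((question, answer))
    return history, references
```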
3.3 Creating the Article Outline
After thoroughly researching the topic through
N + 1 simulated conversations, denoted as
{C_0, C_1, ..., C_N}, STORM creates an outline before
the actual writing starts. To fully leverage the inter-
nal knowledge of LLMs, we first prompt the model
to generate a draft outline O_D given only the topic
t (Figure 2, step 7). O_D typically provides a general
but organized framework. Subsequently, the LLM
is prompted with the topic t, the draft outline O_D,
and the simulated conversations {C_0, C_1, ..., C_N}
to refine the outline (Figure 2, step 8). This results
in an improved outline O which will be used for
producing the full-length article.
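A minimal sketch of this draft-then-refine procedure; ask_llm is a placeholder LLM call and the prompt wording is illustrative only.

```python
def create_outline(topic, conversations, ask_llm):
    # Step 7: draft outline from the LLM's parametric knowledge alone.
    draft = ask_llm(f"Write an outline (use '#' for sections, '##' for subsections) "
                    f"for a Wikipedia page about: {topic}")
    # Step 8: refine the draft using the gathered conversations {C_0, ..., C_N}.
    dialogue_text = "\n\n".join(
        "\n".join(f"Q: {q}\nA: {a}" for q, a in conv) for conv in conversations
    )
    refined = ask_llm(f"Topic: {topic}\nDraft outline:\n{draft}\n\n"
                      f"Information gathered from research:\n{dialogue_text}\n\n"
                      f"Improve the outline so it covers the gathered information.")
    return refined
```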
3.4 Writing the Full-Length Article
Building upon the references R collected and the
outline O developed during the pre-writing stage,
the full-length article can be composed section by
section. Since it is usually impossible to fit the
entire R within the context window of the LLM,
we use the section title and the headings of all its
subsections to retrieve relevant documents from
R based on semantic similarity calculated from
Sentence-BERT embeddings. With the relevant in-
formation at hand, the LLM is then prompted to
generate the section with citations. Once all sec-
tions are generated, they are concatenated to form
the full-length article. Since the sections are gen-
erated in parallel, we prompt the LLM with the
concatenated article to delete repeated information
to improve coherence. Furthermore, in alignment
with Wikipedia’s stylistic norms, the LLM is also
utilized to synthesize a summary of the entire arti-
cle, forming the lead section at the beginning.
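A minimal sketch of the per-section retrieval and writing loop, assuming Sentence-BERT embeddings via the sentence-transformers library; write_section stands in for the prompted LLM call that generates a cited section.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-MiniLM-L6-v2")

def retrieve_for_section(section_headings, reference_texts, top_k=10):
    """Rank documents in R by cosine similarity to the section's headings."""
    query = " ; ".join(section_headings)   # section title plus all sub-headings
    query_emb = encoder.encode(query, convert_to_tensor=True)
    doc_embs = encoder.encode(reference_texts, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_embs)[0]
    ranked = scores.argsort(descending=True)[:top_k]
    return [reference_texts[int(i)] for i in ranked]

def write_article(outline_sections, reference_texts, write_section):
    """Generate each section with its retrieved documents, then concatenate."""
    sections = []
    for headings in outline_sections:      # each item: [title, sub-heading, ...]
        docs = retrieve_for_section(headings, reference_texts)
        sections.append(write_section(headings, docs))   # generate with citations
    return "\n\n".join(sections)
```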
4 Experiments

4.1 Article Selection

STORM is capable of researching complicated top-
ics and writing long articles from detailed outlines.
However, in this controlled experiment, we limit
the final output to at most 4000 tokens (roughly
3000 words). For a meaningful comparison, we
randomly select 100 samples from the FreshWiki
dataset (see §2.1) that have human-written articles
not exceeding 3000 words.
4.2 Automatic Metrics
As discussed in §2.2, we evaluate the outline qual-
ity to assess the pre-writing stage by calculating
the heading soft recall and heading entity recall. A
higher recall score signifies a more comprehensive
outline relative to the human-written article.
To assess the full-length article quality, we adopt
ROUGE scores (Lin, 2004) and compute the entity
recall at the article level based on FLAIR NER
results. Moreover, based on Wikipedia criteria 9 ,
we evaluate the article from the aspects of (1) In-
terest Level, (2) Coherence and Organization, (3)
Relevance and Focus, (4) Coverage, and (5) Verifia-
bility. For aspects (1)-(4), we use Prometheus (Kim
et al., 2023), a 13B evaluator LLM, to score the arti-
cle based on a 5-point rubric collaboratively devel-
oped with two experienced Wikipedia editors (see
Appendix C.2). For verifiability, we calculate the
citation recall and citation precision based on the
definition in Gao et al. (2023). We use Mistral 7B-
Instruct (Jiang et al., 2023a) to examine whether
the cited passages entail the generated sentence.
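A simplified sketch of this citation check; it counts a citation as precise only if it alone entails the sentence, whereas Gao et al. (2023) give the full definitions, and entails() stands in for the Mistral 7B-Instruct judgment.

```python
def citation_quality(article_sentences, entails):
    """article_sentences: list of (sentence, [cited_passage, ...]) pairs;
    entails(premise, claim) -> bool is the NLI-style judge (here, Mistral 7B-Instruct)."""
    cited = [(s, ps) for s, ps in article_sentences if ps]
    # Recall: the concatenation of all cited passages should entail the sentence.
    recall_hits = sum(1 for s, ps in cited if entails(" ".join(ps), s))
    # Precision (simplified): each individual citation should entail the sentence.
    precision_hits, total = 0, 0
    for s, ps in cited:
        for p in ps:
            total += 1
            precision_hits += int(entails(p, s))
    recall = 100.0 * recall_hits / len(cited) if cited else 0.0
    precision = 100.0 * precision_hits / total if total else 0.0
    return recall, precision
```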
4.3 Baselines
As prior works use different setups and do not use
LLMs, they are hard to compare directly. Instead,
we use the following three LLM-based baselines.
1. Direct Gen, a baseline that directly prompts
the LLM to generate an outline, which is then
used to generate the full-length article.
2. RAG, a retrieval-augmented generation base-
line that searches with the topic and uses the
searched results together with the topic t to
generate an outline or the entire article.
3. Outline-driven RAG (oRAG), which is iden-
tical to RAG in outline creation, but further
searches additional information with section
titles to generate the article section by section.
4.4 STORM Implementation

We build STORM with zero-shot prompting us-
ing the DSPy framework (Khattab et al., 2023).
Appendix B includes the pseudo code and corre-
sponding prompts. The hyperparameters N and M
8 https://en.wikipedia.org/wiki/Wikipedia:Reliable_sources
9 https://en.wikipedia.org/wiki/Wikipedia:Good_article_criteria
                       Comparison with Human-written Articles      Rubric Grading
                       ROUGE-1   ROUGE-L   Entity Recall   Interest Level   Organization   Relevance   Coverage
Direct Gen             25.62     12.63     5.08            2.87             4.60           3.10        4.16
RAG                    28.52     13.18     7.57            3.14             4.22           3.05        4.08
oRAG                   44.26     16.51     12.57           3.90             4.79           4.09        4.70
STORM                  45.82     16.70     14.10†          3.99†            4.82           4.45†       4.88†
  w/o Outline Stage    26.77     12.77     7.39            3.33             4.87           3.35        4.37

Table 2: Results of automatic article quality evaluation. † denotes significant differences (p < 0.05) from a paired
t-test between STORM and the best baseline, i.e., oRAG. The rubric grading uses a 1-5 scale.
                               Heading Soft Recall   Heading Entity Recall
GPT-3.5   Direct Gen           80.23                 32.39
          RAG/oRAG             73.59                 33.85
          RAG-expand           74.40                 33.85
          STORM                86.26†                40.52†
            w/o Perspective    84.49                 40.12
            w/o Conversation   77.97                 31.98
GPT-4     Direct Gen           87.66                 34.78
          RAG/oRAG             89.55                 42.38
          RAG-expand           91.36                 43.53
          STORM                92.73†                45.91
            w/o Perspective    92.39                 42.70
            w/o Conversation   88.75                 39.30

Table 3: Results of outline quality evaluation (%). † denotes significant differences (p < 0.05) from a paired
t-test between STORM and baselines.
in STORM are both set as 5. We use the chat
model gpt-3.5-turbo for question asking and
use gpt-3.5-turbo-instruct for other parts of
STORM. We also experiment with using gpt-4 for
drafting and refining the outline (Figure 2, steps 7-8).
For reported results, the simulated topic expert in
STORM is grounded on the You.com search API 10 ,
although the proposed pipeline is compatible with
other search engines. The ground truth Wikipedia
article is excluded from the search results.
For final article generation, we only report the
results using gpt-4 as gpt-3.5 is not faithful to
sources when generating text with citations (Gao
et al., 2023). We set temperature as 1.0 and top_p
as 0.9 for all experiments.
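As an illustration of the zero-shot DSPy setup, the sketch below defines one signature and runs it with dspy.Predict; the field names and wording are illustrative, while the actual prompts are listed in Appendix B.

```python
import dspy

# Configure the LM backend. STORM uses gpt-3.5-turbo for question asking and
# gpt-3.5-turbo-instruct elsewhere; this client configuration is an assumption
# for illustration, not the released setup.
lm = dspy.OpenAI(model="gpt-3.5-turbo", model_type="chat",
                 max_tokens=500, temperature=1.0, top_p=0.9)
dspy.settings.configure(lm=lm)

class AskQuestion(dspy.Signature):
    """Ask one informative question about the topic from the assigned perspective."""
    topic = dspy.InputField()
    perspective = dspy.InputField()
    conversation = dspy.InputField(desc="the dialogue history so far")
    question = dspy.OutputField()

ask = dspy.Predict(AskQuestion)
pred = ask(topic="2022 Winter Olympics opening ceremony",
           perspective="event planner focused on ceremony logistics",
           conversation="(no prior turns)")
print(pred.question)
```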
5 Results and Analysis

5.1 Main Results
We use outline coverage as a proxy to assess the pre-
writing stage (see §2.2). Table 3 shows the heading
soft recall and entity recall. Outlines directly gen-
erated by LLMs (Direct Gen) already demonstrate
10 https://documentation.you.com/api-reference/search
high heading soft recall, indicating LLMs’ ability
to grasp high-level aspects of a topic through their
rich parametric knowledge. However, STORM, by
asking effective questions to research the topic, can
create higher recall outlines that cover more topic-
specific aspects. Notably, although RAG leverages
additional information, presenting unorganized in-
formation in the context window makes outline
generation more challenging for the weaker model,
i.e., GPT-3.5, leading to worse performance. To test
the limit of the RAG baseline, we further expand
the retrieved sources by starting with the outline
produced by RAG, using its section titles as search
queries to collect more sources, and inputting the
newly collected sources together with the initial
outline to the LLM to generate a polished outline. This
modified approach is referred to as “RAG-expand”
in Table 3. The experiment results indicate that
even though having an additional round of search
and refinement can improve the outline produced
by RAG, our proposed STORM still surpasses its
performance.
We further evaluate the full-length article quality.
As shown in Table 2, oRAG significantly outper-
forms RAG, highlighting the effectiveness of using
outlines for structuring full-length article genera-
tion. Despite this method’s advantages in leverag-
ing retrieval and outlining, our approach still out-
performs it. The effective question asking mecha-
nism enhances the articles with greater entity recall.
The evaluator LLM also rates these articles with sig-
nificantly higher scores in the aspects of “Interest
Level”, “Relevance and Focus”, and “Coverage”.
Nonetheless, we acknowledge the possibility of
the evaluator LLM overrating machine-generated
text. Our careful human evaluation (§6) reveals
that STORM still has much room for improvement.
Although this work primarily focuses on the pre-
writing stage and does not optimize generating text
with citations, we still examine the citation quality
of articles produced by our approach. As reported
         Citation Recall   Citation Precision
STORM    84.83             85.18

Table 4: Citation quality judged by Mistral 7B-Instruct.

       STORM   w/o Perspective   w/o Conversation
|R|    99.83   54.36             39.56

Table 5: Average number of unique references (|R|) collected using different methods.
in Table 4, Mistral 7B-Instruct judges that 84.83% of
the sentences are supported by their citations. Ap-
pendix C.3 investigates the unsupported sentences
and reveals that the primary issues stem from draw-
ing improper inferences and inaccurate paraphras-
ing, rather than hallucinating non-existent contents.
5.2 Ablation Studies
As introduced in §3, STORM prompts LLMs to
ask effective questions by discovering specific
perspectives and simulating multi-turn conversa-
tions. We conduct the ablation study on outline
creation by comparing STORM with two variants:
(1) “STORM w/o Perspective”, which omits per-
spective in the question generation prompt; (2)
“STORM w/o Conversation”, which prompts LLMs
to generate a set number of questions altogether. To
ensure a fair comparison, we keep the total number
of generated questions equal across all variants.
Table 3 shows the ablation results; the full STORM
pipeline produces outlines with the highest recall.
Also, “STORM w/o Conversation” gives much
worse results, indicating that reading relevant informa-
tion is crucial to generating effective questions. We
further examine how many unique sources are col-
lected in R via different variants. As shown in Ta-
ble 5, the full pipeline discovers more distinct
sources, and the trend accords with the automatic
metrics for outline quality.
We also verify whether having an outline stage
is necessary with STORM. In Table 2, “STORM
w/o Outline Stage” denotes the results of generat-
ing the entire article given the topic and the sim-
ulated conversations. Removing the outline stage
significantly deteriorates the performance across
all metrics.
6 Human Evaluation
To better understand the strengths and weaknesses
of STORM, we conduct human evaluation by col-
laborating with 10 experienced Wikipedia editors
                   oRAG                  STORM
                   Avg.   ≥ 4 Rates      Avg.   ≥ 4 Rates    p-value
Interest Level     3.63   57.5%          4.03   70.0%        0.077
Organization       3.25   45.0%          4.00   70.0%        0.005
Relevance          3.93   62.5%          4.15   65.0%        0.347
Coverage           3.58   57.5%          4.00   67.5%        0.084
Verifiability      3.85   67.5%          3.80   67.5%        0.843
#Preferred         14                    26

Table 6: Human evaluation results on 20 pairs of articles generated by STORM and oRAG. Each pair of articles
is evaluated by two Wikipedia editors. The ratings are given on a scale between 1 and 7, with values ≥ 4
indicating good quality (see Table 10). We conduct paired t-test and report the p-value.
who have made at least 500 edits on Wikipedia and
have more than 1 year of experience. We randomly
sample 20 topics from our dataset and evaluate the
articles generated by our method and oRAG, the
best baseline according to the automatic evaluation.
Each pair of articles is assigned to 2 editors.
We request editors to judge each article from the
same five aspects defined in §4.2, but using a 1 to
7 scale for more fine-grained evaluation. While
our automatic evaluation uses citation quality as
a proxy to evaluate Verifiability, we stick to the
Wikipedia standard of “verifiable with no original
research” in human evaluation. Besides rating the
articles, editors are asked to provide open-ended
feedback and pairwise preference. After the evalua-
tion finishes, they are further requested to compare
an article produced by our method, which they have
just reviewed, with its human-written counterpart,
and report their perceived usefulness of STORM
using a 1-5 Likert scale. More human evaluation de-
tails are included in Appendix D. Table 6 presents
the rating and pairwise comparison results. 11
Articles produced by STORM exhibit greater
breadth and depth than oRAG outputs. In ac-
cord with the finding in §5.1, editors judge articles
produced by STORM as more interesting, orga-
nized, and having broader coverage compared to
oRAG outputs. Specifically, 25% more articles pro-
duced by STORM are considered organized (Orga-
nization rating ≥ 4), and 10% more are deemed to
have good coverage (Coverage rating ≥ 4). Even
in comparison with human-written articles, one
editor praises our result as providing “a bit more
11 For the 1-7 scale rating results on each criterion, we calculate Krippendorff's alpha to measure the inter-annotator agreement (IAA), and the results are as follows: Interest Level (0.349), Organization (0.221), Relevance (0.256), Coverage (0.346), Verifiability (0.388).
[Figure 3 graphic: stacked bars of responses (Strongly Disagree, Somewhat Disagree, Neutral, Somewhat Agree, Strongly Agree) for three statements: "I think it can be specifically helpful for my pre-writing stage.", "I think it will help me edit a Wikipedia article for a new topic.", and "I think it can be a potentially useful tool for the Wikipedia community."]

Figure 3: Survey results of the perceived usefulness of STORM (n = 10).
background information” and another notes that “I
found that the AI articles had more depth compared
to the Wikipedia articles”. STORM also outper-
forms the best baseline in pairwise comparison.
More information in |R| poses challenges be-
yond factual hallucination. We examine 14 pair-
wise comparison responses where editors prefer
oRAG outputs over STORM. Excluding 3 cases
where pairwise preferences do not align with their
ratings, editors assign lower Verifiability scores to
articles from our approach in over 50% of the cases.
Through analyzing the articles and editors’ free-
form feedback, we discover that low Verifiability
scores stem from red herring fallacy or overspec-
ulation issues. These arise when the generated
articles introduce unverifiable connections between
different pieces of information in |R| or between
the information and the topic (examples included
in Table 11). Compared to the widely discussed
factual hallucination (Shuster et al., 2021; Huang
et al., 2023), addressing such verifiability issues is
more nuanced, surpassing basic fact-checking (Min
et al., 2023).
Generated articles trail behind well-revised hu-
man works. While STORM outperforms the
oRAG baseline, editors comment that the generated
articles are less informative than actual Wikipedia
pages. Another major issue identified is the trans-
fer of bias and tone from Internet sources to the
generated article, with 7 out of 10 editors men-
tioning that the STORM-generated articles sound
“emotional” or “unneutral”. More analysis is dis-
cussed in Appendix E. This feedback suggests that
reducing the retrieval bias in the pre-writing stage
is a worthwhile direction for future work.
Generated articles are a good starting point. As
shown in Figure 3, editors are unanimous in agree-
ing that STORM can aid them in their pre-writing
stage. It is gratifying to know that the tool is help-
ful to experienced editors. 80% of the editors think
that STORM can help them edit a Wikipedia article
for a new topic. More reservations are expressed about
the usefulness of STORM for the Wikipedia com-
munity at large; nonetheless, 70% of the editors
think it is useful, with only 10% disagreeing.
7 Related Works
Retrieval-Augmented Generation (RAG) Aug-
menting language models (LMs) with retrieval at
inference time is a typical way to leverage exter-
nal knowledge stores (Ram et al., 2023; Izacard
et al., 2023). While some works use retrieval
to construct demonstrations for in-context learn-
ing (Li et al., 2023; Liu et al., 2022; Agrawal et al.,
2023; Poesia et al., 2022; Shi et al., 2022; Khattab
et al., 2022), another line of work uses retrieval to
provide additional information for LMs to ground
on. Lewis et al. (2020) study RAG on knowledge-
intensive NLP tasks and find it improves diver-
sity and factuality. Semnani et al. (2023) design
a RAG-based chatbot grounded on English
Wikipedia to stop LLM-based chatbots from hal-
lucinating. Besides, RAG can be used to generate
text with citations (Menick et al., 2022; Gao et al.,
2023) and build attributed question answering sys-
tems (Bohnet et al., 2023). While RAG is widely
studied in question answering, how to use it for
long-form article generation is less investigated.
As a general framework, RAG is flexible in both
the retrieval source and time. The retrieval sources
can vary from domain databases (Zakka et al.,
2023), code documentation (Zhou et al., 2023),
to the whole Internet (Nakano et al., 2022; Komeili
et al., 2022). Regarding the time, besides a one-
time retrieval before generation, the system can be
designed to self-decide when to retrieve across the
course of the generation (Jiang et al., 2023b; Parisi
et al., 2022; Shuster et al., 2022; Yao et al., 2023).
Automatic Expository Writing Different from
other types of long-form generation (Yang et al.,
2022; Feng et al., 2018), automatic expository writ-
ing requires grounding on external documents and
leveraging the interplay between reading and writ-
ing. Balepur et al. (2023) propose the Imitate-
Retrieve-Paraphrase framework for expository writ-
ing at the paragraph level to address the challenges
in synthesizing information from multiple sources.
Beyond summarizing sources, Shen et al. (2023)
highlight that expository writing requires the au-
thor’s sensemaking process over source documents
and good outline planning. We tackle these chal-
lenges by focusing on the pre-writing stage.
Question Asking in NLP Question asking capa-
bilities in NLP systems have expanded across sev-
eral fronts, including generating clarification ques-
tions to understand user intents (Aliannejadi et al.,
2019; Rahmani et al., 2023), and breaking large
questions into smaller ones to improve composi-
tional reasoning (Press et al., 2023). While humans
usually ask questions to learn new knowledge (Taw-
fik et al., 2020; Booth et al., 2003), how to opti-
mize question informativeness and specificity in
information-seeking conversations remains less ex-
plored. The closest work is Qi et al. (2020), which
defines question informativeness using a unigram
precision function and uses reinforcement learning
to increase it.
8 Conclusion
We propose STORM, an LLM-based writing sys-
tem that automates the pre-writing stage for creat-
ing Wikipedia-like articles from scratch. We cu-
rate the FreshWiki dataset and establish evaluation
criteria to study the generation of grounded long-
form articles. Experimental results demonstrate
that the question asking mechanism in STORM
improves both the outline and article quality. With
the improved breadth and depth, STORM helps
surface new challenges for grounded writing sys-
tems through expert evaluation. The experienced
Wikipedia editors in our study unanimously agree
that STORM is helpful for their pre-writing stage.
Limitations
In this work, we explore generating Wikipedia-
like articles from scratch as a way to push the
frontier of automatic expository writing and long-
form article generation. While our approach sig-
nificantly outperforms baseline methods in both
automatic and human evaluations, the quality of
machine-written articles still lags behind well-
revised human-authored articles, specifically in
aspects of neutrality and verifiability. Although
STORM discovers different perspectives in re-
searching the given topic, the collected information
may still be biased towards dominant sources on
the Internet and may contain promotional content.
Moreover, the verifiability issues identified in this
work go beyond factual hallucination, which high-
lights new challenges to grounded writing systems.
Another limitation of this work is that although
we focus on the task of generating Wikipedia-like
articles from scratch, our task setup is still simpli-
fied to only consider the generation of free-form
text. Human-authored high-quality Wikipedia ar-
ticles usually contain structured data and multi-
modal information. We leave the exploration of
generating multi-modal grounded articles for fu-
ture work.
Acknowledgements
We thank You.com for generously providing the
search API that supported our experiments. We
also thank Sina J. Semnani, Shicheng Liu, Eric Ze-
likman for providing helpful feedback and the ACL
ARR reviewers for their valuable comments. This
work is supported in part by the Verdant Founda-
tion and Microsoft Azure AI credits. Yijia Shao
is supported by a Stanford School of Engineering
Fellowship.
Ethics Statement
Different from creative generation, grounded ar-
ticle generation may impact how people learn about
topics or consume source information. All the stud-
ies and the evaluation in this work are designed
to prevent the dissemination of misinformation by
not publishing generated content online and im-
plementing strict accuracy checks. We avoid any
disruption to Wikipedia or related communities, as
our system does not interact with live pages. Also,
although we try to generate grounded articles, we
believe there is no privacy issue related to this work
as we only use information publicly available on
the Internet.
The primary risk of our work is that the
Wikipedia articles written by our system are
grounded on information on the Internet which
contains some biased or discriminatory content on
its own. Currently, our system relies on the search
engine to retrieve information but does not include
any post-processing module. We believe improv-
ing the retrieval module to have good coverage of
different viewpoints and adding a content sifting
module to the current system will be a critical next
step to achieve better neutrality and balance in the
generated articles.
Another limitation we see from an ethical point
of view is that we only consider writing English
Wikipedia articles in this work. Extending the cur-
rent system to a multilingual setup is a meaningful
direction for future work, as many topics do not have
Wikipedia pages in non-English languages.
References
Sweta Agrawal, Chunting Zhou, Mike Lewis, Luke
Zettlemoyer, and Marjan Ghazvininejad. 2023. In-
context examples selection for machine translation.
In Findings of the Association for Computational
Linguistics: ACL 2023, pages 8857–8873, Toronto,
Canada. Association for Computational Linguistics.
Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif
Rasul, Stefan Schweter, and Roland Vollgraf. 2019.
FLAIR: An easy-to-use framework for state-of-the-
art NLP. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for
Computational Linguistics (Demonstrations), pages
54–59, Minneapolis, Minnesota. Association for
Computational Linguistics.
Mohammad Aliannejadi, Hamed Zamani, Fabio
Crestani, and W Bruce Croft. 2019. Asking clari-
fying questions in open-domain information-seeking
conversations. In Proceedings of the 42nd interna-
tional acm sigir conference on research and develop-
ment in information retrieval, pages 475–484.
Nishant Balepur, Jie Huang, and Kevin Chang. 2023.
Expository text generation: Imitate, retrieve, para-
phrase. In Proceedings of the 2023 Conference on
Empirical Methods in Natural Language Process-
ing, pages 11896–11919, Singapore. Association for
Computational Linguistics.
Siddhartha Banerjee and Prasenjit Mitra. 2015.
WikiKreator: Improving Wikipedia stubs automat-
ically. In Proceedings of the 53rd Annual Meet-
ing of the Association for Computational Linguis-
tics and the 7th International Joint Conference on
Natural Language Processing (Volume 1: Long Pa-
pers), pages 867–877, Beijing, China. Association
for Computational Linguistics.
Bernd Bohnet, Vinh Q. Tran, Pat Verga, Roee Aha-
roni, Daniel Andor, Livio Baldini Soares, Massimil-
iano Ciaramita, Jacob Eisenstein, Kuzman Ganchev,
Jonathan Herzig, Kai Hui, Tom Kwiatkowski, Ji Ma,
Jianmo Ni, Lierni Sestorain Saralegui, Tal Schus-
ter, William W. Cohen, Michael Collins, Dipanjan
Das, Donald Metzler, Slav Petrov, and Kellie Webster.
2023. Attributed question answering: Evaluation and
modeling for attributed large language models.
Wayne C Booth, Gregory G Colomb, and Joseph M
Williams. 2003. The craft of research. University of
Chicago press.
Laura Dietz and John Foley. 2019. Trec car y3: Com-
plex answer retrieval overview. In Proceedings of
Text REtrieval Conference (TREC).
Christina S Doyle. 1994. Information literacy in an
information society: A concept for the information
age. Diane Publishing.
Ann-Marie Eriksson and Åsa Mäkitalo. 2015. Supervi-
sion at the outline stage: Introducing and encounter-
ing issues of sustainable development through aca-
demic writing assignments. Text & Talk, 35(2):123–
153.
Angela Fan and Claire Gardent. 2022. Generating bi-
ographies on Wikipedia: The impact of gender bias
on the retrieval-based generation of women biogra-
phies. In Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics (Vol-
ume 1: Long Papers), pages 8561–8576, Dublin,
Ireland. Association for Computational Linguistics.
Xiaocheng Feng, Ming Liu, Jiahao Liu, Bing Qin, Yibo
Sun, and Ting Liu. 2018. Topic-to-essay generation
with neural networks. In IJCAI, pages 4078–4084.
Tira Nur Fitria. 2023. Artificial intelligence (ai) tech-
nology in openai chatgpt application: A review of
chatgpt in writing english essay. In ELT Forum: Jour-
nal of English Language Teaching, volume 12, pages
44–58.
Pasi Fränti and Radu Mariescu-Istodor. 2023. Soft preci-
sion and recall. Pattern Recognition Letters, 167:115–
121.
R Edward Freeman, Jeffrey S Harrison, Andrew C
Wicks, Bidhan L Parmar, and Simone De Colle. 2010.
Stakeholder theory: The state of the art.
Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen.
2023. Enabling large language models to generate
text with citations. In Proceedings of the 2023 Con-
ference on Empirical Methods in Natural Language
Processing, pages 6465–6488, Singapore. Associa-
tion for Computational Linguistics.
Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong,
Zhangyin Feng, Haotian Wang, Qianglong Chen,
Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting
Liu. 2023. A survey on hallucination in large lan-
guage models: Principles, taxonomy, challenges, and
open questions.
Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas
Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-
Yu, Armand Joulin, Sebastian Riedel, and Edouard
Grave. 2023. Atlas: Few-shot learning with retrieval
augmented language models. Journal of Machine
Learning Research, 24(251):1–43.
Albert Q Jiang, Alexandre Sablayrolles, Arthur Men-
sch, Chris Bamford, Devendra Singh Chaplot, Diego
de las Casas, Florian Bressand, Gianna Lengyel, Guil-
laume Lample, Lucile Saulnier, et al. 2023a. Mistral
7b. arXiv preprint arXiv:2310.06825.
Zhengbao Jiang, Frank Xu, Luyu Gao, Zhiqing Sun,
Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie
Callan, and Graham Neubig. 2023b. Active retrieval
augmented generation. In Proceedings of the 2023
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 7969–7992, Singapore. As-
sociation for Computational Linguistics.
Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric
Wallace, and Colin Raffel. 2023. Large language
models struggle to learn long-tail knowledge. In In-
ternational Conference on Machine Learning, pages
15696–15707. PMLR.
Omar Khattab, Keshav Santhanam, Xiang Lisa
Li, David Hall, Percy Liang, Christopher Potts,
and Matei Zaharia. 2022. Demonstrate-search-
predict: Composing retrieval and language mod-
els for knowledge-intensive NLP. arXiv preprint
arXiv:2212.14024.
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari,
Zhiyuan Zhang, Keshav Santhanam, Sri Vard-
hamanan, Saiful Haq, Ashutosh Sharma, Thomas T.
Joshi, Hanna Moazam, Heather Miller, Matei Za-
haria, and Christopher Potts. 2023. Dspy: Compiling
declarative language model calls into self-improving
pipelines. arXiv preprint arXiv:2310.03714.
Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang,
Shayne Longpre, Hwaran Lee, Sangdoo Yun,
Seongjin Shin, Sungdong Kim, James Thorne, et al.
2023. Prometheus: Inducing fine-grained evalua-
tion capability in language models. arXiv preprint
arXiv:2310.08491.
Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2022.
Internet-augmented dialogue generation. In Proceed-
ings of the 60th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Pa-
pers), pages 8460–8478, Dublin, Ireland. Association
for Computational Linguistics.
Kalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit
Iyyer, Pradeep Dasigi, Arman Cohan, and Kyle Lo.
2023. LongEval: Guidelines for human evaluation of
faithfulness in long-form summarization. In Proceed-
ings of the 17th Conference of the European Chap-
ter of the Association for Computational Linguistics,
pages 1650–1669, Dubrovnik, Croatia. Association
for Computational Linguistics.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio
Petroni, Vladimir Karpukhin, Naman Goyal, Hein-
rich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock-
täschel, et al. 2020. Retrieval-augmented generation
for knowledge-intensive nlp tasks. Advances in Neu-
ral Information Processing Systems, 33:9459–9474.
Xiaonan Li, Kai Lv, Hang Yan, Tianyang Lin, Wei Zhu,
Yuan Ni, Guotong Xie, Xiaoling Wang, and Xipeng
Qiu. 2023. Unified demonstration retriever for in-
context learning. In Proceedings of the 61st Annual
Meeting of the Association for Computational Lin-
guistics (Volume 1: Long Papers), pages 4644–4668,
Toronto, Canada. Association for Computational Lin-
guistics.
Chin-Yew Lin. 2004. ROUGE: A package for auto-
matic evaluation of summaries. In Text Summariza-
tion Branches Out, pages 74–81, Barcelona, Spain.
Association for Computational Linguistics.
Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan,
Lawrence Carin, and Weizhu Chen. 2022. What
makes good in-context examples for GPT-3? In
Proceedings of Deep Learning Inside Out (DeeLIO
2022): The 3rd Workshop on Knowledge Extrac-
tion and Integration for Deep Learning Architectures,
pages 100–114, Dublin, Ireland and Online. Associa-
tion for Computational Linguistics.
Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben
Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam
Shazeer. 2018. Generating wikipedia by summariz-
ing long sequences. In International Conference on
Learning Representations.
Jacob Menick, Maja Trebacz, Vladimir Mikulik,
John Aslanides, Francis Song, Martin Chadwick,
Mia Glaese, Susannah Young, Lucy Campbell-
Gillingham, Geoffrey Irving, and Nat McAleese.
2022. Teaching language models to support answers
with verified quotes.
Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis,
Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettle-
moyer, and Hannaneh Hajishirzi. 2023. FActScore:
Fine-grained atomic evaluation of factual precision
in long form text generation. In Proceedings of the
2023 Conference on Empirical Methods in Natural
Language Processing, pages 12076–12100, Singa-
pore. Association for Computational Linguistics.
Julià Minguillón, Maura Lerga, Eduard Aibar, Josep
Lladós-Masllorens, and Antoni Meseguer-Artola.
2017. Semi-automatic generation of a corpus of
wikipedia articles on science and technology. Profe-
sional de la Información, 26(5):995–1005.
Rosa Munoz-Luna. 2015. Main ingredients for suc-
cess in l2 academic writing: Outlining, drafting and
proofreading. PloS one, 10(6):e0128309.
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu,
Long Ouyang, Christina Kim, Christopher Hesse,
Shantanu Jain, Vineet Kosaraju, William Saunders,
Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen
Krueger, Kevin Button, Matthew Knight, Benjamin
Chess, and John Schulman. 2022. Webgpt: Browser-
assisted question-answering with human feedback.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida,
Carroll Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, et al.
2022. Training language models to follow instruc-
tions with human feedback. Advances in Neural
Information Processing Systems, 35:27730–27744.
Aaron Parisi, Yao Zhao, and Noah Fiedel. 2022. Talm:
Tool augmented language models.
John V Pavlik. 2023. Collaborating with chatgpt: Con-
sidering the implications of generative artificial intel-
ligence for journalism and media education. Journal-
ism & Mass Communication Educator, 78(1):84–93.
Gabriel Poesia, Alex Polozov, Vu Le, Ashish Tiwari,
Gustavo Soares, Christopher Meek, and Sumit Gul-
wani. 2022. Synchromesh: Reliable code generation
from pre-trained language models. In International
Conference on Learning Representations.
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt,
Noah Smith, and Mike Lewis. 2023. Measuring and
narrowing the compositionality gap in language mod-
els. In Findings of the Association for Computational
Linguistics: EMNLP 2023, pages 5687–5711, Singa-
pore. Association for Computational Linguistics.
Peng Qi, Yuhao Zhang, and Christopher D. Manning.
2020. Stay hungry, stay focused: Generating infor-
mative and specific questions in information-seeking
conversations. In Findings of the Association for
Computational Linguistics: EMNLP 2020, pages 25–
40, Online. Association for Computational Linguis-
tics.
Hongjing Qian, Yutao Zhu, Zhicheng Dou, Haoqi Gu,
Xinyu Zhang, Zheng Liu, Ruofei Lai, Zhao Cao,
Jian-Yun Nie, and Ji-Rong Wen. 2023. Webbrain:
Learning to generate factually correct articles for
queries by grounding on large web corpus.
Hossein A. Rahmani, Xi Wang, Yue Feng, Qiang Zhang,
Emine Yilmaz, and Aldo Lipani. 2023. A survey on
asking clarification questions datasets in conversa-
tional systems. In Proceedings of the 61st Annual
Meeting of the Association for Computational Lin-
guistics (Volume 1: Long Papers), pages 2698–2716,
Toronto, Canada. Association for Computational Lin-
guistics.
Ashwin Ram. 1991. A theory of questions and question
asking. Journal of the Learning Sciences, 1(3-4):273–
318.
Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay,
Amnon Shashua, Kevin Leyton-Brown, and Yoav
Shoham. 2023. In-context retrieval-augmented lan-
guage models. Transactions of the Association for
Computational Linguistics.
Nils Reimers and Iryna Gurevych. 2019. Sentence-
BERT: Sentence embeddings using Siamese BERT-
networks. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP), pages
3982–3992, Hong Kong, China. Association for Com-
putational Linguistics.
D Gordon Rohman. 1965. Pre-writing the stage of dis-
covery in the writing process. College composition
and communication, 16(2):106–112.
Christina Sauper and Regina Barzilay. 2009. Auto-
matically generating Wikipedia articles: A structure-
aware approach. In Proceedings of the Joint Con-
ference of the 47th Annual Meeting of the ACL and
the 4th International Joint Conference on Natural
Language Processing of the AFNLP, pages 208–216,
Suntec, Singapore. Association for Computational
Linguistics.
Sina Semnani, Violet Yao, Heidi Zhang, and Monica
Lam. 2023. WikiChat: Stopping the hallucination of
large language model chatbots by few-shot ground-
ing on Wikipedia. In Findings of the Association
for Computational Linguistics: EMNLP 2023, pages
2387–2413, Singapore. Association for Computa-
tional Linguistics.
Zejiang Shen, Tal August, Pao Siangliulue, Kyle Lo,
Jonathan Bragg, Jeff Hammerbacher, Doug Downey,
Joseph Chee Chang, and David Sontag. 2023. Be-
yond summarization: Designing ai support for real-
world expository writing tasks.
Weijia Shi, Julian Michael, Suchin Gururangan, and
Luke Zettlemoyer. 2022. Nearest neighbor zero-shot
inference. In Proceedings of the 2022 Conference on
Empirical Methods in Natural Language Processing,
pages 3254–3265, Abu Dhabi, United Arab Emirates.
Association for Computational Linguistics.
Kurt Shuster, Mojtaba Komeili, Leonard Adolphs,
Stephen Roller, Arthur Szlam, and Jason Weston.
2022. Language models that seek for knowledge:
Modular search & generation for dialogue and
prompt completion. In Findings of the Association
for Computational Linguistics: EMNLP 2022, pages
373–393, Abu Dhabi, United Arab Emirates. Associ-
ation for Computational Linguistics.
Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela,
and Jason Weston. 2021. Retrieval augmentation
reduces hallucination in conversation. In Findings
of the Association for Computational Linguistics:
EMNLP 2021, pages 3784–3803, Punta Cana, Do-
minican Republic. Association for Computational
Linguistics.
Christine M Tardy. 2010. Writing for the world:
Wikipedia as an introduction to academic writing. In
English teaching forum, volume 48, page 12. ERIC.
Andrew A Tawfik, Arthur Graesser, Jessica Gatewood,
and Jaclyn Gishbaugher. 2020. Role of questions in
inquiry-based instruction: towards a design taxon-
omy for question-asking and implications for design.
Educational Technology Research and Development,
68:653–678.
Charles A Weaver III and Walter Kintsch. 1991. Expos-
itory text.
Karsten Wenzlaff and Sebastian Spaeth. 2022. Smarter
than humans? validating how openai’s chatgpt model
explains crowdfunding, alternative finance and com-
munity finance. Validating how OpenAI’s ChatGPT
model explains Crowdfunding, Alternative Finance
and Community Finance.(December 22, 2022).
Fangyuan Xu, Yixiao Song, Mohit Iyyer, and Eunsol
Choi. 2023. A critical evaluation of evaluations for
long-form question answering. In Proceedings of the
61st Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers), pages
3225–3245, Toronto, Canada. Association for Com-
putational Linguistics.
Kevin Yang, Dan Klein, Nanyun Peng, and Yuandong
Tian. 2023. DOC: Improving long story coherence
with detailed outline control. In Proceedings of the
61st Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers), pages
3378–3465, Toronto, Canada. Association for Com-
putational Linguistics.
Kevin Yang, Yuandong Tian, Nanyun Peng, and Dan
Klein. 2022. Re3: Generating longer stories with
recursive reprompting and revision. In Proceedings
of the 2022 Conference on Empirical Methods in Nat-
ural Language Processing, pages 4393–4479, Abu
Dhabi, United Arab Emirates. Association for Com-
putational Linguistics.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak
Shafran, Karthik R Narasimhan, and Yuan Cao. 2023.
React: Synergizing reasoning and acting in language
models. In The Eleventh International Conference
on Learning Representations.
Cyril Zakka, Akash Chaurasia, Rohan Shad, Alex R
Dalal, Jennifer L Kim, Michael Moor, Kevin Alexan-
der, Euan Ashley, Jack Boyd, Kathleen Boyd, et al.
2023. Almanac: Retrieval-augmented language mod-
els for clinical medicine. Research Square.
Shuyan Zhou, Uri Alon, Frank F. Xu, Zhengbao Jiang,
and Graham Neubig. 2023. Docprompting: Gener-
ating code by retrieving the docs. In The Eleventh
International Conference on Learning Representa-
tions.
Table 7: Statistics of the dataset used in our experiments.
    Average Number of Sections: 8.4
    Average Number of All-level Headings: 15.8
    Average Length of a Section: 327.8
    Average Length of Total Article: 2159.1
    Average Number of References: 90.1

Figure 4: Evolution of reference count in the Wikipedia article editing process.

Figure 5: Distribution of edit counts for Wikipedia articles in our experiments (n = 100).

A    Pseudo Code of STORM
In §3, we introduce STORM, a framework that au-
tomates the pre-writing stage by discovering differ-
ent perspectives, simulating information-seeking
conversations, and creating a comprehensive out-
line. Algorithm 1 displays the skeleton of STORM.
We implement STORM with zero-shot prompt-
ing using the DSPy framework (Khattab et al.,
2023). Listing 1 and 2 show the prompts used
in our implementation. We highlight that STORM
offers a general framework designed to assist the
creation of grounded, long-form articles, without
depending extensively on prompt engineering for a
single domain.
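For illustration, here is a minimal sketch of how one of the signature classes from Listing 1 might be invoked through DSPy's Predict module. The language model configuration and the example inputs are placeholders, not the exact setup used in our experiments.

import dspy

# Placeholder LM configuration; substitute whichever backend you actually use.
lm = dspy.OpenAI(model="gpt-3.5-turbo", max_tokens=500)
dspy.settings.configure(lm=lm)

# GenQueriesPrompt is the zero-shot signature shown in Listing 1; dspy.Predict
# turns it into a callable module whose output fields become attributes of the
# returned prediction.
gen_queries = dspy.Predict(GenQueriesPrompt)
pred = gen_queries(
    topic="Taylor Hawkins",
    question="Which bands did Taylor Hawkins play in before joining the Foo Fighters?",
)
print(pred.queries)  # One search query per line, per the signature's instructions.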
B    Dataset Details

As discussed in §2.1, we curate the FreshWiki dataset by collecting recent and high-quality English Wikipedia articles. We select the most-edited pages over a specific period rather than using creation dates as a cutoff, because most Wikipedia articles are "stubs" or are of low quality when first created. For quality, we consider articles predicted to be of B-class quality or above. According to Wikipedia statistics (https://en.wikipedia.org/wiki/Wikipedia:Content_assessment), only around 3% of existing Wikipedia pages meet this quality standard.
As LLMs can generate reasonably good outputs,
we think it is important to use high-quality human-
written articles as references for further research.
For experiments in this work, we randomly select 100 samples whose human-written articles are under 3000 words to allow a meaningful comparison.
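As a rough illustration of this selection step, the sketch below filters candidate pages by predicted quality class and article length and keeps the most-edited ones. The metadata fields and the candidates list are hypothetical stand-ins for however the page statistics are gathered.

# Hypothetical per-page metadata (edit counts over the target period, predicted
# quality class, article length in words), gathered elsewhere.
candidates = [
    {"title": "Example Topic", "edit_count": 412, "quality": "B", "word_count": 2450},
    # ...
]

GOOD_CLASSES = {"FA", "A", "GA", "B"}  # B-class quality or above

freshwiki_sample = [
    page
    for page in sorted(candidates, key=lambda p: p["edit_count"], reverse=True)
    if page["quality"] in GOOD_CLASSES and page["word_count"] < 3000
][:100]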
Table 7 gives the data statistics. Notably, human-
authored articles have a large number of references
but they require numerous edits to achieve this. Fig-
ure 4 illustrates the evolution of the reference count
in the article edit process and Figure 5 gives the dis-
tribution of edit counts for human-authored articles
used in our experiments.

C    Automatic Evaluation Details

C.1    Soft Heading Recall

We calculate the soft heading recall between the multi-level headings in the generated outline, considered as the prediction P, and those in the human-written article, considered as the ground truth G. The calculation is based on the soft recall definition in Fränti and Mariescu-Istodor (2023). Given a set A = {A_i}_{i=1}^{K}, the soft count of an item is defined as the inverse of the sum of its similarity to the other items in the set:

    count(A_i) = 1 / \sum_{j=1}^{K} Sim(A_i, A_j),    (1)
    Sim(A_i, A_j) = cos(embed(A_i), embed(A_j)),

where embed(·) in Equation (1) is parameterized by paraphrase-MiniLM-L6-v2 provided in the Sentence-Transformers library (https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2).
class GenRelatedTopicsPrompt(dspy.Signature):
    """
    I'm writing a Wikipedia page for a topic mentioned below. Please identify and
    recommend some Wikipedia pages on closely related subjects. I'm looking for
    examples that provide insights into interesting aspects commonly associated
    with this topic, or examples that help me understand the typical content and
    structure included in Wikipedia pages for similar topics.
    Please list the urls in separate lines.
    """

    topic = dspy.InputField(prefix="Topic of interest:", format=str)
    related_topics = dspy.OutputField()


class GenPerspectivesPrompt(dspy.Signature):
    """
    You need to select a group of Wikipedia editors who will work together to create
    a comprehensive article on the topic. Each of them represents a different
    perspective, role, or affiliation related to this topic. You can use other
    Wikipedia pages of related topics for inspiration. For each editor, add
    description of what they will focus on.
    Give your answer in the following format: 1. short summary of editor 1:
    description\n2. short summary of editor 2: description\n...
    """

    topic = dspy.InputField(prefix='Topic of interest: ', format=str)
    examples = dspy.InputField(
        prefix='Wiki page outlines of related topics for inspiration:\n', format=str)
    perspectives = dspy.OutputField()


class GenQnPrompt(dspy.Signature):
    """
    You are an experienced Wikipedia writer and want to edit a specific page.
    Besides your identity as a Wikipedia writer, you have a specific focus when
    researching the topic.
    Now, you are chatting with an expert to get information. Ask good questions to
    get more useful information.
    When you have no more question to ask, say "Thank you so much for your help!" to
    end the conversation.
    Please only ask one question at a time and don't ask what you have asked before.
    Your questions should be related to the topic you want to write.
    """

    topic = dspy.InputField(prefix='Topic you want to write: ', format=str)
    persona = dspy.InputField(prefix='Your specific perspective: ', format=str)
    conv = dspy.InputField(prefix='Conversation history:\n', format=str)
    question = dspy.OutputField()


class GenQueriesPrompt(dspy.Signature):
    """
    You want to answer the question using Google search. What do you type in the
    search box?
    Write the queries you will use in the following format: - query 1\n- query 2\n...
    """

    topic = dspy.InputField(prefix='Topic you are discussing about: ', format=str)
    question = dspy.InputField(prefix='Question you want to answer: ', format=str)
    queries = dspy.OutputField()

Listing 1: Prompts used in STORM, corresponding to Line 4, 11, 19, 22 in Algorithm 1.
class GenAnswerPrompt(dspy.Signature):
    """
    You are an expert who can use information effectively. You are chatting with a
    Wikipedia writer who wants to write a Wikipedia page on topic you know. You
    have gathered the related information and will now use the information to
    form a response.
    Make your response as informative as possible and make sure every sentence is
    supported by the gathered information.
    """

    topic = dspy.InputField(prefix='Topic you are discussing about: ', format=str)
    conv = dspy.InputField(prefix='Question:\n', format=str)
    info = dspy.InputField(prefix='Gathered information:\n', format=str)
    answer = dspy.OutputField(prefix='Now give your response:\n')


class DirectGenOutlinePrompt(dspy.Signature):
    """
    Write an outline for a Wikipedia page.
    Here is the format of your writing:
    1. Use "# Title" to indicate section title, "## Title" to indicate subsection
       title, "### Title" to indicate subsubsection title, and so on.
    2. Do not include other information.
    """

    topic = dspy.InputField(prefix="Topic you want to write: ", format=str)
    outline = dspy.OutputField(prefix="Write the Wikipedia page outline:\n")


class RefineOutlinePrompt(dspy.Signature):
    """
    Improve an outline for a Wikipedia page. You already have a draft outline that
    covers the general information. Now you want to improve it based on the
    information learned from an information-seeking conversation to make it more
    comprehensive.
    Here is the format of your writing:
    1. Use "# Title" to indicate section title, "## Title" to indicate subsection
       title, "### Title" to indicate subsubsection title, and so on.
    2. Do not include other information.
    """

    topic = dspy.InputField(prefix="Topic you want to write: ", format=str)
    conv = dspy.InputField(prefix="Conversation history:\n", format=str)
    old_outline = dspy.OutputField(prefix="Current outline:\n", format=str)
    outline = dspy.OutputField(prefix='Write the Wikipedia page outline:\n')

Listing 2: Prompts used in STORM (continued), corresponding to Line 24, 31, 32 in Algorithm 1.
Algorithm 1: STORM
    Input: Topic t, maximum perspective N, maximum conversation round M
    Output: Outline O, references R
     1  P0 = "basic fact writer ..."            // Constant.
     2  R ← [ ]
     3  // Discover perspectives P.
     4  related_topics ← gen_related_topics(t)
     5  tocs ← [ ]
     6  foreach related_t in related_topics do
     7      article ← get_wiki_article(related_t)
     8      if article then
     9          tocs.append(extract_toc(article))
            end
    10  end
    11  P ← gen_perspectives(t, tocs)
    12  P ← [P0] + P[:N]
    13  // Simulate conversations.
    14  convos ← [ ]
    15  foreach p in P do
    16      convo_history ← [ ]
    17      for i = 1 to M do
    18          // Question asking.
    19          q ← gen_qn(t, p, convo_history)
    20          convo_history.append(q)
    21          // Question answering.
    22          queries ← gen_queries(t, q)
    23          sources ← search_and_sift(queries)
    24          a ← gen_ans(t, q, sources)
    25          convo_history.append(a)
    26          R.append(sources)
    27      end
    28      convos.append(convo_history)
    29  end
    30  // Create the outline.
    31  O_D ← direct_gen_outline(t)
    32  O ← refine_outline(t, O_D, convos)
    33  return O, R

The cardinality of A is the sum of the soft counts of its individual items:

    card(A) = \sum_{i=1}^{K} count(A_i).    (2)

The soft heading recall is then calculated as

    soft heading recall = card(G ∩ P) / card(G),    (3)

where the cardinality of the intersection is defined via the union as follows:

    card(G ∩ P) = card(G) + card(P) − card(G ∪ P).    (4)
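For concreteness, a minimal Python sketch of this metric under the definitions above; the helper names are our own, and treating G ∪ P as the concatenation of the two heading lists is one natural reading of Equation (4).

import torch
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-MiniLM-L6-v2")

def soft_card(embeddings: torch.Tensor) -> float:
    """card(A) per Equations (1)-(2): sum over items of 1 / sum_j Sim(A_i, A_j)."""
    sims = util.cos_sim(embeddings, embeddings)      # K x K cosine similarities
    return float((1.0 / sims.sum(dim=1)).sum())

def soft_heading_recall(pred_headings: list[str], gold_headings: list[str]) -> float:
    p = encoder.encode(pred_headings, convert_to_tensor=True)
    g = encoder.encode(gold_headings, convert_to_tensor=True)
    card_p, card_g = soft_card(p), soft_card(g)
    card_union = soft_card(torch.cat([g, p], dim=0))  # G ∪ P as concatenation (assumption)
    card_intersection = card_g + card_p - card_union  # Equation (4)
    return card_intersection / card_g                 # Equation (3)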
C.2    LLM Evaluator
We use Prometheus (Kim et al., 2023; https://huggingface.co/kaist-ai/prometheus-13b-v1.0), a 13B open-source evaluator LLM that can assess long-form text based on a customized 1-5 scale rubric, to grade the articles on the aspects of Interest Level, Coherence and Organization, Relevance and Focus, and Coverage. Table 8 gives our grading rubric. While Prometheus is best used with a score-5 reference answer, we find that adding the reference exceeds the context length limit of the model. Since Kim et al. (2023) show that Prometheus ratings without a reference also correlate well with human preferences, we omit the reference and trim the input article to within 2000 words by iteratively removing content from the shortest section, so that the input fits into the model's context window.
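As an illustration, here is a minimal sketch of this length-trimming step; the section parsing and the function and variable names are our own, while the 2000-word budget and shortest-section heuristic follow the description above.

def trim_to_budget(sections, max_words=2000):
    """Drop the shortest remaining section until the article fits the word budget.

    `sections` is assumed to be a list of (title, text) pairs; this mirrors the
    description above rather than the exact implementation.
    """
    sections = list(sections)
    word_count = lambda text: len(text.split())
    total = sum(word_count(text) for _, text in sections)
    while total > max_words and len(sections) > 1:
        # Find and remove the shortest remaining section.
        idx = min(range(len(sections)), key=lambda i: word_count(sections[i][1]))
        total -= word_count(sections[idx][1])
        sections.pop(idx)
    return sections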
C.3    More Discussion of the Citation Quality

Figure 6: Error analysis of unsupported sentences in 10 sampled articles (Lack Citation: 47%, False Negative: 15%, Improper Inferential Linking: 14%, Incorrectly Split: 12%, Inaccurate Paraphrasing: 7%, Irrelevant Source: 4%, Others: 1%).
Interest Level: How engaging and thought-provoking is the article?
Score 1: Not engaging at all; no attempt to capture the reader's attention.
Score 2: Fairly engaging with a basic narrative but lacking depth.
Score 3: Moderately engaging with several interesting points.
Score 4: Quite engaging with a well-structured narrative and noteworthy points that frequently capture and retain attention.
Score 5: Exceptionally engaging throughout, with a compelling narrative that consistently stimulates interest.

Coherence and Organization: Is the article well-organized and logically structured?
Score 1: Disorganized; lacks logical structure and coherence.
Score 2: Fairly organized; a basic structure is present but not consistently followed.
Score 3: Organized; a clear structure is mostly followed with some lapses in coherence.
Score 4: Good organization; a clear structure with minor lapses in coherence.
Score 5: Excellently organized; the article is logically structured with seamless transitions and a clear argument.

Relevance and Focus: Does the article stay on topic and maintain a clear focus?
Score 1: Off-topic; the content does not align with the headline or core subject.
Score 2: Somewhat on topic but with several digressions; the core subject is evident but not consistently adhered to.
Score 3: Generally on topic, despite a few unrelated details.
Score 4: Mostly on topic and focused; the narrative has a consistent relevance to the core subject with infrequent digressions.
Score 5: Exceptionally focused and entirely on topic; the article is tightly centered on the subject, with every piece of information contributing to a comprehensive understanding of the topic.

Broad Coverage: Does the article provide an in-depth exploration of the topic and have good coverage?
Score 1: Severely lacking; offers little to no coverage of the topic's primary aspects, resulting in a very narrow perspective.
Score 2: Partial coverage; includes some of the topic's main aspects but misses others, resulting in an incomplete portrayal.
Score 3: Acceptable breadth; covers most main aspects, though it may stray into minor unnecessary details or overlook some relevant points.
Score 4: Good coverage; achieves broad coverage of the topic, hitting on all major points with minimal extraneous information.
Score 5: Exemplary in breadth; delivers outstanding coverage, thoroughly detailing all crucial aspects of the topic without including irrelevant information.

Table 8: Scoring rubrics on a 1-5 scale for the evaluator LLM.
Error Type: Improper Inferential Linking
Topic: Lahaina, Hawaii
Unsupported Sentence: "Throughout its history, religion has remained the paramount aspect of Hawaiian life in Lahaina, permeating every daily activity and significant event[5]."
Source: [5] “Religion, Beliefs & Spirituality” (The source discusses religion as part of Hawaiian life but does not mention Lahaina.)

Error Type: Inaccurate Paraphrasing
Topic: 2022 Crimean Bridge explosion
Unsupported Sentence: "Completed in June 2020, the bridge serves as a major supply route for Russian forces in the region and is significant to Russia's claim over the disputed territory[2][11]."
Source: [2] “Crimean Bridge - Wikipedia” (The source says “The first scheduled passenger train crossed the bridge on 25 December 2019, while the bridge was opened for freight trains on 30 June 2020”.)

Error Type: Citing Irrelevant Sources
Topic: LK-99
Unsupported Sentence: "For example, comparisons have been drawn between the performance of LK-9 and the dynamic resolution capabilities of video games such as Battlefield 2042[22]."
Source: [22] “Battlefield 2042 PC performance guide: The best settings for a high frame rate” (The source is irrelevant to LK-99.)

Table 9: Examples of different error types of unsupported sentences.
We use Mistral 7B-Instruct (Jiang et al., 2023a; https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) to examine whether the cited passages entail the generated sentence. Table 4 reports the citation quality of articles produced by our approach, showing that around 15% of sentences in the generated articles are unsupported by citations. We further investigate the failure cases by randomly sampling 10 articles; one author manually examined all the unsupported sentences in these articles. Following Gao et al. (2023), we check citation quality at the sentence level and split articles into sentences using NLTK sent_tokenize, which sometimes fails to split sentences correctly when the article contains special strings such as “No.12847” or “Bhatia et al.”. Besides sentences that are incorrectly split, lack citations, or are deemed supported by the author's judgment, our analysis identifies three main error categories (examples are given in Table 9): improper inferential linking, inaccurate paraphrasing, and citing irrelevant sources.
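For concreteness, a sketch of this sentence-level check; the prompt wording and the `get_cited_passages` and `generate` hooks are illustrative placeholders rather than the exact setup we used.

from nltk.tokenize import sent_tokenize

def find_unsupported(article_text, get_cited_passages, generate):
    """Sentence-level citation check, sketched after the procedure above.

    `get_cited_passages(sentence)` returns the source passages cited by a sentence,
    and `generate(prompt)` is a placeholder wrapper around Mistral 7B-Instruct (or
    any evaluator LLM) that returns the model's text output.
    """
    unsupported = []
    for sentence in sent_tokenize(article_text):
        passages = get_cited_passages(sentence)
        if not passages:
            unsupported.append((sentence, "lacks citation"))
            continue
        prompt = (
            "Passages:\n" + "\n\n".join(passages)
            + f"\n\nClaim: {sentence}\n\n"
            + "Do the passages fully support the claim? Answer Yes or No."
        )
        if not generate(prompt).strip().lower().startswith("yes"):
            unsupported.append((sentence, "unsupported"))
    return unsupported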
We show the error distribution in Figure 6. No-
tably, the most common errors stem from the ten-
dency of LLMs to form improper inferential links
between different pieces of information presented
in the context window. Our analysis of citation
quality suggests that, in addition to avoiding hallu-
cinations, future research in grounded text gener-
ation should also focus on preventing LLMs from
making overly inferential leaps based on the pro-
vided information.
D    Human Evaluation Details

We recruited 10 experienced Wikipedia editors to participate in our study by creating a research page on Meta-Wiki (https://meta.wikimedia.org) and reaching out to active editors who have recently approved articles for Wikipedia. Since evaluating Wikipedia-like articles is time-consuming and requires expertise, we paid each participant $50 for the study. Our participant group includes 3 editors with 1-5 years of experience, 4 with 6-10 years, and 3 with over 15 years of contribution. The study was approved by the Institutional Review Board of our institution, and the participants signed the consent form through Qualtrics questionnaires before the study started.
To streamline the evaluation of grounded articles,
we developed a web application, which features a
side-by-side display of the article and its citation
snippets, to gather ratings and open-ended feedback
for each article. Figure 7 shows a screenshot of our web application, and the full article produced by STORM is included in Table 12. For the human evaluation, we use a 1 to 7 scale for more fine-grained assessment; the grading rubric is included in Table 10.
We collected the pairwise preferences and the perceived usefulness of STORM via an online questionnaire. Specifically, for perceived usefulness, we asked editors to rate their agreement with the statements “I think it can be specifically helpful for my pre-writing stage (e.g., collecting relevant sources, outlining, drafting).”, “I think it will help me edit a Wikipedia article for a new topic”, and “I think it can be a potentially useful tool for the Wikipedia community” on a 1-5 Likert scale, corresponding to Strongly disagree, Somewhat disagree, Neither agree nor disagree, Somewhat agree, and Strongly agree.
E    Error Analysis
While articles produced by STORM are preferred under both automatic metrics and human evaluation, experienced editors still identified multiple problems with the machine-generated articles. We analyze their free-form comments and summarize the major issues in Table 11.
The primary issue raised is that the generated articles often contain emotional language and lack neutrality, largely because of the source material: STORM currently retrieves grounding sources from the Internet, which is not neutral and itself contains considerable promotional content.
Addressing this bias in the pre-writing stage repre-
sents a valuable direction for future research. An-
other major issue is the red herring fallacy or the
over-association of unrelated facts. Addressing this
challenge calls for high-level sensemaking rather
than mere fact-level verification.
Interest Level
1: Not engaging at all; no attempt to capture the reader’s attention.
2: Slightly engaging with rare moments that capture attention.
3: Fairly engaging with a basic narrative but lacking depth.
4: Moderately engaging with several interesting points.
5: Quite engaging with a well-structured narrative and noteworthy points that frequently capture and retain attention.
6: Very engaging with a compelling narrative that captures and mostly retains attention.
7: Exceptionally engaging throughout, with a compelling narrative that consistently stimulates interest.
Coherence and Organization
1: Disorganized; lacks logical structure and coherence.
2: Poor organization; some structure is evident but very weak.
3: Fairly organized; a basic structure is present but not consistently followed.
4: Organized; a clear structure is mostly followed with some lapses in coherence.
5: Good organization; a clear structure with minor lapses in coherence.
6: Very well-organized; a logical structure with transitions that effectively guide the reader.
7: Excellently organized; the article is logically structured with seamless transitions and a clear argument.
Relevance and Focus
1: Off-topic; the content does not align with the headline or core subject.
2: Mostly off-topic with some relevant points.
3: Somewhat on topic but with several digressions; the core subject is evident but not consistently adhered to.
4: Generally on topic, despite a few unrelated details.
5: Mostly on topic and focused; the narrative has a consistent relevance to the core subject with infrequent digressions.
6: Highly relevant with a focused narrative and purpose.
7: Exceptionally focused and entirely on topic; the article is tightly centered on the subject, with every piece of information contributing to a
comprehensive understanding of the topic.
Broad Coverage
1: Severely lacking; offers little to no coverage of the topic’s primary aspects, resulting in a very narrow perspective.
2: Minimal coverage; addresses only a small selection of the topic’s main aspects, with significant omissions.
3: Partial coverage; includes some of the topic’s main aspects but misses others, resulting in an incomplete portrayal.
4: Acceptable breadth; covers most main aspects, though it may stray into minor unnecessary details or overlook some relevant points.
5: Good coverage; achieves broad coverage of the topic, hitting on all major points with minimal extraneous information.
6: Comprehensive; provides thorough coverage of all significant aspects of the topic, with a well-balanced focus.
7: Exemplary in breadth; delivers outstanding coverage, thoroughly detailing all crucial aspects of the topic without including irrelevant information.
Verifiability
1: No supporting evidence; claims are unsubstantiated.
2: Rarely supported with evidence; many claims are unsubstantiated.
3: Inconsistently verified; some claims are supported; evidence is occasionally provided.
4: Generally verified; claims are usually supported with evidence; however, there might be a few instances where verification is lacking.
5: Well-supported; claims are very well supported with credible evidence, and instances of unsupported claims are rare.
6: Very well-supported; almost every claim is substantiated with credible evidence, showing a high level of thorough verification.
7: Exemplary verification; each claim is supported by robust, credible evidence from authoritative sources, reflecting strict adherence to the no
original research policy.
Table 10: Scoring rubrics on a 1-7 scale for human evaluation.
Issue: Use of emotional words, unneutral (mentioned 12 times)
Example comments:
- "The word “significant” is used 17 times in this article. Vague and unsupported claims are made about broader political importance and “pivotal role[s]”, and is unencyclopedic." (comment on article Lahaina, Hawaii)
- "[...] but they still have not fixed the issue of neutral point of view. It is also evident in this article that the writer's standpoint is biased towards Taylor Swift. Other than that, it did a good job at summarizing key points and putting depth into this." (comment on article Speak Now (Taylor's Version))
- "“The film was also featured in an art and film festival hosted by The California Endowment, highlighting the power of stories in reshaping narratives about communities.” Yes, technically the source says that, but it's a stretch to say in Wikipedia voice and just sounds like non-neutral, promotional prose." (comment on article Gehraiyaan)

Issue: Red herring fallacy, associating unrelated sources (mentioned 11 times)
Example comments:
- "Polling from America shouldn't be included and links to climate change shouldn't be made unless explicitly connected by the source." (comment on article Typhoon Hinnamnor)
- "Sourcing seems mostly fine, though some aren't directly related (Ex. 39, 40)." (comment on article Gehraiyaan)
- "Here is a lengthy digression about KISS, not necessary because the article on the band should be linked to." (comment on article 2022 AFL Grand Final)

Issue: Missing important information (mentioned 6 times)
Example comments:
- "“One study, conducted by Sinéad Griffin, a physicist at the Lawrence Berkeley National Laboratory, provided some analysis of LK-99's abilities using supercomputer simulations[20].” This is not enough information about the analysis, which would have been very useful in the article." (comment on article LK-99)
- "Although the earthquake's immediate aftermath and response are adequately covered, there could be more about the long-term socioeconomic impact and recovery processes." (comment on article 2022 West Java earthquake)

Issue: Improper handling of time-sensitive information (mentioned 5 times)
Example comments:
- "Words like “now” should be avoided in Wikipedia articles to prevent them from becoming dated and phrases such as “as of December 2023” should be used instead." (comment on article Cyclone Batsirai)
- "“as of December 13” doesn't specify a year, and is old information" (comment on article 2022 West Java earthquake)

Issue: Section organization problem (mentioned 5 times)
Example comments:
- "too many subsections in the “Recovery and Rehabilitation” section" (comment on article 2022 West Java earthquake)
- "I do not like how the article is organized, with too many headers cluttering the article, making it not as readable. Other than that, the AI did great work on the piece." (comment on article 2022 Crimean Bridge explosion)

Table 11: Summary of major issues found in articles produced by STORM.
Figure 7: Screenshot of the web application for evaluating the grounded article.
Taylor Hawkins
Oliver Taylor Hawkins (1972-2022) was an American musician, best known as the drummer for the rock
band Foo Fighters[1]. Born in Fort Walton, Texas, Hawkins’ love for music was ignited at a young age,
particularly after watching a Queen concert in 1982[2][3][5]. He kick-started his professional career
as the drummer for Alanis Morissette’s band during her world tour for the hit album ’Jagged Little
Pill’[8][9]. His talents were recognized by Dave Grohl, who invited him to join the Foo Fighters in
1997, marking the beginning of his impactful tenure with the band[7][8].
Hawkins was celebrated for his versatile drumming style, drawing influence from renowned drummers like
Roger Taylor, Neil Peart, Phil Collins, Alex Van Halen, and Stewart Copeland[14]. His performances,
marked by a unique energy and aggressive style of play, earned him recognition as one of the top rock
drummers of his era[15]. Apart from his role in the Foo Fighters, Hawkins’ passion for music saw
him involved in numerous side projects and collaborations, cementing his place in the world of rock
music[10].
Outside of his professional life, Hawkins was known for his resilience and dedication to his family.
Despite personal struggles, including a near-fatal drug overdose in 2001, Hawkins remained committed to
his musical career[4][9]. His legacy continues to inspire musicians and fans alike, as his contributions
to rock music, coupled with his indomitable spirit, made him an unforgettable icon in the music
industry[13].
Hawkins’ sudden death in 2022 while on tour in Bogotá, Colombia, sent shockwaves through the music
world[34]. Tributes poured in from around the globe, reflecting the respect and admiration Hawkins
had garnered during his lifetime[21][31]. His life and career were honored at a star-studded tribute
concert in Los Angeles, attesting to the enduring impact of his music[22].
# Early Life and Background
Oliver Taylor Hawkins, known as Taylor Hawkins, was born and raised in Fort Walton, Texas[3]. His
family moved to Laguna Beach, California when he was four years old[3]. He has two younger siblings,
a brother named Jason, and a sister named Heather[3]. As a child, Hawkins was particularly influenced
by his paternal grandmother, Josie Hawkins, who had grown up during the Great Depression and lived in
Jackson, Mississippi[1].
During his high school days at Laguna Beach High School, from where he graduated in 1990, he became
friends with Jon Davison, who later became the lead vocalist of the band Yes[2][3]. His interest in
music was nurtured from an early age, particularly after watching a Queen concert in 1982 which inspired
him to learn to play the drums[2][5]. He noted that music was a constant presence in his family home[5].
Despite facing certain hardships during his upbringing, including his mother’s struggles with "demons",
Hawkins pursued his musical ambitions[4]. He credits his older sister Heather for taking care of the
family during difficult times[4].
His first major musical experience came from playing drums for Alanis Morissette’s album, Jagged Little
Pill, and accompanying her on the subsequent tour[3]. This marked the beginning of his professional
career in the music industry.
# Career
Taylor Hawkins began his professional music career playing in Alanis Morissette’s band during her
18-month world tour in support of the hit album ’Jagged Little Pill’ from 1995 to 1997[8][9]. His
performances not only in the tour but also in the music videos for “You Oughta Know”, “All I Really Want”
and “You Learn” introduced him to the world of rock music and ultimately led to his meeting with Dave
Grohl[8]. Throughout this time, Hawkins contributed significantly to the band’s sound and performance,
transforming the songs from their original drum loop format to a rock-band vibe that resonated with
audiences[1][7].
In 1997, Hawkins was asked by Grohl to join the Foo Fighters, an invitation that he readily accepted[7][8].
At the time, Grohl thought it was a long shot to recruit Hawkins given that Morissette was at the height
of her career, but Hawkins’ desire to be a part of a rock band compelled him to make the move[7]. This
marked the beginning of Hawkins’ tenure as the drummer of the Foo Fighters, a role that he would play
until his passing[6][9].
Apart from his work with Morissette and the Foo Fighters, Hawkins had an array of other musical
experiences[10]. He drummed for Sass Jordan before joining Morissette’s touring band[10]. He was part
of an ad hoc drum supergroup called SOS Allstars and filled the void for Coheed and Cambria’s 2007
album after their drummer Josh Eppard left the group[10]. In addition, Hawkins formed his own side
project, the Coattail Riders, in 2005, through which he recorded his own music and took the project on
the road, performing in small clubs despite the Foo Fighters’ arena-status[7]. His son, Shane Hawkins,
has since taken on his father’s legacy, joining the Foo Fighters for a performance during the Boston
Calling Music Festival in 2023[6].
# Musical Style and Influences
Taylor Hawkins was a profound drummer, with his musical style and influences spreading across a wide
array of rock genres[11]. Known for his passionate fandom of groups that came before him, Hawkins
regularly expressed his admiration for bands like Rush, Genesis, and the Police, all of which featured
some of the greatest drummers in rock history like Neil Peart, Phil Collins, and Stewart Copeland[11].
He was heavily influenced by his love for classic rock, as evidenced by his performances, where he
covered songs from bands like Van Halen[11].
Hawkins drew influences from a variety of drumming styles, developing a signature style inspired by
greats like Roger Taylor, Neil Peart, Phil Collins, Alex Van Halen, and Stewart Copeland[14]. This
distinctive style and influence extended to his drum kit, which incorporated elements like rototoms
and concert toms[14].
Beyond his influences, Hawkins had a unique energy that made him stand out as a drummer. His performances
were recognized for their power, and he was known for his enthusiastic and aggressive style of play[15].
This earned him recognition as one of the top rock drummers of his time, with his passion for music
living on through his performances[14].
Through his career, Hawkins left an indelible mark on rock music, through his distinct style, passion,
and contributions to the music industry[13]. His love for music and dedication to his craft made him
an unforgettable icon in the world of rock music[13].
# Personal Life
Taylor Hawkins married Alison Hawkins, an American celebrity and entrepreneur, in 2005[18]. The couple
had three children, Oliver, Annabelle, and Everleigh[19]. Hawkins’ commitment to his family was evident;
in fact, he even wrote a song for his middle child, Annabelle[9].
In his personal life, Hawkins had also struggled with drug use, which nearly claimed his life in a 2001
overdose[9][7][4]. However, he managed to overcome this challenge, and later expressed gratitude for
the experience as a lesson that allowed him to realize the destructive path he was on[7].
Outside of his main role in the Foo Fighters, Hawkins also pursued various side projects including the
Birds of Satan, NHC, and Chevy Metal. His motivation for such ventures was a constant drive to create
and his love for music[7]. Hawkins was also known for his unabashed fanboy nature, often vocalizing
his admiration for fellow musicians and his heroes[7].
# Legacy and Impact
Taylor Hawkins was known for his raw and authentic drumming style, described as "courageous, damaged
and unflinchingly authentic"[20]. His work with the Foo Fighters, as well as his various collaborations
and side projects, made him a celebrated figure in rock ‘n’ roll[10].
Hawkins’ death in 2022 was met with heartfelt tributes from colleagues and fans around the world.
Notable tributes came from rock legends like Roger Taylor of Queen, who considered Hawkins as a kind,
brilliant man and an inspirational mentor, likening his death to "losing a younger favourite brother"[21].
Similarly, Led Zeppelin’s Jimmy Page admired his technique, energy and spirited enthusiasm[21].
An LA tribute concert held in his honor included guest drummers like Lars Ulrich of Metallica, Travis
Barker of blink-182, and Brad Wilk of Rage Against the Machine. Singers like Miley Cyrus and Alanis
Morissette also performed at the concert[22].
Apart from his music, Taylor Hawkins also contributed to charities Music Support and MusiCares, both of
which were chosen by the Hawkins family[23]. He had received numerous accolades throughout his career,
including 27 Grammy nominations, of which he won 14[2]. In 2021, the Foo Fighters were inducted into
the Rock and Roll Hall of Fame[9].
# Discography
Taylor Hawkins also led a notable music career through his own side projects and collaborations[10].
Aside from his work with the Foo Fighters, Hawkins formed and fronted the band Taylor Hawkins & The
Coattail Riders, a project which originated from jamming sessions with his friend Drew Hester[10].
### Taylor Hawkins & The Coattail Riders
Taylor Hawkins & The Coattail Riders, a band formed in 2004, have released three albums and their
music spans genres including Hard Rock, Art Rock, and Alternative Rock[24][25][26]. The band grew from
an initial casual jamming session, gradually evolving into a more formal arrangement that led to the
production of record albums. Notably, these albums featured guest appearances by renowned musicians
such as Dave Grohl, Queen’s Brian May and Roger Taylor, The Cars’ Elliot Easton, Perry Farrell, and
Jon Davison, who is a school friend of Hawkins’[10].
### Red Light Fever
Red Light Fever, released on April 19, 2010, was the band’s first album[29][30]. Prior to its release,
Hawkins revealed in an interview that the album had completed the recording and production stages, but
its title and release date were yet to be determined[29]. Red Light Fever was recorded at the Foo
Fighters’ Studio 606 in California and featured guest musicians such as Brian May and Roger Taylor of
Queen, Dave Grohl of Foo Fighters, and Elliot Easton of The Cars[29][30].
## Get the Money
Get the Money, the third album from Taylor Hawkins & The Coattail Riders, was released on November 8,
2019[29]. The album’s first single, "Crossed the Line", released on October 15, 2019, featured Dave
Grohl and Jon Davison, the frontman of Yes[29]. The music video for the single "I Really Blew It" also
featured appearances from Grohl and Perry Farrell[29].
# Collaborations and Guest Appearances
Throughout his career, Taylor Hawkins collaborated with various prominent artists and bands.
The
Coattail Riders’ albums notably featured appearances from luminaries such as Brian May and Roger Taylor
of Queen, Chrissie Hynde, Nancy Wilson of Heart, Sex Pistol Steve Jones and James Gang’s Joe Walsh[28].
Hawkins also fronted another group, The Birds of Satan, which evolved from his heavy rock covers band,
Chevy Metal[28].
Despite his diverse musical engagements, Hawkins always maintained a close allegiance with the Foo
Fighters, which remained the center of his music life[7][28].
# Tragic Passing
Taylor Hawkins, the esteemed drummer of the alt-rock band Foo Fighters, passed away suddenly on March
25, 2022, while on tour with his band in Bogotá, Colombia[34]. The official cause of death was cardiac
arrest, though inquiries were raised concerning the presence of drugs in his system and their potential
contribution to his death[33][34]. On the night of his passing, paramedics were called to the Four
Seasons hotel in Bogotá due to reports of chest pain from an unnamed guest, later revealed to be
Hawkins[34]. Unfortunately, resuscitation efforts were unsuccessful, and Hawkins was declared dead at
the scene[34].
The news of Hawkins’ sudden demise was announced on the morning of March 25th, 2022, which left the music
world in shock[32]. The band confirmed the news with a short statement, expressing their devastation
at the loss of Hawkins, whose "musical spirit and infectious laughter" would live on forever[32].
As a result of Hawkins’ untimely passing, the band canceled their ongoing South American tour[33]. The
festival stage at the Estéreo Picnic Festival, where the Foo Fighters were scheduled to perform that
night, was transformed into a candlelight vigil in memory of Hawkins[33].
## Tributes and Remembrances
In the wake of Hawkins’ death, tributes from fans and colleagues alike poured in from around the
world[21][31]. Among the many paying their respects were legendary rock and roll musicians like Roger
Taylor, the drummer of Queen, who Hawkins credited with inspiring his own career behind the drum set[21].
In heartfelt social media posts, Taylor described Hawkins as an "inspirational mentor" and a "kind
brilliant man"[21], while Led Zeppelin’s Jimmy Page reminisced about sharing the stage with Hawkins
and praised his "technique, energy and spirited enthusiasm"[21].
There were also numerous onstage tributes to Hawkins. Notably, Miley Cyrus expressed her grief and sent
peaceful wishes to the Foo Fighters and the Hawkins family during a performance at Lollapalooza[31].
Similarly, Liam Gallagher of Oasis dedicated one of the band’s biggest hits to Hawkins during a concert
at the Royal Albert Hall in London[31].
Fans gathered outside the hotel where Hawkins died, lighting candles, leaving flowers, and singing the
band’s songs in his honor[31].
Hawkins’ life and career were celebrated in a star-studded tribute concert in Los Angeles, which saw
performances from over 50 musicians, including his former bands and colleagues from Def Leppard, Queen,
and Foo Fighters[22].
Table 12: STORM’s generated article for “Taylor Hawkins”. “#”, “##” indicate the section title and subsection title
respectively. Numbers in brackets indicate the cited references.