Structured RAG for Answering Aggregative Questions
Preprint
Omri Koshorek Niv Granot Aviv Alloni Shahar Admati
Roee Hendel Ido Weiss Alan Arazi Shay-Nitzan Cohen Yonatan Belinkov
{omrik,nivg,aviva,shahara,roeeh,idow,alana,shayc,yonatanb}@ai21.com
ABSTRACT
Retrieval-Augmented Generation (RAG) has become the dominant approach for
answering questions over large corpora. However, current datasets and methods
are highly focused on cases where only a small part of the corpus (usually a few
paragraphs) is relevant per query, and fail to capture the rich world of aggrega-
tive queries. These require gathering information from a large set of documents
and reasoning over them. To address this gap, we propose S-RAG, an approach
specifically designed for such queries. At ingestion time, S-RAG constructs a
structured representation of the corpus; at inference time, it translates natural-
language queries into formal queries over said representation. To validate our ap-
proach and promote further research in this area, we introduce two new datasets of
aggregative queries: HOTELS and WORLD CUP. Experiments with S-RAG on
the newly introduced datasets, as well as on a public benchmark, demonstrate that
it substantially outperforms both common RAG systems and long-context LLMs.¹

1 INTRODUCTION
Retrieval-Augmented Generation (RAG) has emerged as a leading approach for the task of Open
Book Question Answering (OBQA), attracting significant attention both in the research community
and in real-world applications (Lewis et al., 2020; Guu et al., 2020; Yoran et al., 2023; Ram et al.,
2023; Izacard et al., 2023; Gao et al., 2023; Siriwardhana et al., 2023; Fan et al., 2024). Most prior
work has focused on simple queries, where the answer to a given question is explicitly mentioned
within a short text segment in the corpus, and on multi-hop queries, which can be decomposed into
smaller steps, each requiring only a few pieces of evidence.
While RAG systems made substantial progress for the aforementioned query types, the task of an-
swering aggregative queries still lags behind. Such queries require retrieving a large set of evidence
units from many documents and performing reasoning over the retrieved information. Consider the
real-world scenario of a financial analyst tasked with answering a question such as, ‘What is the
average ARR for South American companies with more than 1,000 employees?’. While such a
query could be easily answered given a structured database, it becomes significantly harder when
the corpus is private and unstructured. In this setting, RAG systems cannot rely on the LLM’s para-
metric knowledge; instead, they must digest the unstructured corpus and reason over it to generate
an answer, introducing several key challenges: Information about the ARR of different companies
is likely to be distributed across many documents, and even if the full set of relevant evidence is
retrieved, the LLM must still perform an aggregative operation across them. Moreover, aggregative
queries often involve complex filtering constraints (e.g., ‘before 2020’, ‘greater than 200 kg’), which
vector-based retrieval systems often struggle to handle effectively (Malaviya et al., 2023).
Current RAG systems handle aggregative questions by supplying the LLM with a textual context that
is supposed to contain the information required to formulate an answer. This context is constructed
¹ Core Contributors: OK, NG, AvAl, SA, YB. Project management: OK, NG, SNC, YB. Hands-on implementation, research and development: OK, AvAl, SA, NG. Additional experiments: RH, IW. Paper writing: OK, YB, NG, SNC, AlAr.
Figure 1: S-RAG overview. Ingestion phase (upper): given a small set of questions and documents,
the system predicts a schema. Then it predicts a record for each document in the corpus, populating
a structured DB. Inference phase (lower): A user query is translated into an SQL query that is run
on the database to return an answer.
either by retrieving relevant text units using vector-based representations, or by providing the entire
corpus as input, leveraging the extended context windows of LLMs. Both strategies, however, face
substantial limitations in practice. Vector-based retrieval often struggles to capture domain-specific
terminology, depends on document chunking and therefore limits long-range contextualization, and
requires predefining the number of chunks to retrieve as a hyperparameter (Weller et al., 2025).
Conversely, full-context approaches are restricted by the LLM’s context size and its limited long-
range reasoning capabilities (Xu et al., 2023).
In this work, we introduce Structured Retrieval-Augmented Generation (S-RAG), a system designed
to address the limitations of existing techniques in answering aggregative queries over a private cor-
pus. Our approach relies on the assumption that each document in the corpus represents an instance
of a common entity, and thus documents share recurring content attributes. During the ingestion
phase, S-RAG exploits those commonalities. Given a small set of documents and representative
questions, a schema that captures these attributes is induced. For example, in a corpus where each
document corresponds to a hotel, the predicted schema might include attributes such as hotel name,
city, and guest rating. Given the prediction, each document is mapped into an instance of the schema,
and all resulting records are stored in a database. At inference time, the user query is translated into
a formal language query (e.g., SQL), which is run over the ingested database. Figure 1 illustrates
the ingestion phase (in the upper part) and inference phase (in the lower part).
To facilitate future research in this area, we introduce two new datasets for aggregative question
answering: (1) HOTELS: a fully synthetic dataset composed of generated booking-like hotel pages,
alongside aggregative queries (e.g., ‘What is the availability status of the hotel page with the highest
number of reviews?’); and (2) WORLD CUP: a partially synthetic dataset, with Wikipedia pages
of FIFA World Cup tournaments as the corpus, alongside generated aggregative questions. Both
datasets contain exclusively aggregative questions that require complex reasoning across dozens of
text units.²
We evaluate the proposed approach on the two newly introduced datasets, as well as on Fi-
nanceBench (Islam et al., 2023), a public benchmark designed to resemble queries posed by fi-
nancial analysts. Experimental results demonstrate the superiority of our approach compared to
vector-based retrieval, full-corpus methods, and real-world deployed services.
To conclude, our main contributions are as follows:
² The datasets are publicly available at: https://huggingface.co/datasets/ai21labs/aggregative_questions
1. We highlight the importance of aggregative queries over a private corpus for real-world sce-
narios and demonstrate the limitations of existing benchmarks and methods in addressing
this challenge.
2. We introduce two new datasets, HOTELS and WORLD CUP, specifically designed to support
future research in this direction.
3. We propose a novel approach, S-RAG, for handling aggregative queries, and show that it
significantly outperforms existing methods.
2 AGGREGATIVE QUESTIONS OVER AN UNSTRUCTURED CORPUS
Retrieval-augmented generation (RAG) has become the prevailing paradigm for addressing the
Open-Book Question Answering (OBQA) task in recent research (Gao et al., 2023; Asai et al.,
2024; Wolfson et al., 2025), and it is now widely adopted in industrial applications as well. Substan-
tial progress has been made in answering simple queries, for which the answer is explicitly provided
within a single document. In addition, considerable effort has focused on improving performance
for multi-hop questions, which require retrieval of only a few evidence units per hop (Yang et al.,
2018; Trivedi et al., 2022; Tang & Yang, 2024). Despite this progress, aggregative questions, where
answering a question requires retrieval and reasoning over a large collection of evidence spread
across a large set of documents, remain relatively unexplored.
Yet aggregative questions are highly relevant in practical settings, especially for organizations work-
ing with large, often unstructured, private collections of documents. For instance, an HR specialist
might query a collection of CVs with a question such as ‘What is the average number of years of
education for candidates outside the US?’. Although the documents in such a corpus are written
independently and lack a rigid structure, we can assume that all documents share some information
attributes, like the candidate’s name, years of education, previous experience, and others.
Standard RAG systems address the OBQA task by providing an LLM with a context composed of
retrieved evidence units relevant to the query (Lewis et al., 2020; Ram et al., 2023). The retrieval
part is typically performed using dense or sparse text embeddings. Such an approach would face
several challenges when dealing with aggregative queries:
1. Completeness: Failing to retrieve a single required piece of evidence might lead to an
incorrect or incomplete answer. For example, consider the question ‘Who is the youngest
candidate?’ – all of the CVs in the corpus must be retrieved to answer correctly.
2. Bounded context size: Since the LLM context has a fixed token budget, typical RAG
systems define a hyper-parameter K for the number of chunks to retrieve. Any question
that requires integrating information from more than K sources cannot be fully addressed.
Furthermore, the resulting context might be longer than the LLM’s context window.
3. Long-range contextualization: Analyst queries often target documents with complex
structures containing deeply nested sections and subsections (e.g., financial reports). Con-
sequently, methods that rely on naive chunking are likely to fail to capture the full semantic
meaning of such text units (Anthropic, 2024).
4. Embedder limitations: As shown by Weller et al. (2025), there are inherent representa-
tional limitations to dense embedding models. Furthermore, sparse and dense embedders
are likely to struggle to capture the full semantic meaning of filters (Malaviya et al., 2023),
especially when handling named entities to which they were not exposed at training time.
3 S-RAG: STRUCTURED RETRIEVAL-AUGMENTED GENERATION
This section describes S-RAG, our proposed approach for answering aggregative questions over a
domain-specific corpus. Similarly to vector-based retrieval, we suggest a pipeline consisting of an
offline Ingestion phase (§3.2) and an online Inference phase (§3.3). See Figure 1 for an illustration.
3.1 PRELIMINARIES
Consider a corpus D = {d_1, d_2, ..., d_n} of n documents, where each document d_i corresponds to
an instance of an entity, described by a schema S = {a_1, a_2, ..., a_m}, where each a_j denotes a
primitive attribute with a predefined type. For example, in a corpus of CVs, the entity type is a CV,
and the underlying schema may include attributes such as an integer attribute ‘years of education’
and a string attribute ‘email’. For each document d_i, we define a mapping to its record r:

    r(d_i) = {(a_j, v_{i,j}) | a_j ∈ S},    (1)

where v_{i,j} is the value of attribute a_j expressed in document d_i. Importantly, the value v_{i,j} may be
empty in a document d_i. An aggregative question typically involves examining a_j and the corre-
sponding set {v_{1,j}, v_{2,j}, ..., v_{n,j}}, optionally combined with a reasoning step. This formulation can be
naturally extended to multiple attributes. Figure 2 illustrates our setting.
Figure 2: Illustration of a naive CV corpus, its schema, and a single record. An example of an aggregative
query on such a corpus could be: ‘Which candidates have more than two years of experience?’
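To make this setting concrete, the following is a minimal sketch of how a schema S and a record r(d_i) could be represented in code; the attribute names and values are illustrative and not taken from S-RAG's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class Attribute:
    """A primitive schema attribute a_j: a name, a type, and optional guidance for the LLM."""
    name: str
    type: str                      # e.g. "string", "integer", "number", "boolean"
    description: str = ""
    examples: list = field(default_factory=list)

@dataclass
class Record:
    """The record r(d_i): one value v_{i,j} per attribute; None marks an attribute absent from d_i."""
    doc_id: str
    values: dict[str, Optional[Any]]

# Toy CV-corpus schema in the spirit of the running example.
cv_schema = [
    Attribute("candidate_name", "string", "Full name of the candidate."),
    Attribute("years_of_education", "integer", "Total years of formal education."),
    Attribute("email", "string", "Contact email address."),
]

# One document mapped onto the schema; the missing attribute stays None.
record = Record("cv_001", {"candidate_name": "Jane Doe",
                           "years_of_education": 17,
                           "email": None})
```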
3.2 INGESTION
The ingestion phase of S-RAG aims to derive a structured representation for each document in the
corpus, capturing the key information most likely to be queried. This process consists of two steps:
3.2.1 SCHEMA PREDICTION
In this step, S-RAG predicts a schema S = {a_1, a_2, ..., a_m} that specifies the entity represented
by each document in the corpus. The schema is designed to capture recurring attributes across
documents, i.e., attributes that are likely to be queried at inference time. We implement this stage
using an iterative algorithm in which an LLM is instructed to create and refine a JSON schema
given a small set of documents and questions. The LLM is prompted to predict a set of attributes,
and to provide for each attribute not only its name but also its type, description, and several example
values. The full prompts used for schema generation are provided in Appendix B.³ We use zero-shot
prompting with 12 documents and 10 questions, quantities tailored for real-world use cases, where
a customer is typically expected to provide only a small set of example documents and queries.
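A minimal sketch of this iterative loop is shown below. The call_llm helper is a placeholder for whatever chat-completion API is used, and the prompts are heavily condensed paraphrases of the ones in Appendix B; the four-iteration default mirrors the setting reported there.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (e.g., GPT-4o); not part of this sketch."""
    raise NotImplementedError

def predict_schema(documents: list, questions: list, n_iterations: int = 4) -> dict:
    """Create a JSON schema from a few documents, then refine it against example questions."""
    schema = None
    for _ in range(n_iterations):
        if schema is None:
            prompt = ("Extract a single JSON schema that captures the recurring concepts "
                      "across these documents:\n\n" + "\n\n".join(documents))
        else:
            prompt = ("Refine this JSON schema so that a table built from it can answer "
                      "questions like the ones below. Return only the schema.\n\n"
                      f"Schema:\n{json.dumps(schema, indent=2)}\n\n"
                      "Questions:\n" + "\n".join(questions) + "\n\n"
                      "Documents:\n" + "\n\n".join(documents))
        schema = json.loads(call_llm(prompt))  # expects a JSON-schema object with "properties"
    return schema
```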
3.2.2 RECORD PREDICTION
Given a document d_i and a schema S, we prompt an LLM to predict the corresponding record r_i,
which contains a value for each attribute a_j ∈ S. The LLM is provided with the list of attribute
names, types, and descriptions, and generates the output set {v_{i,1}, v_{i,2}, ..., v_{i,m}}. Each predicted
value v_{i,j} is then validated by post-processing code to ensure it matches the expected type of a_j.
Since the meaning of a value v_{i,j} can be expressed in multiple ways (e.g., the number one million
may appear as 1,000,000, 1M, or simply 1), attribute descriptions and examples are crucial for
guiding the LLM in lexicalizing v_{i,j} (e.g., capitalization, units of measure). Because the same
descriptions and examples are shared across the prediction of different records, this process enables
cross-document standardization.

After applying this prediction process to all documents in the corpus D, we store the resulting
set of records {r_1, r_2, ..., r_n} in an SQL table. Finally, we perform post-prediction processing to
compute attribute-level statistics based on their types (more details are provided in Appendix D).
These statistics are used at inference time, as detailed next.
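The sketch below illustrates record prediction with type validation and storage in a single SQL table, using SQLite for concreteness; call_llm is again a placeholder, and the exact prompts and database backend used by S-RAG may differ.

```python
import json
import sqlite3

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call; not part of this sketch."""
    raise NotImplementedError

def to_bool(value) -> bool:
    return value.strip().lower() in {"true", "yes", "1"} if isinstance(value, str) else bool(value)

CASTS = {"string": str, "integer": int, "number": float, "boolean": to_bool}

def predict_record(document: str, schema: dict) -> dict:
    """Ask the LLM for one value per attribute, then keep only values that pass type validation."""
    props = schema["properties"]
    raw = json.loads(call_llm(
        "Return a JSON object with one value per attribute for this document.\n"
        f"Attributes (name, type, description, examples):\n{json.dumps(props, indent=2)}\n\n"
        f"Document:\n{document}"))
    record = {}
    for name, spec in props.items():
        value = raw.get(name)
        try:
            record[name] = None if value is None else CASTS[spec["type"]](value)
        except (ValueError, TypeError):
            record[name] = None  # a value that fails validation is treated as missing
    return record

def store_records(records: list, schema: dict, db_path: str = "srag.db") -> None:
    """Store all predicted records as rows of a single table."""
    names = list(schema["properties"])
    con = sqlite3.connect(db_path)
    con.execute(f'CREATE TABLE IF NOT EXISTS entities ({", ".join(names)})')
    con.executemany(
        f'INSERT INTO entities VALUES ({", ".join("?" for _ in names)})',
        [[r.get(n) for n in names] for r in records])
    con.commit()
    con.close()
```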
³ For simplicity at inference time, we exclude list and nested attributes, since these would require reasoning over multiple tables.
3.3 INFERENCE
At inference time, given a free-text question q, an LLM is instructed to translate it into a formal
query over the aforementioned SQL table. To enhance the quality of the generated query and avoid
ambiguity, the LLM receives as input the query q, the schema S and statistics for every column in
the DB. These statistics guide the LLM in mapping the semantic meaning of q to the appropriate
lexical filters or values in the formal query. The resulting query is executed against the SQL table,
and the output is stringified and supplied to the LLM as context.
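A condensed sketch of this inference step follows, again with call_llm as a placeholder and SQLite standing in for the actual database; the prompt wording and the statistics format are simplified assumptions rather than the exact ones used by S-RAG.

```python
import json
import sqlite3

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call; not part of this sketch."""
    raise NotImplementedError

def answer_question(question: str, schema: dict, column_stats: dict,
                    db_path: str = "srag.db") -> str:
    """Translate the question to SQL, execute it, and let the LLM verbalize the result."""
    sql = call_llm(
        "Write one SQLite query over the table `entities` that answers the question. "
        "Use the column statistics to choose exact filter values. Return only SQL.\n"
        f"Question: {question}\n"
        f"Schema: {json.dumps(schema['properties'])}\n"
        f"Column statistics: {json.dumps(column_stats)}")
    con = sqlite3.connect(db_path)
    rows = con.execute(sql).fetchall()
    con.close()
    # The stringified result is handed back to the LLM as the context for the final answer.
    return call_llm(f"Question: {question}\nQuery result: {rows}\n"
                    "Answer the question using only this result.")
```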
Hybrid Inference Mode When the predicted schema fails to capture certain attributes, particularly
rare ones, the answer to a free-text query cannot be derived directly from the SQL table. In such
cases, we view our system as an effective mechanism for narrowing a large corpus to a smaller set
of documents from which the answer can be inferred. To support this use case, we experimented
with Hybrid-S-RAG, which operates in two inference steps: (i) translating q into a formal query
whose execution returns a set of documents (rather than a direct answer), and (ii) applying classical
RAG on the retrieved documents.
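A sketch of this two-step mode, under the assumption that the ingested table also keeps a doc_id per record (so rows can be mapped back to source documents) and that vector_rag wraps a classic chunk-and-retrieve pipeline such as the one in Appendix A:

```python
import sqlite3

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder LLM call

def vector_rag(question: str, documents: list) -> str:
    raise NotImplementedError  # placeholder for a classic RAG pipeline over a small corpus

def hybrid_answer(question: str, schema: dict, column_stats: dict,
                  doc_lookup: dict, db_path: str = "srag.db") -> str:
    """(i) Use SQL to narrow the corpus to candidate documents; (ii) run classic RAG over them."""
    sql = call_llm(
        "Write one SQLite query over `entities` that returns the doc_id of every document "
        f"needed to answer the question. Return only SQL.\nQuestion: {question}\n"
        f"Schema: {schema['properties']}\nColumn statistics: {column_stats}")
    con = sqlite3.connect(db_path)
    doc_ids = [row[0] for row in con.execute(sql)]
    con.close()
    sub_corpus = [doc_lookup[d] for d in doc_ids if d in doc_lookup]
    return vector_rag(question, sub_corpus)
```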
4 AGGREGATIVE QUESTION ANSWERING DATASETS
While numerous OBQA datasets have been proposed in the literature, most of them consist of simple
or multi-hop questions (Abujabal et al., 2018; Malaviya et al., 2023; Tang & Yang, 2024; Cohen
et al., 2025). To support research in this area, we introduce two new OBQA datasets of aggregative
queries: HOTELS and WORLD CUP. The former is fully synthetic, containing synthetic documents
and questions, while the latter contains synthetic questions over natural documents.
4.1 AGGREGATIVE DATASET CREATION METHOD
To create a dataset of aggregative questions, we start by constructing a schema S that describes an
entity (e.g., hotel). S consists of m attributes (e.g., city, manager name, etc.), each defined by a
name, data type, and textual description. We then generate n records of S by employing an LLM or
code-based randomization. Each generated record corresponds to a distinct entity (e.g., Hilton Paris,
Marriott Prague). We then apply LLMs in two steps: (1) given a structured record r_i, verbalize its
attributes into a natural-language HTML document d_i (see Appendix C); and (2) given a random subset
of records, formulate an aggregative query over them and verbalize it in natural language.
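The generation pipeline can be summarized with the following sketch; the value sampler, subset size, and prompts are illustrative placeholders rather than the exact ones used to build HOTELS and WORLD CUP.

```python
import random

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder LLM call

def sample_value(spec: dict):
    """Toy value sampler; the actual pipeline uses LLM- or code-based randomization."""
    if spec["type"] == "integer":
        return random.randint(0, 500)
    if spec["type"] == "boolean":
        return random.choice([True, False])
    return random.choice(spec.get("examples", ["unknown"]))

def generate_dataset(schema: dict, n_records: int, n_queries: int):
    """Records -> verbalized documents -> aggregative question/answer pairs."""
    records = [{name: sample_value(spec) for name, spec in schema["properties"].items()}
               for _ in range(n_records)]
    documents = [call_llm("Write a natural-language HTML page describing this record, "
                          f"embedding some attributes inside regular sentences: {r}")
                 for r in records]
    qa_pairs = []
    for _ in range(n_queries):
        subset = random.sample(records, k=min(10, n_records))
        qa_pairs.append(call_llm("Formulate one aggregative question over these records and "
                                 f"state its answer: {subset}"))
    return records, documents, qa_pairs
```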
4.2 HOTELS AND WORLD CUP DATASETS
Hotels. This dataset is constructed around hotel description pages, where each entity e corresponds
to a single hotel. Each page contains basic properties such as the hotel name, rating, and number
of stars, as well as information about available facilities (e.g., swimming pool, airport shuttle).
An example document is provided in Appendix C. Using our fully automatic dataset generation
pipeline, we produced both the documents and the associated question-answer pairs. Our document
generation process ensures that some of these properties are embedded naturally within regular
sentences, unlike other unstructured benchmarks, which often present properties in a table or within
a dedicated section of the document (Arora et al., 2023). The resulting dataset consists of 350
documents and 193 questions. We consider this dataset to be more challenging, as public LLMs
have not been exposed to either the document contents or the questions.
World Cup. This dataset targets questions commonly posed within the popular domain of inter-
national soccer. The corpus consists of 22 Wikipedia pages, each corresponding to one of the FIFA
World Cup tournaments held between 1930 and 2022. To increase the difficulty of the corpus, we
removed the main summary table from each document, as it contains structured information about
many key attributes. Based on this corpus, we manually curated 22 structured records and used the
automatic method described in §4.1 to generate 83 aggregative questions. Although LLMs are likely
to possess prior knowledge of this corpus, evaluating RAG systems on these aggregative questions
provides an interesting and challenging benchmark.
Table 1 summarizes the statistics of the introduced datasets. It also compares them to FINANCE-
BENCH (Islam et al., 2023), a public benchmark designed to resemble queries posed by financial
analysts. In contrast to our new datasets, questions in FinanceBench typically require no more than
a single document to answer correctly (usually a single page).
Table 1: Statistics and characteristics of datasets used in our experiments.

Dataset        # Documents   Avg. Tokens / Doc   # Queries   Aggregative   LLM leak?
Hotels         350           596                 193         High          ×
World Cup      22            18881               88          High          ✓
FinanceBench   360           109592              150         Low           ✓

5 EXPERIMENTAL SETTINGS
5.1 BASELINES
We implement VectorRAG, a classic embedder-based approach. It performs chunking and dense
embedding at ingestion time, followed by chunk retrieval using a dense embedder at inference time
(see Appendix A). We note that VectorRAG is on par with the best-performing method reported
by Wang et al. (2025) on FINANCEBENCH, and therefore we consider it a well-performing system.
In addition, we provide results of the FullCorpus pipeline, in which each document is truncated to
a maximum length of 20,000 tokens. The context is then constructed by concatenating as many of
these document prefixes as can fit within the LLM’s context window.
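A minimal sketch of how such a FullCorpus context can be assembled; whitespace tokens stand in for a real tokenizer, and the overall context budget is an illustrative parameter rather than a number taken from the paper.

```python
def build_full_corpus_context(documents: list, max_doc_tokens: int = 20_000,
                              context_budget: int = 200_000) -> str:
    """Truncate each document to a prefix and pack prefixes until the context budget is reached."""
    parts, used = [], 0
    for doc in documents:
        prefix = doc.split()[:max_doc_tokens]      # crude whitespace "tokenization"
        if used + len(prefix) > context_budget:
            break                                  # remaining documents are simply dropped
        parts.append(" ".join(prefix))
        used += len(prefix)
    return "\n\n".join(parts)
```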
We also report the performance of a real-world deployed system, OpenAI-Responses by OpenAI
(OpenAI, 2025). This agentic framework supports tool use, including the FileSearch API. Although
it is a broader LLM-based system with capabilities extending beyond RAG, we include it in our eval-
uation for completeness. Unlike the baselines we implemented, OpenAI-Responses is a closed
system that directly outputs the answer, limiting our control over its internal implementation.
5.2 S-RAG VARIANTS
S-RAG is evaluated in three settings: (i) S-RAG-GoldSchema: skip the schema prediction phase
and provide an oracle schema to S-RAG; this schema contains all the attributes needed to an-
swer all of the queries in all aggregative benchmarks; (ii) S-RAG-InferredSchema: predict the schema
based on a small set of documents and queries, which are later discarded from the dataset; and (iii)
Hybrid-S-RAG: as explained in §3.3, we use S-RAG to narrow down the corpus and perform
VectorRAG over the resulting sub-corpus.
5.3 ANSWER GENERATOR
Every RAG system includes an answer generation step, in which an LLM generates an answer given
the retrieved context and the input question. For S-RAG, we employ GPT-4o for this step. In con-
trast, for the VectorRAG and FullCorpus baselines, we use o3, which has stronger reasoning
capabilities. This ensures fairness, since in our setting the reasoning steps are handled in SQL, while
in the baselines the LLM must perform them. In addition, to minimize the influence of the model’s
prior knowledge, we explicitly instructed the LLM in all experiments to generate answers solely on
the basis of the provided context, disregarding any external knowledge.
5.4 EVALUATION DATASETS
We evaluate S-RAG on the two newly introduced datasets, HOTELS and WORLD CUP, as well as
on the publicly available evaluation set of FINANCEBENCH. Since the FINANCEBENCH test set
includes both aggregative and non-aggregative queries, we report results on the full test set as well
as on the subset of 50 queries identified by the original authors as aggregative.⁴

⁴ Referred to as the “metrics-generated queries”.
In order to estimate the familiarity of existing LLMs with our evaluation sets, we built a context-less
question answering pipeline, in which a strong reasoning model was asked to answer each question with-
out any provided context. Table 2 shows the performance of o3 in this setting. As expected,
o3 fails on HOTELS, as the dataset consists of newly generated documents, but it surprisingly achieves an
Answer Comparison score of 0.71 on WORLD CUP. We consider the results on HOTELS as evidence
that only a robust pipeline can succeed on this dataset, while the strong performance on WORLD
CUP likely reflects the familiarity of modern LLMs with Wikipedia content.
Table 2: Zero-shot performance of o3 without any provided context.

Dataset        Answer Recall   Answer Comparison
FinanceBench   0.443           0.505
Hotels         0.047           0.049
WorldCup       0.798           0.712

5.5 METRICS
Following prior work on evaluating question answering systems, we adopt the LLM-as-a-judge
paradigm (Zheng et al., 2023). Specifically, to compare the expected answer with the system gen-
erated answer, we define two evaluation metrics: (1) Answer Comparison, where the LLM is
instructed to provide a binary judgment on whether the generated answer is correct given the query
and the expected answer (the prompt is provided in Appendix E); and (2) Answer Recall, where
an LLM-based system decomposes the expected answer into individual claims and computes the
percentage of those claims that are covered in the generated answer.
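A sketch of how the two metrics can be computed with an LLM judge; the prompts are abridged paraphrases (the Answer Comparison prompt appears in full in Appendix E), and call_llm is a placeholder for the GPT-4o judge.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for the judge model (GPT-4o in the paper)

def answer_comparison(query: str, gold_answer: str, judged_answer: str) -> bool:
    """Binary judgment: is the generated answer correct given the gold answer?"""
    verdict = call_llm(f"Query: {query}\nGold answer: {gold_answer}\n"
                       f"Judged answer: {judged_answer}\n"
                       "Is the judged answer correct based only on the gold answer? Reply Yes or No.")
    return verdict.strip().lower().startswith("yes")

def answer_recall(gold_answer: str, judged_answer: str) -> float:
    """Decompose the gold answer into claims and measure how many are covered."""
    claims = [c.strip() for c in call_llm(
        f"List the individual factual claims in this answer, one per line:\n{gold_answer}"
    ).splitlines() if c.strip()]
    if not claims:
        return 0.0
    covered = sum(
        call_llm(f"Claim: {claim}\nAnswer: {judged_answer}\n"
                 "Is the claim supported by the answer? Reply Yes or No.")
        .strip().lower().startswith("yes")
        for claim in claims)
    return covered / len(claims)
```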
6 RESULTS
Table 3 summarizes the results of S-RAG and the baselines on the aggregative question-answering
evaluation sets. Across all datasets, S-RAG consistently outperforms the baselines, even though
those systems employ a stronger reasoning model where possible.
FullCorpus: All datasets exceed o3’s context window, and therefore it cannot process the
full corpus directly (a major difference compared with Wolfson et al. (2025)). As expected,
this baseline fails to achieve strong results on any dataset. HOTELS is relatively small, leading to
reasonable performance, but real-world use cases involve much larger corpora.
V ECTOR RAG & OAI-R ESPONSES : Results for both V ECTOR RAG and OAI-R ESPONSES
are reasonable (∼10-20% behind S-RAG-GoldSchema) when parametric knowledge is available
(F INANCE B ENCH , W ORLD C UP ), however, it falls short on H OTELS (∼50-60% behind S-RAG-
GoldSchema). As discussed in §2, vector-based retrieval suffers from inherent limitations when
considering aggregative questions. This is most prominent with H OTELS , where the generating
model is unable to compensate suboptimal retrieval with parametric knowledge. This also holds for
OAI-R ESPONSES , even though it is able to execute multiple retrieval calls, which exemplifies the
completeness issue (the backbone model cannot tell when to stop the retrieval).
S-RAG-InferredSchema: For simpler documents, such as the generated HOTELS pages or the Wikipedia
pages of WORLD CUP tournaments, the inferred schema is solid, which leads to overall strong performance.
There is some degradation compared with GoldSchema, stemming from failures in the
schema prediction phase, specifically: (i) missing attributes; and (ii) incomplete descriptions, which lead
to standardization issues in the DB. This problem intensifies with complex documents such as those in
FINANCEBENCH, leading to poor performance. For example, we observed that the CapitalExpenditure
attribute was described as "The capital expenditure of the company". As a result, in the record prediction
phase (§3.2.2) two values were recorded as 1, although one of them stands for 1M and the other for 1B,
which makes them unusable at inference time. However, given that manually building the gold schema
via prompting required only a few hours, we regard this as a practical and feasible approach for
real-world applications.
S-RAG-GoldSchema: The best results across datasets are achieved when providing the gold schema.
The remaining imperfect scores can be attributed to imperfect text-to-SQL conversion, standardization
issues in the ingestion phase, and wrong record predictions.
Table 3: Results of different systems on the aggregative evaluation sets.

Dataset        System          Ingestion Type    Answer Recall   Answer Comparison
Hotels         VectorRAG       —                 0.352           0.331
Hotels         FullCorpus      —                 0.478           0.473
Hotels         OAI-Responses   —                 0.253           0.184
Hotels         S-RAG           InferredSchema    0.500           0.518
Hotels         S-RAG           GoldSchema        0.845           0.899
World Cup      VectorRAG       —                 0.735           0.676
World Cup      FullCorpus      —                 0.516           0.441
World Cup      OAI-Responses   —                 0.715           0.566
World Cup      S-RAG           InferredSchema    0.766           0.769
World Cup      S-RAG           GoldSchema        0.909           0.856
FinanceBench   VectorRAG       —                 0.650           0.598
FinanceBench   FullCorpus      —                 0.100           0.040
FinanceBench   OAI-Responses   —                 0.670           0.593
FinanceBench   S-RAG           InferredSchema    0.230           0.234
FinanceBench   S-RAG           GoldSchema        0.750           0.725
Finally, Table 4 shows the performance of Hybrid-S-RAG with the gold schema on the full
FINANCEBENCH evaluation set, including aggregative and non-aggregative queries. The superior
results of Hybrid-S-RAG demonstrate that S-RAG can also perform well on general-purpose datasets.
Table 4: Performance on the full FinanceBench evaluation set.

System          Answer Recall   Answer Comparison
VectorRAG       0.598           0.677
OAI-Responses   0.529           0.553
Hybrid-S-RAG    0.667           0.702
Qualitative Examples. Table 5 presents the answers generated by different systems for the natural
aggregative query, ‘What is the average number of total goals scored across all World Cups in
this dataset?’, from the WORLD CUP dataset. Both VectorRAG and FullCorpus produce the
wrong answer: despite a reasonable reasoning chain, the incomplete context results in an incorrect
answer. In contrast, S-RAG delivers a concise and accurate answer, demonstrating its performance
on aggregative queries that require reasoning over a large set of evidence across multiple documents.
7 RELATED WORK

7.1 RAG SYSTEMS
Modern RAG systems typically address the Open-Book Question-Answering task by retrieving the
text units from the corpus that are most relevant for answering the query according to some rel-
evance score (Lewis et al., 2020; Ram et al., 2023). At the ingestion phase, a standard system
splits each document independently into a set of chunks and computes a vector representation for
each chunk. These representations are obtained either through sparse embeddings (Robertson et al.,
2009), which represent text as high-dimensional and interpretable vectors based on explicit lexi-
cal features, or dense embeddings (Muennighoff et al., 2022; Wang et al., 2022), which encode text
into low-dimensional continuous vectors that capture semantic similarity, enabling effective retrieval
even when queries and documents share little lexical overlap. The retrieval phase is typically carried
out by scoring the relevance of each chunk to the query, using their vector representations, and op-
tionally applying post-retrieval re-ranking on the top scoring chunks, utilizing a model that jointly
encodes the chunk and the query.
In addition to domain-agnostic approaches, corpus-specific training has also been explored, for ex-
ample by Wang et al. (2025), though such methods suffer from limited scalability. Among structure-
8
9. Preprint
Table 5: An example from W ORLD C UP dataset with answers generated by different systems.
Question: What is the average number of total goals scored across all World Cups in this dataset?
Gold Answer: The average number total goals scored across all World Cups is approx. 123.64.
System Generated Answer Explanation
V ECTOR RAG Adding the total-goal for every World
Cup the context lists: 1950 – 88, . . .
,2022 – 172; Total goals = 1,739
tournaments = 13 Average = 1,739 ÷
13 ≈ 133.8 The context includes information on
only 13 tournaments; the 1954 World
Cup, for instance, is missing.
F ULL C ORPUS Across the eight World Cup tourna-
ments cited:•1934: 70 goals . . . • To-
tal goals = 70 + 126 + . . . + 171 =
954 Average = 954 ÷ 8 ≈ 119.25 The context includes only eight tour-
nament documents, as context length
of the LLM is limited.
S-RAG Average = 123.64 S-RAG produces a short and correct
answer based on the information from
the twenty two tournaments.
based methods, Edge et al. (2024) propose constructing a knowledge graph at ingestion time to cap-
ture information essential for answering queries. However, their approach is primarily designed for
global sense-making questions and is not built to handle aggregative queries (as it does not enforce a
recurring structure in the graph, which is the cornerstone of such queries). Another noteworthy con-
tribution is by Arora et al. (2023), who propose building a structured representation of an unstructured
corpus. Nevertheless, their system was not evaluated in the context of RAG performance.
7.2 OPEN-BOOK QA DATASETS
Most existing OBQA datasets include simple questions for which the answers are explicitly con-
tained within an individual text segment of the corpus, or require reasoning over no more than a
handful of such evidence pieces (Nguyen et al., 2016; Abujabal et al., 2018; Yang et al., 2018;
Trivedi et al., 2022; Malaviya et al., 2023; Tang & Yang, 2024; Cohen et al., 2025). This ten-
dency arises because annotating questions and answers is considerably easier when focusing on a small
number of text units. Others construct questions that require the integration of a larger number of
evidence units (Wolfson et al., 2025; Amouyal et al., 2023); however, these datasets do not focus on
large-scale retrieval, and are based on Wikipedia, a source which LLMs are well exposed to during
pretraining. This underscores the need for new datasets that require multi-document retrieval over
unseen corpora, while also involving diverse reasoning skills such as numerical aggregation.
8 CONCLUSIONS
In this work, we highlight the importance of aggregative questions, which require retrieving and
reasoning over information distributed across a large set of documents. To foster further research on
this problem, we introduce two new datasets of aggregative questions: WORLD CUP and HOTELS. To
address the challenges such datasets pose, we propose S-RAG, a system that transforms unstruc-
tured corpora into a structured representation at ingestion time and translates questions into formal
queries at inference time. This design addresses the limitations of classic RAG systems when an-
swering aggregative queries, enabling effective reasoning over dispersed evidence.
Our work has a few limitations: First, our approach is limited to corpora that can be represented by
a single schema, whereas in the real world a corpus may contain documents derived from multiple
schemas. In addition, the schemas underlying the datasets we experiment with include only simple
attributes, and we encourage future research on corpora that incorporate more complex structures.
In our experiments, S-RAG achieves strong results on the newly introduced datasets and on the
public FINANCEBENCH benchmark, even compared to top-performing RAG methods and advanced
reasoning models. We further show that the schema prediction step plays a critical role in end-to-end
performance, highlighting an important direction for future research.
To conclude, our work puts emphasis on aggregative queries, a crucial and realistic blind spot of current
RAG systems, and argues that classical unstructured methods alone are ill-suited to address them.
By introducing new datasets tailored to evaluate such queries and designing a structured solution,
we hope to pave the way toward the next generation of RAG systems.
ACKNOWLEDGMENTS
We thank our colleagues Raz Alon and Noam Rozen from AI21 Labs for developing key algorithmic
components used in this research. We also thank Inbal Magar and Dor Muhlgay for reading the draft
and providing valuable feedback.
REFERENCES
Abdalghani Abujabal, Rishiraj Saha Roy, Mohamed Yahya, and Gerhard Weikum. Comqa: A
community-sourced dataset for complex factoid question answering with paraphrase clusters.
arXiv preprint arXiv:1809.09528, 2018.
Samuel Amouyal, Tomer Wolfson, Ohad Rubin, Ori Yoran, Jonathan Herzig, and Jonathan Berant.
Qampari: A benchmark for open-domain questions with many answers. In Proceedings of the
Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pp. 97–110,
2023.
Anthropic. Contextual retrieval, 2024. URL https://www.anthropic.com/news/contextual-retrieval.
Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trum-
mer, and Christopher Ré. Language models enable simple systems for generating structured views
of heterogeneous data lakes. arXiv preprint arXiv:2304.09433, 2023.
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to
retrieve, generate, and critique through self-reflection. 2024.
Dvir Cohen, Lin Burg, Sviatoslav Pykhnivskyi, Hagit Gur, Stanislav Kovynov, Olga Atzmon, and
Gilad Barkan. Wixqa: A multi-dataset benchmark for enterprise retrieval-augmented generation.
arXiv preprint arXiv:2505.08643, 2025.
Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt,
Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A
graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024.
Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and
Qing Li. A survey on rag meeting llms: Towards retrieval-augmented large language models. In
Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pp.
6491–6501, 2024.
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun,
Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A
survey. arXiv preprint arXiv:2312.10997, 2(1), 2023.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-
augmented language model pre-training. ArXiv, abs/2002.08909, 2020. URL https://api.semanticscholar.org/CorpusID:211204736.
Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vid-
gen. Financebench: A new benchmark for financial question answering. arXiv preprint
arXiv:2311.11944, 2023.
Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane
Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Atlas: Few-shot learning
with retrieval augmented language models. Journal of Machine Learning Research, 24(251):
1–43, 2023.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal,
Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented gener-
ation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:
9459–9474, 2020.
Chaitanya Malaviya, Peter Shaw, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Quest:
A retrieval dataset of entity-seeking queries with implicit set operations. arXiv preprint
arXiv:2305.11694, 2023.
Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. Mteb: Massive text embed-
ding benchmark. arXiv preprint arXiv:2210.07316, 2022.
Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and
Li Deng. Ms marco: A human-generated machine reading comprehension dataset. 2016.
OpenAI. New tools for building agents, 2025. URL https://openai.com/index/new-tools-for-building-agents/.
Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and
Yoav Shoham. In-context retrieval-augmented language models. Transactions of the Association
for Computational Linguistics, 11:1316–1331, 2023.
Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: Bm25 and be-
yond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009.
Shamane Siriwardhana, Rivindu Weerasekera, Elliott Wen, Tharindu Kaluarachchi, Rajib Rana, and
Suranga Nanayakkara. Improving the domain adaptation of retrieval augmented generation (rag)
models for open domain question answering. Transactions of the Association for Computational
Linguistics, 11:1–17, 2023.
Yixuan Tang and Yi Yang. Multihop-rag: Benchmarking retrieval-augmented generation for multi-
hop queries. arXiv preprint arXiv:2401.15391, 2024.
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop
questions via single-hop question composition. Transactions of the Association for Computational
Linguistics, 10:539–554, 2022.
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Ma-
jumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv
preprint arXiv:2212.03533, 2022.
Xinyu Wang, Jijun Chi, Zhenghan Tai, Tung Sum Thomas Kwok, Muzhi Li, Zhuhong Li,
Hailin He, Yuchen Hua, Peng Lu, Suyuchen Wang, Yihong Wu, Jerry Huang, Jingrui Tian,
and Ling Zhou. Finsage: A multi-aspect rag system for financial filings question answer-
ing. ArXiv, abs/2504.14493, 2025. URL https://api.semanticscholar.org/CorpusID:277955764.
Orion Weller, Michael Boratko, Iftekhar Naim, and Jinhyuk Lee. On the theoretical limitations of
embedding-based retrieval. arXiv preprint arXiv:2508.21038, 2025.
Tomer Wolfson, Harsh Trivedi, Mor Geva, Yoav Goldberg, Dan Roth, Tushar Khot, Ashish Sab-
harwal, and Reut Tsarfaty. Monaco: More natural and complex questions for reasoning across
dozens of documents. arXiv preprint arXiv:2508.11133, 2025.
Peng Xu, Wei Ping, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian,
Evelina Bakhturina, Mohammad Shoeybi, and Bryan Catanzaro. Retrieval meets long context
large language models. arXiv preprint arXiv:2310.03025, 2023.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov,
and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question
answering. arXiv preprint arXiv:1809.09600, 2018.
Ori Yoran, Tomer Wolfson, Ori Ram, and Jonathan Berant. Making retrieval-augmented language
models robust to irrelevant context. arXiv preprint arXiv:2310.01558, 2023.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang,
Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and
chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023.
A VECTORRAG IMPLEMENTATION DETAILS
The VectorRAG implementation is as follows:

At ingestion time, each document is split into non-overlapping chunks of 500 tokens, and the
Qwen2-7B-instruct embedder⁵ is applied to obtain dense representations for each chunk. We
store each chunk along with its embedded representation in an Elasticsearch index.

At inference time, given a query q, we use the same embedder to encode the query and retrieve the
top 40 chunks with the highest similarity scores. The retrieved chunks are concatenated into a single
context, with each chunk separated by a special delimiter token. We do not incorporate a sparse
retriever (e.g., BM25) or re-ranking modules, as preliminary experiments showed that they did not
yield performance improvements across datasets.
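For reference, a condensed sketch of this pipeline; an in-memory cosine-similarity index replaces Elasticsearch, embed is a placeholder for the dense embedder, and the whitespace chunker only approximates the 500-token splitter.

```python
import numpy as np

def embed(texts: list) -> np.ndarray:
    """Placeholder for the dense embedder; returns one row vector per input text."""
    raise NotImplementedError

def chunk(document: str, size: int = 500) -> list:
    """Split a document into non-overlapping chunks of roughly `size` tokens."""
    tokens = document.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

class VectorRAGIndex:
    """In-memory stand-in for the Elasticsearch index used in our experiments."""

    def __init__(self):
        self.chunks, self.vectors = [], None

    def ingest(self, documents: list) -> None:
        self.chunks = [c for doc in documents for c in chunk(doc)]
        self.vectors = embed(self.chunks)

    def retrieve(self, query: str, k: int = 40) -> str:
        q = embed([query])[0]
        sims = self.vectors @ q / (np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q) + 1e-9)
        top = np.argsort(-sims)[:k]
        return "\n<CHUNK>\n".join(self.chunks[i] for i in top)  # delimiter token is illustrative
```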
B SCHEMA PREDICTION TECHNICAL DETAILS
We ran the iterative algorithm for four iterations, employing GPT-4o as the underlying LLM. The
prompts we used in the schema generation phase are:
Schema generation prompt - first iteration
Task: Extract a single JSON schema from the provided
documents. I’ll provide you with a set of documents.
Your task is to analyze these documents and identify recurring
concepts. Then, build a single JSON schema that exhaustively
captures *all* these concepts across all documents.
Focus specifically on identifying patterns that
appear consistently across multiple documents.
Present your response as a complete JSON schema with the
following structure:
‘‘‘json
{
"title": "YourSchemaName",
"type": "object",
"properties": {
"fieldName": {
"type": "string",
"description": "Detailed description of the field,
at least two sentences.",
"examples": ["example1", "example2"]
}
},
"required": ["fieldName"]
}
When building the schema:
- Avoid object-type fields with additional nested properties
when possible.
- Avoid list. Instead use boolean attribute for each of the
potential value.
- Make sure to capture all recurring concepts
- Relevant concepts may include locations, dates, numbers,
strings, etc.
- Relevant concepts should not be lengthy strings (e.g. a
"description" field is not a good choice), you should rather
decompose into separate fields if possible.
⁵ https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct
Schema generation prompt - second iteration and on
Task: Refine an existing JSON schema based on set of questions
and documents analysis
I’ll provide you with an existing JSON schema, set of questions,
and a set of documents. The JSON schemas of different documents
will be converted into an SQL table, that will be used as knowledge
source to answer questions that are similar to the provided questions.
Your task is to analyze what attributes from the documents can
provide answers to questions similar to the provided questions,
and refine the existing schema.
Make sure that the attribute value can be extracted (and not
inferred) from each of the documents.
Provide the final refined JSON schema implementation:
‘‘‘json
{
"title": "RefinedSchemaName",
"type": "object",
"properties": {
"propertyName": {
"type": "string",
"description": "Detailed description of the property,
at least two sentences.",
"examples": ["example1", "example2"]
}
},
"required": ["propertyName"]
}
In addition for each attribute and document provide the value
of the attribute in the document.
When evaluating the existing schema:
- Make sure that every property can be extracted from each
of the documents
- Modify properties where the name, type, or definition could
be improved
- Add new properties for concepts that can help answer the
questions. E.g.: if a question is about "the most common
location", you should add a property for "location" if it
doesn’t exist. Make sure that the property value can be
extracted from each of the documents.
- Add new properties for recurring concepts not captured in the
existing schema
- Add new properties for trivial concepts that are missing in
the existing schema. E.g: If the schema represents a house for
sale, it must include the seller’s name.
- Use appropriate JSON Schema types (string, number, integer,
boolean, array, etc.)
- Provide descriptions and examples for each property
- Avoid nested object properties
- Fields should not be lengthy strings (e.g. a "description"
field is not a good choice), you should rather decompose into
separate fields if possible.
- Avoid assigning values to the attributes in the schema. You
should only provide the schema itself, without any values.
For each property decision, provide a clear rationale based on
related question or patterns observed in the documents. Your
goal is to create a refined schema that better captures the
recurring patterns that can be used to answer the questions
while minimizing unnecessary changes to the existing structure.
C EXAMPLE HOTELS DOCUMENT

Example document from the HOTELS dataset:
Figure 3: A randomly selected document from the HOTELS dataset
D ATTRIBUTE STATISTICS
After applying record prediction to all documents in the corpus, we compute attribute-level statis-
tics. For numeric attributes, we calculate the mean, maximum, and minimum values; for string and
boolean attributes, we include the set of unique values predicted by the LLM. For all attributes,
regardless of type, we also include the number of non-zero and non-null values.
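A minimal sketch of this computation, assuming records are dictionaries keyed by attribute name; exact edge-case handling in S-RAG may differ.

```python
from numbers import Number

def attribute_statistics(records: list, schema: dict) -> dict:
    """Compute the per-attribute statistics that are later passed to the text-to-SQL step."""
    stats = {}
    for name, spec in schema["properties"].items():
        values = [r.get(name) for r in records]
        # Count of values that are neither null nor zero/empty.
        entry = {"non_zero_non_null": sum(1 for v in values if v not in (None, 0, ""))}
        if spec["type"] in ("integer", "number"):
            nums = [v for v in values if isinstance(v, Number) and not isinstance(v, bool)]
            if nums:
                entry.update(mean=sum(nums) / len(nums), min=min(nums), max=max(nums))
        else:  # string and boolean attributes keep their set of unique predicted values
            entry["unique_values"] = sorted({str(v) for v in values if v is not None})
        stats[name] = entry
    return stats
```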
E LLM JUDGE PROMPT
For both metrics, we employ GPT-4o as the underlying judging model.
Answer Comparison
<instructions>
You are given a query, a gold answer, and a judged answer.
Decide if the judged answer is a correct answer for the query, based
on the gold answer.
Do not use any external or prior knowledge. Only use the gold answer.
Answer Yes if the judged answer is a correct answer
for the query, and No otherwise.
<query>
{query}
</query>
<gold_answer>
{gold_answer}
</gold_answer>
<judged_answer>
{judged_answer}
</judged_answer>
</instructions>
F LLM USE
In addition to the uses of LLMs described throughout the paper—for dataset creation, ingestion, and
inference—we also employed ChatGPT to help identify mistakes (such as grammar and typos) and
to improve the phrasing of paragraphs we wrote.