Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Qizheng Zhang 1 ∗ Changran Hu 2 ∗ Shubhangi Upasani 2 Boyuan Ma 2 Fenglu Hong 2
Vamsidhar Kamanuru 2 Jay Rainton 2 Chen Wu 2 Mengmeng Ji 2 Hanchen Li 3
Urmish Thakker 2 James Zou 1 Kunle Olukotun 1
1 Stanford University   2 SambaNova Systems, Inc.   3 UC Berkeley
∗ Equal contribution
# qizhengz@stanford.edu, changran.hu@sambanovasystems.com
Abstract
Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on
context adaptation—modifying inputs with instructions, strategies, or evidence, rather than weight updates.
Prior approaches improve usability but often suffer from brevity bias, which drops domain insights in favor of
concise summaries, and from context collapse, where iterative rewriting erodes details over time. Building on
the adaptive memory introduced by Dynamic Cheatsheet, we introduce ACE (Agentic Context Engineering),
a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies
through a modular process of generation, reflection, and curation. ACE prevents collapse with structured,
incremental updates that preserve detailed knowledge and scale with long-context models. Across agent
and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g.,
agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance, while
significantly reducing adaptation latency and rollout cost. Notably, ACE can adapt effectively without
labeled supervision, instead leveraging natural execution feedback. On the AppWorld leaderboard,
ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder
test-challenge split, despite using a smaller open-source model. These results show that comprehensive,
evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead.
1 Introduction
[Figure 1: bar charts comparing Base LLM, ICL, GEPA, DC, and ACE on three panels: Agent (AppWorld), Domain Knowledge (FiNER), and Numerical Reasoning (Formula).]
Figure 1: Overall Performance Results. Our proposed framework, ACE, consistently outperforms strong
baselines across agent and domain-specific reasoning tasks.
Modern AI applications based on large language models (LLMs), such as LLM agents [49, 52] and compound
AI systems [55], increasingly depend on context adaptation. Instead of modifying model weights, context
adaptation improves performance after model training by incorporating clarified instructions, structured
reasoning steps, or domain-specific input formats directly into the model’s inputs. Contexts underpin
many AI system components, including system prompts that guide downstream tasks [4, 36], memory that
carries past facts and experiences [41, 48], and factual evidence that reduces hallucination and supplements
knowledge [6].
Adapting through contexts rather than weights offers several key advantages. Contexts are interpretable and
explainable for users and developers [45, 47], allow rapid integration of new knowledge at runtime [7, 27],
and can be shared across models or modules in a compound system [23]. Meanwhile, advances in long-
context LLMs [39] and context-efficient inference such as KV cache reuse [17, 51] are making context-based
approaches increasingly practical for deployment. As a result, context adaptation is emerging as a central
paradigm for building capable, scalable, and self-improving AI systems.
Despite this progress, existing approaches to context adaptation face two key limitations. First, a brevity
bias: many prompt optimizers prioritize concise, broadly applicable instructions over comprehensive
accumulation. For example, GEPA [4] highlights brevity as a strength, but such abstraction can omit
domain-specific heuristics, tool-use guidelines, or common failure modes that matter in practice [16]. This
objective aligns with validation metrics in some settings, but often fails to capture the detailed strategies
required by agents and knowledge-intensive applications. Second, context collapse: methods that rely on
monolithic rewriting by an LLM often degrade into shorter, less informative summaries over time, causing
sharp performance declines (Figure 2). In domains such as interactive agents [38, 43, 57], domain-specific
programming [53, 56], and financial or legal analysis [18, 33, 44], strong performance depends on retaining
detailed, task-specific knowledge rather than compressing it away.
As applications such as agents and knowledge-intensive reasoning demand greater reliability, recent work
has shifted toward saturating contexts with abundant, potentially useful information [11, 12, 22], enabled by
advances in long-context LLMs [34, 39]. We argue that contexts should function not as concise summaries,
but as comprehensive, evolving playbooks—detailed, inclusive, and rich with domain insights. Unlike
humans, who often benefit from concise generalization, LLMs are more effective when provided with long,
detailed contexts and can distill relevance autonomously [22, 31, 41]. Thus, instead of compressing away
domain-specific heuristics and tactics, contexts should preserve them, allowing the model to decide what
matters at inference time.
To address these limitations, we introduce ACE (Agentic Context Engineering), a framework for compre-
hensive context adaptation in both offline settings (e.g., system prompt optimization) and online settings
(e.g., test-time memory adaptation). Rather than compressing contexts into distilled summaries, ACE treats
them as evolving playbooks that accumulate and organize strategies over time. Building on the agentic
architecture of Dynamic Cheatsheet [41], ACE incorporates a modular workflow of generation, reflection,
and curation, while adding structured, incremental updates guided by a grow-and-refine principle. This
design preserves detailed, domain-specific knowledge, prevents context collapse, and yields contexts that
remain comprehensive and scalable throughout adaptation.
We evaluate ACE on two categories of LLM applications that most benefit from comprehensive, evolving
contexts: (1) agents [43], which require multi-turn reasoning, tool use, and environment interaction, where
accumulated strategies can be reused across episodes; and (2) domain-specific benchmarks, which demand
specialized tactics and knowledge, where we focus on financial analysis [33, 44]. Our key findings are:
• ACE consistently outperforms strong baselines, yielding average gains of 10.6% on agents and 8.6% on
domain-specific benchmarks, across both offline and online adaptation settings.
• ACE is able to construct effective contexts without labeled supervision, instead leveraging execution
feedback and environment signals—key ingredients for self-improving LLMs and agents.
• On the AppWorld benchmark leaderboard [5], ACE matches the top-ranked production-level agent IBM-
CUGA [35] (powered by GPT-4.1) on average and surpasses it on the harder test-challenge split, while
using a smaller open-source model (DeepSeek-V3.1).
• ACE requires significantly fewer rollouts and lower dollar costs, and achieves 86.9% lower adaptation
latency (on average) than existing adaptive methods, demonstrating that scalable self-improvement can be
achieved with both higher accuracy and lower overhead.
2 Background and Motivation
2.1 Context Adaptation
Context adaptation (or context engineering) refers to methods that improve model behavior by constructing
or modifying inputs to an LLM, rather than altering its weights. The current state of the art leverages natural
language feedback [4, 40, 54]. In this paradigm, a language model inspects the current context along with
signals such as execution traces, reasoning steps, or validation results, and generates natural language
feedback on how the context should be revised. This feedback is then incorporated into the context, enabling
iterative adaptation. Representative methods include Reflexion [40], which reflects on failures to improve
agent planning; TextGrad [54], which optimizes prompts via gradient-like textual feedback; GEPA [4], which
refines prompts iteratively based on execution traces and achieves strong performance, even surpassing
reinforcement learning approaches in some settings; and Dynamic Cheatsheet [41], which constructs an
external memory that accumulates strategies and lessons from past successes and failures during inference.
These natural language feedback methods represent a major advance, offering flexible and interpretable
signals for improving LLM systems beyond weight updates.
2.2 Limitations of Existing Context Adaptation Methods
The Brevity Bias. A recurring limitation of context adaptation methods is brevity bias: the tendency of
optimization to collapse toward short, generic prompts. Gao et al. [16] document this effect in prompt
optimization for test generation, where iterative methods repeatedly produced near-identical instructions
(e.g., "Create unit tests to ensure methods behave as expected"), sacrificing diversity and omitting domain-
specific detail. This convergence not only narrows the search space but also propagates recurring errors
across iterations, since optimized prompts often inherit the same faults as their seeds. More broadly, such
bias undermines performance in domains that demand detailed, context-rich guidance—such as multi-step
agents, program synthesis, or knowledge-intensive reasoning—where success hinges on accumulating rather
than compressing task-specific insights.
[Figure 2 annotations: before collapse, 18,282 tokens, accuracy 66.7; after collapse, 122 tokens, accuracy 57.1; accuracy without context: 63.7.]
Figure 2: Context Collapse. Monolithic rewriting of context by an LLM can collapse it into shorter, less
informative summaries, leading to sharp performance drops.
Context Collapse. In a case study on the AppWorld benchmark [43], we observe a phenomenon we
call context collapse, which arises when an LLM is tasked with fully rewriting the accumulated context at
each adaptation step. As the context grows large, the model tends to compress it into much shorter, less
informative summaries, causing a dramatic loss of information. For instance, at step 60 the context contained
18,282 tokens and achieved an accuracy of 66.7, but at the very next step it collapsed to just 122 tokens,
with accuracy dropping to 57.1—worse than the baseline accuracy of 63.7 without adaptation. While we
highlight this through Dynamic Cheatsheet [41], the issue is not specific to that method; rather, it reflects a
fundamental risk of end-to-end context rewriting with LLMs, where accumulated knowledge can be abruptly
erased instead of preserved.
Figure 3: Example ACE-Generated Context on the AppWorld Benchmark (partially shown). ACE-generated
contexts contain detailed, domain-specific insights along with tools and code that are readily usable, serving
as a comprehensive playbook for LLM applications.
3 Agentic Context Engineering (ACE)
We present ACE (Agentic Context Engineering), a framework for scalable and efficient context adaptation
in both offline (e.g., system prompt optimization) and online (e.g., test-time memory adaptation) scenarios.
Instead of condensing knowledge into terse summaries or static instructions, ACE treats contexts as evolving
playbooks that continuously accumulate, refine, and organize strategies over time. Building on the agentic
design of Dynamic Cheatsheet [41], ACE introduces a structured division of labor across three roles (Figure
4): the Generator, which produces reasoning trajectories; the Reflector, which distills concrete insights from
successes and errors; and the Curator, which integrates these insights into structured context updates. This
mirrors how humans learn—experimenting, reflecting, and consolidating—while avoiding the bottleneck of
overloading a single model with all responsibilities.
To address the limitations of prior methods discussed in §2.2—notably brevity bias and context collapse—ACE
introduces three key innovations: (1) a dedicated Reflector that separates evaluation and insight extraction
from curation, improving context quality and downstream performance (§4.5); (2) incremental delta updates
(§3.1) that replace costly monolithic rewrites with localized edits, reducing both latency and compute cost
(§4.6); and (3) a grow-and-refine mechanism (§3.2) that balances steady context expansion with redundancy
control.
[Figure 4 schematic: a query is handled by the Generator, which produces a trajectory conditioned on the context playbook; the Reflector iteratively refines insights from the trajectory; the Curator converts them into delta context items that update the playbook.]
Figure 4: The ACE Framework. Inspired by Dynamic Cheatsheet, ACE adopts an agentic architecture with
three specialized components: a Generator, a Reflector, and a Curator.
As shown in Figure 4, the workflow begins with the Generator producing reasoning trajectories for new
queries, which surface both effective strategies and recurring pitfalls. The Reflector critiques these traces
to extract lessons, optionally refining them across multiple iterations. The Curator then synthesizes these
lessons into compact delta entries, which are merged deterministically into the existing context by lightweight,
non-LLM logic. Because updates are itemized and localized, multiple deltas can be merged in parallel,
enabling batched adaptation at scale. ACE further supports multi-epoch adaptation, where the same queries
are revisited to progressively strengthen the context.
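The sketch below illustrates this division of labor in simplified Python. It is a minimal illustration and not the released implementation: the `llm` callable, the prompt strings, and the dict shape of a bullet are hypothetical placeholders.

```python
# Minimal sketch of one ACE adaptation step (Generator -> Reflector -> Curator).
# `llm` is a hypothetical callable wrapping any chat-completion API; prompts and
# data shapes are illustrative, not the paper's released prompts.
from typing import Callable, Dict, List

LLM = Callable[[str], str]

def ace_step(query: str, playbook: List[Dict], llm: LLM) -> List[Dict]:
    """Return candidate delta bullets distilled from one query."""
    context = "\n".join(f"[{b['id']}] {b['content']}" for b in playbook)

    # Generator: produce a reasoning trajectory conditioned on the current playbook.
    trajectory = llm(f"Context:\n{context}\n\nTask: {query}\nSolve step by step.")

    # Reflector: distill concrete lessons from the trajectory and its outcome.
    lessons = llm(f"Trajectory:\n{trajectory}\n\nExtract reusable strategies and pitfalls.")

    # Curator: rewrite lessons as itemized bullets (one candidate bullet per line).
    raw = llm(f"Lessons:\n{lessons}\n\nRewrite as short, self-contained bullet items.")
    return [{"id": None, "content": line.strip(), "helpful": 0, "harmful": 0}
            for line in raw.splitlines() if line.strip()]
```

The deterministic merge of the returned delta into the playbook is sketched in §3.1.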
3.1 Incremental Delta Updates
A core design principle of ACE is to represent context as a collection of structured, itemized bullets, rather
than a single monolithic prompt. The concept of a bullet is similar to the concept of a memory entry in LLM
memory frameworks like Dynamic Cheatsheet [41] and A-MEM [48], but builds on top of that and consists
of (1) metadata, including a unique identifier and counters tracking how often it was marked helpful or
harmful; and (2) content, capturing a small unit such as a reusable strategy, domain concept, or common
failure mode. When solving new problems, the Generator highlights which bullets were useful or misleading,
providing feedback that guides the Reflector in proposing corrective updates.
This itemized design enables three key properties: (1) localization, so only the relevant bullets are updated;
(2) fine-grained retrieval, so the Generator can focus on the most pertinent knowledge; and (3) incremental
adaptation, allowing efficient merging, pruning, and de-duplication during inference.
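As a concrete illustration, a bullet could be represented as a small record with metadata and content; the field names below are our own choices and are not taken from the released code.

```python
# Hypothetical schema for a single context bullet; field names are illustrative.
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)

@dataclass
class Bullet:
    content: str                # one reusable strategy, domain concept, or failure mode
    id: int = field(default_factory=lambda: next(_ids))
    helpful: int = 0            # times the Generator marked this bullet useful
    harmful: int = 0            # times the Generator marked it misleading

    def mark(self, useful: bool) -> None:
        """Record Generator feedback that later guides the Reflector."""
        if useful:
            self.helpful += 1
        else:
            self.harmful += 1
```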
Rather than regenerating contexts in full, ACE incrementally produces compact delta contexts: small sets of
candidate bullets distilled by the Reflector and integrated by the Curator. This avoids the computational
cost and latency of full rewrites, while ensuring that past knowledge is preserved and new insights are
steadily appended. As contexts grow, this approach provides the scalability needed for long-horizon or
domain-intensive applications.
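Using the same illustrative dict-shaped bullets as the earlier loop sketch, the deterministic merge could look like the following; no LLM call is involved, which is what makes deltas cheap to apply and easy to parallelize.

```python
# Sketch of the non-LLM merge step: new bullets are appended with fresh ids,
# existing bullets are updated in place. Dict fields follow the earlier sketch.
from typing import Dict, List

def merge_delta(playbook: List[Dict], delta: List[Dict]) -> List[Dict]:
    by_id = {b["id"]: b for b in playbook if b["id"] is not None}
    next_id = max(by_id, default=0) + 1
    for item in delta:
        if item["id"] in by_id:                      # update: bump counters in place
            by_id[item["id"]]["helpful"] += item.get("helpful", 0)
            by_id[item["id"]]["harmful"] += item.get("harmful", 0)
        else:                                        # append: assign a fresh identifier
            new_item = dict(item, id=next_id)
            next_id += 1
            playbook.append(new_item)
            by_id[new_item["id"]] = new_item
    return playbook
```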
3.2 Grow-and-Refine
Beyond incremental growth, ACE ensures that contexts remain compact and relevant through periodic or
lazy refinement. In grow-and-refine, bullets with new identifiers are appended, while existing bullets are
updated in place (e.g., incrementing counters). A de-duplication step then prunes redundancy by comparing
bullets via semantic embeddings. This refinement can be performed proactively (after each delta) or lazily
(only when the context window is exceeded), depending on application requirements for latency and
accuracy.
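The de-duplication step can be approximated with any sentence-embedding model. The sketch below keeps the first occurrence of each near-duplicate bullet; the `embed` function and the 0.9 similarity threshold are assumptions, not values from the paper.

```python
# Sketch of embedding-based de-duplication for grow-and-refine.
import math
from typing import Callable, Dict, List

def _cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def dedup(playbook: List[Dict], embed: Callable[[str], List[float]],
          threshold: float = 0.9) -> List[Dict]:
    kept, vecs = [], []
    for bullet in playbook:
        v = embed(bullet["content"])
        if all(_cosine(v, u) < threshold for u in vecs):   # not a near-duplicate
            kept.append(bullet)
            vecs.append(v)
    return kept
```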
Together, incremental updates and grow-and-refine maintain contexts that expand adaptively, remain
interpretable, and avoid the potential variance introduced by monolithic context rewriting.
4 Results
Our evaluation of ACE shows that:
• Enabling High-Performance, Self-Improving Agents. ACE enables agents to self-improve by dynamically
refining their input context. It boosts accuracy on the AppWorld benchmark by up to 17.1% by learning to
engineer better contexts from execution feedback alone, without needing ground-truth labels. This context-
driven improvement allows a smaller, open-source model to match the performance of the top-ranked
proprietary agent on the leaderboard. (§4.3)
• Large Gains on Domain-Specific Benchmarks. On complex financial reasoning benchmarks, ACE delivers
an average performance gain of 8.6% over strong baselines by constructing comprehensive playbooks
with domain-specific concepts and insights. (§4.4)
• Effective by Design. Ablation studies confirm our design choices are key to success, with components
like the Reflector and multi-epoch refinement each contributing substantial performance gains. (§4.5)
• Lower Cost and Adaptation Latency. ACE achieves these gains efficiently, reducing adaptation latency by
86.9% on average, while requiring fewer rollouts and lower token dollar costs. (§4.6)
4.1 Tasks and Datasets
We evaluate ACE on two categories of LLM applications that benefit most from a comprehensive and
evolving context: (1) agent benchmarks, which require multi-turn reasoning, tool use, and environment
interaction, where agents can accumulate and reuse strategies across episodes and environments; and (2)
domain-specific benchmarks, which demand mastery of specialized concepts and tactics, where we focus on
financial analysis as a case study.
• LLM Agent: AppWorld [43] is a suite of autonomous agent tasks involving API understanding, code
generation, and environment interaction. It provides a realistic execution environment with common
applications and APIs (e.g., email, file system) and tasks of two difficulty levels (normal and challenge). A
public leaderboard [5] tracks performance, where, at the time of submission, the best system achieved only
60.3% average accuracy, highlighting the benchmark’s difficulty and realism.
• Financial Analysis: FiNER [33] and Formula [44] test LLMs on financial reasoning tasks that rely on
the eXtensible Business Reporting Language (XBRL). FiNER requires labeling tokens in XBRL financial
documents with one of 139 fine-grained entity types, a key step for financial information extraction in
regulated domains. Formula focuses on extracting values from structured XBRL filings and performing
computations to answer financial queries, i.e., numerical reasoning.
Evaluation Metrics. For AppWorld, we follow the official benchmark protocol and report Task Goal
Completion (TGC) and Scenario Goal Completion (SGC) on both the test-normal and test-challenge splits. For
FiNER and Formula, we follow the original setup and report accuracy, measured as the proportion of
predicted answers that exactly match the ground truth.
All datasets follow the original train/validation/test splits. For offline context adaptation, methods are
optimized on the training split and evaluated on the test split with pass@1 accuracy. For online context
adaptation, methods are evaluated sequentially on the test split: for each sample, the model first predicts
with the current context, then updates its context based on that sample. The same shuffled test split is used
across all methods.
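In sketch form, the online protocol interleaves prediction and context update over the shuffled test split; `predict` and `update_context` stand in for whichever method is being evaluated, and exact match is the accuracy criterion.

```python
# Sketch of the online adaptation protocol: predict with the current context
# first, then update the context from that sample. `predict` and
# `update_context` are placeholders for the method under evaluation.
from typing import Callable, List, Tuple

def run_online(samples: List[Tuple[str, str]],
               predict: Callable[[str, str], str],
               update_context: Callable[[str, str, str], str],
               context: str = "") -> float:
    correct = 0
    for query, answer in samples:            # same shuffled order for all methods
        pred = predict(query, context)       # evaluate before adapting
        correct += int(pred.strip() == answer.strip())   # exact-match accuracy
        context = update_context(context, query, pred)
    return correct / len(samples)
```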
4.2 Baselines and Methods
Base LLM. The base model is evaluated directly on each benchmark without any context engineering,
using the default prompts provided by dataset authors. For AppWorld, we follow the official ReAct [52]
implementation released by the benchmark authors, and build all other baselines and methods on top of this
framework.
In-Context Learning (ICL) [3]. ICL provides the model with task demonstrations in the input prompt
(few-shot or many-shot). This allows the model to infer the task format and desired output without weight
updates. We supply all training samples when they fit within the model’s context window; otherwise, we fill
the window with as many demonstrations as possible.
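Packing demonstrations under a token budget can be done greedily, as in the sketch below; `count_tokens` and the budget are stand-ins for the actual tokenizer and context limit.

```python
# Greedy demonstration packing for the ICL baseline: include as many training
# examples as fit within the model's context window.
from typing import Callable, List

def pack_demos(demos: List[str], count_tokens: Callable[[str], int],
               budget: int) -> str:
    packed, used = [], 0
    for demo in demos:
        cost = count_tokens(demo)
        if used + cost > budget:             # stop once the window would overflow
            break
        packed.append(demo)
        used += cost
    return "\n\n".join(packed)
```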
MIPROv2 [36]. MIPROv2 is a popular prompt optimizer for LLM applications that works by jointly
optimizing system instructions and in-context demonstrations via Bayesian optimization. We use the official
DSPy implementation [15], setting auto="heavy" to maximize optimization performance.
GEPA [4]. GEPA (Genetic-Pareto) is a sample-efficient prompt optimizer based on reflective prompt
evolution. It collects execution traces (reasoning, tool calls, intermediate outputs) and applies natural-
language reflection to diagnose errors, assign credit, and propose prompt updates. A genetic Pareto search
maintains a frontier of high-performing prompts, mitigating local optima. Empirically, GEPA outperforms
reinforcement learning methods such as GRPO and prompt optimizers like MIPROv2, achieving up to
10–20% higher accuracy with as much as 35× fewer rollouts. We use the official DSPy implementation [14],
setting auto="heavy" to maximize optimization performance.
Dynamic Cheatsheet (DC) [41]. DC is a test-time learning approach that introduces an adaptive external
memory of reusable strategies and code snippets. By continuously updating this memory with newly
encountered inputs and outputs, DC enables models to accumulate knowledge and reuse it across tasks,
often leading to substantial improvements over static prompting methods. A key advantage of DC is that it
does not require ground-truth labels: the model can curate its own memory from its generations, making
the method highly flexible and broadly applicable. We use the official implementation released by the
authors [42] and set it to use the cumulative mode (DC-CU).
ACE (ours). ACE optimizes LLM contexts for both offline and online adaptation through an agentic context
engineering framework. To ensure fairness, we use the same LLM for the Generator, Reflector, and Curator
(non-thinking mode of DeepSeek-V3.1 [13]), preventing knowledge transfer from a stronger Reflector or
Curator to a weaker Generator. This isolates the benefit of context construction itself. We adopt a batch size
of 1 (constructing a delta context from each sample). We set the maximum number of Reflector refinement
rounds and the maximum number of epochs in offline adaptation to 5.
4.3 Results on Agent Benchmark
Analysis. As shown in Table 1, ACE consistently improves over strong baselines on the AppWorld
benchmark. In the offline setting, ReAct + ACE outperforms both ReAct + ICL and ReAct + GEPA by
significant margins (12.3% and 11.9%, respectively), demonstrating that structured, evolving, and detailed
contexts enable more effective agent learning than fixed demonstrations or single optimized instruction
prompts. These gains extend to the online setting, where ACE continues to outperform prior adaptive
methods such as Dynamic Cheatsheet by an average of 7.6%.
In the agent use case, ACE remains effective even without access to ground-truth labels during adaptation:
ReAct + ACE achieves an average improvement of 14.8% over the ReAct baseline in this setting. This
robustness arises because ACE leverages signals naturally available during execution (e.g., code execution
success or failure) to guide the Reflector and Curator in forming structured lessons of successes and failures.
Together, these results establish ACE as a strong and versatile framework for building self-improving agents
that adapt reliably both with and without labeled supervision.
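As a simplified illustration, the label-free signal can be as coarse as whether the agent's generated code ran without error; this outcome is then attached to the trajectory before it reaches the Reflector. `run_in_env` is a hypothetical wrapper around the AppWorld REPL, not an API from the benchmark.

```python
# Sketch of turning raw execution feedback into a supervision-free signal for
# the Reflector. `run_in_env` is a hypothetical environment wrapper.
from typing import Any, Callable, Dict

def execution_feedback(code: str, run_in_env: Callable[[str], Any]) -> Dict[str, Any]:
    try:
        output = run_in_env(code)            # execute the agent-generated code
        return {"success": True, "output": output}
    except Exception as err:                 # runtime failures become lesson material
        return {"success": False, "error": repr(err)}
```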
Notably, on the latest AppWorld leaderboard (as of September 20, 2025; Figure 5), on average, ReAct +
ACE (59.4%) matches the top-ranked IBM CUGA (60.3%), a production-level GPT-4.1–based agent [35],
despite using the smaller open-source model DeepSeek-V3.1. With online adaptation, ReAct + ACE even
Method | GT Labels | Test-Normal TGC ↑ | Test-Normal SGC ↑ | Test-Challenge TGC ↑ | Test-Challenge SGC ↑ | Average
DeepSeek-V3.1 as Base LLM
ReAct | - | 63.7 | 42.9 | 41.5 | 21.6 | 42.4
Offline Adaptation
ReAct + ICL | ✓ | 64.3 (+0.6) | 46.4 (+3.5) | 46.0 (+4.5) | 27.3 (+5.7) | 46.0 (+3.6)
ReAct + GEPA | ✓ | 64.9 (+1.2) | 44.6 (+1.7) | 46.0 (+4.5) | 30.2 (+8.6) | 46.4 (+4.0)
ReAct + ACE | ✓ | 76.2 (+12.5) | 64.3 (+21.4) | 57.3 (+15.8) | 39.6 (+18.0) | 59.4 (+17.0)
ReAct + ACE | ✗ | 75.0 (+11.3) | 64.3 (+21.4) | 54.4 (+12.9) | 35.2 (+13.6) | 57.2 (+14.8)
Online Adaptation
ReAct + DC (CU) | ✗ | 65.5 (+1.8) | 58.9 (+16.0) | 52.3 (+10.8) | 30.8 (+9.2) | 51.9 (+9.5)
ReAct + ACE | ✗ | 69.6 (+5.9) | 53.6 (+10.7) | 66.0 (+24.5) | 48.9 (+27.3) | 59.5 (+17.1)
Table 1: Results on the AppWorld Agent Benchmark. "GT labels" indicates whether ground-truth labels are
available to the Reflector during adaptation. We evaluate the ACE framework against multiple baselines
on top of the official ReAct implementation, both for offline and online context adaptation. ReAct + ACE
outperforms selected baselines by an average of 10.6%, and could achieve good performance even without
access to GT labels.
Method | GT Labels | FiNER (Acc ↑) | Formula (Acc ↑) | Average
DeepSeek-V3.1 as Base LLM
Base LLM | - | 70.7 | 67.5 | 69.1
Offline Adaptation
ICL | ✓ | 72.3 (+1.6) | 67.0 (−0.5) | 69.6 (+0.5)
MIPROv2 | ✓ | 72.4 (+1.7) | 69.5 (+2.0) | 70.9 (+1.8)
GEPA | ✓ | 73.5 (+2.8) | 71.5 (+4.0) | 72.5 (+3.4)
ACE | ✓ | 78.3 (+7.6) | 85.5 (+18.0) | 81.9 (+12.8)
ACE | ✗ | 71.1 (+0.4) | 83.0 (+15.5) | 77.1 (+8.0)
Online Adaptation
DC (CU) | ✓ | 74.2 (+3.5) | 69.5 (+2.0) | 71.8 (+2.7)
DC (CU) | ✗ | 68.3 (−2.4) | 62.5 (−5.0) | 65.4 (−3.7)
ACE | ✓ | 76.7 (+6.0) | 76.5 (+9.0) | 76.6 (+7.5)
ACE | ✗ | 67.3 (−3.4) | 78.5 (+11.0) | 72.9 (+3.8)
Table 2: Results on Financial Analysis Benchmark. "GT labels" indicates whether ground-truth labels
are available to the Reflector during adaptation. With GT labels, ACE outperforms selected baselines by
an average of 8.6%, highlighting the advantage of structured and evolving contexts for domain-specific
reasoning. However, we also observe that in the absence of reliable feedback signals (e.g., ground-truth
labels or execution outcomes), both ACE and other adaptive methods such as Dynamic Cheatsheet may
degrade, suggesting that context adaptation depends critically on feedback quality.
surpasses IBM CUGA by 8.4% in TGC and 0.7% in SGC on the harder test-challenge split, underscoring the
effectiveness of ACE in building comprehensive and self-evolving contexts for agents.
4.4 Results on Domain-Specific Benchmark
Analysis. As shown in Table 2, ACE delivers strong improvements on financial analysis benchmarks.
In the offline setting, when provided with ground-truth answers from the training split, ACE surpasses
ICL, MIPROv2, and GEPA by clear margins (an average of 10.9%), showing that structured and evolving
contexts are particularly effective when tasks require precise domain knowledge (e.g., financial concepts,
Method | GT Labels | Test-Normal TGC ↑ | Test-Normal SGC ↑ | Test-Challenge TGC ↑ | Test-Challenge SGC ↑ | Average
DeepSeek-V3.1 as Base LLM
ReAct | - | 63.7 | 42.9 | 41.5 | 21.6 | 42.4
Offline Adaptation
ReAct + ACE w/o Reflector or multi-epoch | ✓ | 70.8 (+7.1) | 55.4 (+12.5) | 55.9 (+14.4) | 38.1 (+16.5) | 55.1 (+12.7)
ReAct + ACE w/o multi-epoch | ✓ | 72.0 (+8.3) | 60.7 (+17.8) | 54.9 (+13.4) | 39.6 (+18.0) | 56.8 (+14.4)
ReAct + ACE | ✓ | 76.2 (+12.5) | 64.3 (+21.4) | 57.3 (+15.8) | 39.6 (+18.0) | 59.4 (+17.0)
Online Adaptation
ReAct + ACE | ✗ | 67.9 (+4.2) | 51.8 (+8.9) | 61.4 (+19.9) | 43.2 (+21.6) | 56.1 (+13.7)
ReAct + ACE + offline warmup | ✗ | 69.6 (+5.9) | 53.6 (+10.7) | 66.0 (+24.5) | 48.9 (+27.3) | 59.5 (+17.1)
Table 3: Ablation Studies on AppWorld. We study how particular design choices of ACE (iterative
refinement, multi-epoch adaptation, and offline warmup) contribute to high-quality context adaptation.
(a) Offline (AppWorld)
Method | Latency (s) ↓ | # Rollouts ↓
ReAct + GEPA | 53898 | 1434
ReAct + ACE | 9517 (−82.3%) | 357 (−75.1%)

(b) Online (FiNER)
Method | Latency (s) ↓ | Token Cost ($) ↓
DC (CU) | 65104 | 17.7
ACE | 5503 (−91.5%) | 2.9 (−83.6%)
Table 4: Cost and Speed Analysis. We measure the context adaptation latency, number of rollouts, and
dollar costs of ACE against GEPA (offline) and DC (online).
XBRL rules) that goes beyond fixed demonstrations or monolithic optimized prompts. In the online setting,
ACE continues to exceed prior adaptive methods such as DC by an average of 6.2%, further confirming the
benefit of agentic context engineering for accumulating reusable insights across specialized domains.
Moreover, we also observe that when ground-truth supervision or reliable execution signals are absent,
both ACE and DC may degrade in performance. In such cases, the constructed context can be polluted by
spurious or misleading signals, highlighting a potential limitation of inference-time adaptation without
reliable feedback. This suggests that while ACE is robust under rich feedback (e.g., code execution results or
formula correctness in agent tasks), its effectiveness depends on the availability of signals that allow the
Reflector and Curator to make sound judgments. We return to this limitation in Appendix B.
4.5 Ablation Study
Table 3 reports ablation studies on the AppWorld benchmark, analyzing how individual design choices
of ACE contribute to effective context adaptation. We examine three factors: (1) the Reflector with iterative
refinement, our addition to the agentic framework beyond Dynamic Cheatsheet, (2) multi-epoch adaptation,
which refines contexts over training samples multiple times, and (3) offline warmup, which initializes the
context through offline adaptation before online adaptation begins.
4.6 Cost and Speed Analysis
Due to its support for incremental "delta" context updates and non-LLM-based context merging and de-
duplication, ACE demonstrates particular advantages in reducing the cost (in terms of the number of rollouts
or the amount of dollar cost for token ingestion/generation) and latency of adaptation.
As examples, on the offline adaptation of AppWorld, ACE achieves an 82.3% reduction in adaptation latency
and a 75.1% reduction in the number of rollouts compared to GEPA (Table 4(a)). On the online adaptation
of FiNER, ACE achieves a 91.5% reduction in adaptation latency and an 83.6% reduction in the dollar cost of
token ingestion and generation compared to DC (Table 4(b)).
5 Discussion
Longer Context ≠ Higher Serving Cost. Although ACE produces longer contexts than methods such
as GEPA, this does not translate to linearly higher inference cost or GPU memory usage. Modern serving
infrastructures are increasingly optimized for long-context workloads through techniques such as the
reuse [17, 51], compression [30, 32], and offloading [25] of the KV cache. These mechanisms allow frequently reused
context segments to be cached locally or remotely, avoiding repetitive and expensive prefill operations.
Ongoing advances in ML systems suggest that the amortized cost of handling long contexts will continue to
decrease, making context-rich approaches like ACE increasingly practical in deployment.
Implications for Online and Continuous Learning. Online and continuous learning are key research
directions in machine learning for addressing issues like distribution shifts [19, 24] and limited training
data [21, 37, 60]. ACE offers a flexible and efficient alternative to conventional model fine-tuning, as
adapting contexts is generally cheaper than updating model weights [9, 20, 26, 28]. Moreover, because
contexts are human-interpretable, ACE enables selective unlearning [8, 10, 29]—whether due to privacy or
legal constraints [1, 2], or when outdated or incorrect information is identified by domain experts. These are
promising directions for future work, where ACE could play a central role in advancing continuous and
responsible learning.
References
[1] General Data Protection Regulation article 17: Right to erasure. EU Regulation 2016/679, 2016. Official
consolidated text.
[2] California consumer privacy act, civil code §1798.105: Right to delete. State of California Civil Code,
2018.
[3] Rishabh Agarwal, Avi Singh, Lei Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang,
Ankesh Anand, Zaheer Abbas, Azade Nova, et al. Many-shot in-context learning. Advances in Neural
Information Processing Systems, 37:76930–76966, 2024.
[4] Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav
Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can
outperform reinforcement learning. arXiv preprint arXiv:2507.19457, 2025.
[5] AppWorld. Leaderboard. https://appworld.dev/leaderboard, 2025. Accessed: 2025-09-20.
[6] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to
retrieve, generate, and critique through self-reflection. 2024.
[7] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican,
George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving
language models by retrieving from trillions of tokens. In International conference on machine learning,
pages 2206–2240. PMLR, 2022.
[8] Lucas Bourtoule, Varun Chandrasekaran, Christopher Choquette-Choo, Hengrui Jia, Adelin Travers,
Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. IEEE Symposium on Security and
Privacy, pages 141–159, 2021.
[9] Tom Brown et al. Language models are few-shot learners. In NeurIPS, 2020.
[10] Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine unlearning. In IEEE
Symposium on Security and Privacy, 2015.
[11] Tianxiang Chen, Zhentao Tan, Xiaofan Bo, Yue Wu, Tao Gong, Qi Chu, Jieping Ye, and Nenghai Yu.
Flora: Effortless context construction to arbitrary length and scale. arXiv preprint arXiv:2507.19786, 2025.
[12] Yeounoh Chung, Gaurav T Kakkar, Yu Gan, Brenton Milne, and Fatma Ozcan. Is long context all you
need? leveraging llm’s extended context for nl2sql. arXiv preprint arXiv:2501.12372, 2025.
[13] DeepSeek-AI. Deepseek-v3 technical report, 2024.
[14] DSPy. dspy.gepa: Reflective prompt optimizer. https://dspy.ai/api/optimizers/GEPA/overview/,
2025. Accessed: 2025-09-24.
[15] DSPy. dspy.miprov2. https://dspy.ai/api/optimizers/MIPROv2/, 2025. Accessed: 2025-09-24.
[16] Shuzheng Gao, Chaozheng Wang, Cuiyun Gao, Xiaoqian Jiao, Chun Yong Chong, Shan Gao, and
Michael Lyu. The prompt alchemist: Automated llm-tailored prompt optimization for test case genera-
tion. arXiv preprint arXiv:2501.01329, 2025.
[17] In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. Prompt
cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems,
6:325–338, 2024.
[18] Neel Guha, Julian Nyarko, Daniel Ho, Christopher Ré, Adam Chilton, Alex Chohlas-Wood, Austin
Peters, Brandon Waldon, Daniel Rockmore, Diego Zambrano, et al. Legalbench: A collaboratively built
benchmark for measuring legal reasoning in large language models. Advances in neural information
processing systems, 36:44123–44279, 2023.
[19] Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. In ICLR, 2021.
[20] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and
Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv:2106.09685, 2021.
[21] Maxwell L Hutchinson, Erin Antono, Brenna M Gibbons, Sean Paradiso, Julia Ling, and Bryce Meredig.
Overcoming data scarcity with transfer learning. arXiv preprint arXiv:1711.05099, 2017.
[22] Mingjian Jiang, Yangjun Ruan, Luis Lastras, Pavan Kapanipathi, and Tatsunori Hashimoto. Putting it
all into context: Simplifying agents with lclms. arXiv preprint arXiv:2505.08120, 2025.
[23] Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish
Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint
arXiv:2210.02406, 2022.
[24] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubra-
mani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A benchmark of
in-the-wild distribution shifts. In International conference on machine learning, pages 5637–5664. PMLR,
2021.
[25] Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. InfiniGen: Efficient generative inference
of large language models with dynamic KV cache management. In 18th USENIX Symposium on
Operating Systems Design and Implementation (OSDI 24), pages 155–172, 2024.
[26] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt
tuning. In EMNLP, 2021.
[27] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal,
Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for
knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020.
[28] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. ACL,
2021.
[29] Shiyang Liu et al. Rethinking machine unlearning for large language models. arXiv:2402.08787, 2024.
[30] Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi
Yao, Shan Lu, Ganesh Ananthanarayanan, et al. Cachegen: Kv cache compression and streaming for
fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, pages 38–56,
2024.
[31] Zhining Liu, Rana Ali Amjad, Ravinarayana Adkathimar, Tianxin Wei, and Hanghang Tong. Selfelicit:
Your language model secretly knows where is the relevant evidence. arXiv preprint arXiv:2502.08767,
2025.
[32] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and
Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750,
2024.
[33] Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Prodromos Malakasiotis, Ion
Androutsopoulos, and Georgios Paliouras. Finer: Financial numeric entity recognition for xbrl tagging.
arXiv preprint arXiv:2203.06482, 2022.
[34] Yansheng Mao, Jiaqi Li, Fanxu Meng, Jing Xiong, Zilong Zheng, and Muhan Zhang. Lift: Improving
long context understanding through long input fine-tuning. arXiv preprint arXiv:2412.13626, 2024.
[35] Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov, Ido Levy, Offer Akrabi, Aviad Sela, Asaf Adi, and
Nir Mashkif. Towards enterprise-ready computer using generalist agent. arXiv preprint arXiv:2503.01861,
2025.
[36] Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and
Omar Khattab. Optimizing instructions and demonstrations for multi-stage language model programs.
arXiv preprint arXiv:2406.11695, 2024.
[37] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and
Data Engineering, 22(10):1345–1359, 2010.
[38] Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model
connected with massive apis. Advances in Neural Information Processing Systems, 37:126544–126565, 2024.
[39] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window
extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
[40] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion:
Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems,
36:8634–8652, 2023.
[41] Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. Dynamic cheatsheet:
Test-time learning with adaptive memory. arXiv preprint arXiv:2504.07952, 2025.
[42] Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. Dynamic cheatsheet:
Test-time learning with adaptive memory. https://github.com/suzgunmirac/dynamic-cheatsheet,
2025. Accessed: 2025-09-24.
[43] Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank
Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps
and people for benchmarking interactive coding agents. arXiv preprint arXiv:2407.18901, 2024.
[44] Dannong Wang, Jaisal Patel, Daochen Zha, Steve Y Yang, and Xiao-Yang Liu. Finlora: Benchmarking
lora methods for fine-tuning llms on financial datasets. arXiv preprint arXiv:2505.19819, 2025.
[45] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery,
and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv
preprint arXiv:2203.11171, 2022.
[46] Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. arXiv
preprint arXiv:2409.07429, 2024.
[47] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou,
et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural
information processing systems, 35:24824–24837, 2022.
[48] Wujiang Xu, Kai Mei, Hang Gao, Juntao Tan, Zujie Liang, and Yongfeng Zhang. A-mem: Agentic
memory for llm agents. arXiv preprint arXiv:2502.12110, 2025.
[49] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and
Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in
Neural Information Processing Systems, 37:50528–50652, 2024.
[50] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and
Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering.
arXiv preprint arXiv:1809.09600, 2018.
[51] Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu,
and Junchen Jiang. Cacheblend: Fast large language model serving for rag with cached knowledge
fusion. In Proceedings of the Twentieth European Conference on Computer Systems, pages 94–109, 2025.
[52] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao.
React: Synergizing reasoning and acting in language models. In International Conference on Learning
Representations (ICLR), 2023.
[53] Jiacheng Ye, Chengzu Li, Lingpeng Kong, and Tao Yu. Generating data for symbolic language with
large language models. arXiv preprint arXiv:2305.13917, 2023.
[54] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James
Zou. TextGrad: Automatic "differentiation" via text. arXiv preprint arXiv:2406.07496, 2024.
[55] Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James
Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi. The shift from models to
compound ai systems. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/, 2024.
[56] Genghan Zhang, Weixin Liang, Olivia Hsu, and Kunle Olukotun. Adaptive self-improvement llm
agentic system for ml library development. arXiv preprint arXiv:2502.02534, 2025.
[57] Qizheng Zhang, Ali Imran, Enkeleda Bardhi, Tushar Swamy, Nathan Zhang, Muhammad Shahbaz,
and Kunle Olukotun. Caravan: Practical online learning of in-network ML models with labeling
agents. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages
325–345, 2024.
[58] Qizheng Zhang, Michael Wornow, and Kunle Olukotun. Cost-efficient serving of llm agents via test-time
plan caching. arXiv preprint arXiv:2506.14852, 2025.
[59] Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun
Zhang, Kun Shao, Linyi Yang, et al. Agentfly: Fine-tuning llm agents without fine-tuning llms. arXiv
preprint arXiv:2508.16153, 2025.
[60] Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and
Qing He. A comprehensive survey on transfer learning. arXiv:1911.02685, 2019.
A Related Work on Agent Memory
A growing body of work explores how agents can accumulate experience from past trajectories and leverage
external (often non-parametric) memory to guide future actions. AgentFly [59] presents an extensible frame-
work where memory evolves continuously as agents solve tasks, enabling scalable reinforcement learning
and long-horizon reasoning across diverse environments. AWM (Agent Workflow Memory) [46] induces
reusable workflows—structured routines distilled from past trajectories—and selectively injects them into
memory to improve efficiency and generalization in web navigation benchmarks. A-MEM [48] introduces
a dynamically organized memory system inspired by the Zettelkasten method: each stored memory is
annotated with structured attributes (e.g., tags, keywords, contextual descriptions) and automatically linked
to relevant past entries, while existing entries are updated to integrate new knowledge, yielding adaptive
and context-aware retrieval. Agentic Plan Caching [58] instead focuses on cost efficiency by extracting
reusable plan templates from agent trajectories and caching them for fast execution at test time.
Together, these works demonstrate the value of external memory for improving adaptability, efficiency, and
generalization in LLM agents. Our work differs by tackling the broader challenge of context adaptation, which
spans not only agent memory but also system prompts, factual evidence, and other inputs underpinning AI
systems. We further highlight two fundamental limitations of existing adaptation methods—brevity bias and
context collapse—and show that addressing them is essential for robustness, reliability, and scalability beyond
raw task performance. Accordingly, our evaluation considers not only accuracy but also cost, latency, and
scalability.
B Limitations and Challenges
A potential limitation of ACE is its reliance on a reasonably strong Reflector: if the Reflector fails to extract
meaningful insights from generated traces or outcomes, the constructed context may become noisy or even
harmful. In domain-specific tasks where no model can extract useful insights, the resulting context will
naturally lack them. This dependency is similar to Dynamic Cheatsheet [41], where the quality of adaptation
hinges on the underlying model’s ability to curate memory. We also note that not all applications require
rich or detailed contexts. Tasks like HotPotQA [50] often benefit more from concise, high-level instructions
(e.g., how to retrieve and synthesize evidence) than from long contexts. Similarly, games with fixed strategies
such as Game of 24 [41] may only need a single reusable rule, rendering additional context redundant.
Overall, ACE is most beneficial in settings that demand detailed domain knowledge, complex tool use,
or environment-specific strategies that go beyond what is already embedded in model weights or simple
system instructions.
C AppWorld Leaderboard Snapshot (09/2025)
Figure 5: The AppWorld leaderboard as accessed on 09/20/2025.
D Prompts
We release the language model prompts used in our agentic context engineering framework as well as the
baselines to support research transparency and reproducibility.
I am your supervisor and you are a super intelligent AI Assistant whose job is to achieve my day-to-day tasks completely autonomously.
To do this, you will need to interact with app/s (e.g., spotify, venmo etc) using their associated APIs on my behalf. For this you will
undertake a multi-step conversation using a python REPL environment. That is, you will write the python code and the environment will
execute it and show you the result, based on which, you will write python code for the next step and so on, until you’ve achieved the goal.
This environment will let you interact with app/s using their associated APIs on my behalf.
Here are three key APIs that you need to know to get more information
# To get a list of apps that are available to you.
print(apis.api_docs.show_app_descriptions())
# To get the list of apis under any app listed above, e.g. spotify
print(apis.api_docs.show_api_descriptions(app_name='spotify'))
# To get the specification of a particular api, e.g. spotify app's login api
print(apis.api_docs.show_api_doc(app_name='spotify', api_name='login'))
Each code execution will produce an output that you can use in subsequent calls. Using these APIs, you can now generate code, that I will
execute, to solve the task.
Let’s start with the task
[3 shot example]
Key instructions:
1. Make sure to end code blocks with ``` followed by a newline().
2. Remember you can use the variables in your code in subsequent code blocks.
3. Remember that the email addresses, access tokens and variables (e.g. spotify_password) in the example above are not valid
anymore.
4. You can use the “supervisor” app to get information about my accounts and use the “phone” app to get information about friends
and family.
5. Always look at API specifications (using apis.api_docs.show_api_doc) before calling an API.
6. Write small chunks of code and only one chunk of code in every step. Make sure everything is working correctly before making any
irreversible change.
7. Many APIs return items in “pages”. Make sure to run through all the pages by looping over page_index.
8. Once you have completed the task, make sure to call apis.supervisor.complete_task(). If the task asked for some information,
return it as the answer argument, i.e. call apis.supervisor.complete_task(answer=<answer>). Many tasks do not require an
answer, so in those cases, just call apis.supervisor.complete_task() i.e. do not pass any argument.
Using these APIs, generate code to solve the actual task:
My name is: {{ main_user.first_name }} {{ main_user.last_name }}. My personal email is {{ main_user.email }} and phone number is {{
main_user.phone_number }}.
Task: {{ input_str }}
Figure 6: ICL-baseline Generator prompt on AppWorld
I am your supervisor and you are a super intelligent AI Assistant whose job is to achieve my day-to-day tasks completely autonomously.
You will be given a cheatsheet containing relevant strategies, patterns, and examples from similar problems to apply and solve the
current task.
To do this, you will need to interact with app/s (e.g., spotify, venmo etc) using their associated APIs on my behalf. For this you will
undertake a multi-step conversation using a python REPL environment. That is, you will write the python code and the environment will
execute it and show you the result, based on which, you will write python code for the next step and so on, until you’ve achieved the goal.
This environment will let you interact with app/s using their associated APIs on my behalf.
Here are three key APIs that you need to know to get more information
# To get a list of apps that are available to you.
print(apis.api_docs.show_app_descriptions())
# To get the list of apis under any app listed above, e.g. spotify
print(apis.api_docs.show_api_descriptions(app_name='spotify'))
# To get the specification of a particular api, e.g. spotify app's login api
print(apis.api_docs.show_api_doc(app_name='spotify', api_name='login'))
Each code execution will produce an output that you can use in subsequent calls. Using these APIs, you can now generate code, that I will
execute, to solve the task.
CHEATSHEET: ''' {{ cheat_sheet }} '''
1. ANALYSIS & STRATEGY
Carefully analyze both the question and cheatsheet before starting
Search for and identify any applicable patterns, strategies, or examples within the cheatsheet
Create a structured approach to solving the problem at hand
Review and document any limitations in the provided reference materials
2. SOLUTION DEVELOPMENT
Present your solution using clear, logical steps that others can follow and review
Explain your reasoning and methodology before presenting final conclusions
Provide detailed explanations for each step of the process
Check and verify all assumptions and intermediate calculations
3. PROGRAMMING TASKS
When coding is required: - Write clean, efficient Python code - Follow the strict code formatting and execution protocol (always use the
Python code formatting block; furthermore, after the code block, always explicitly request execution by appending: “EXECUTE CODE!”):
```python
# Your code here
```
EXECUTE CODE!
All required imports and dependencies should be clearly declared at the top of your code
Include clear inline comments to explain any complex programming logic
Perform result validation after executing your code
Apply optimization techniques from the cheatsheet when applicable
The code should be completely self-contained without external file dependencies–it should be ready to be executed right away
Do not include any placeholders, system-specific paths, or hard-coded local paths
Feel free to use standard and widely-used pip packages
Opt for alternative methods if errors persist during execution
Exclude local paths and engine-specific settings (e.g., avoid configurations like
chess.engine.SimpleEngine.popen_uci("/usr/bin/stockfish"))
Let’s start with the task
[3 shot example]
Key instructions:
1. Make sure to end code blocks with ``` followed by a newline().
2. Remember you can use the variables in your code in subsequent code blocks.
3. Remember that the email addresses, access tokens and variables (e.g. spotify_password) in the example above are not valid
anymore.
4. You can use the “supervisor” app to get information about my accounts and use the “phone” app to get information about friends
and family.
5. Always look at API specifications (using apis.api_docs.show_api_doc) before calling an API.
6. Write small chunks of code and only one chunk of code in every step. Make sure everything is working correctly before making
any irreversible change.
7. Many APIs return items in “pages”. Make sure to run through all the pages by looping over page_index.
8. Once you have completed the task, make sure to call apis.supervisor.complete_task(). If the task asked for some information,
return it as the answer argument, i.e. call apis.supervisor.complete_task(answer=<answer>). Many tasks do not require an
answer, so in those cases, just call apis.supervisor.complete_task() i.e. do not pass any argument.
Using these APIs, generate code to solve the actual task:
My name is: {{ main_user.first_name }} {{ main_user.last_name }}. My personal email is {{ main_user.email }} and phone number is {{
main_user.phone_number }}. Task: {{ input_str }}
Figure 7: Dynamic Cheatsheet Generator prompt on AppWorld
I am your supervisor and you are a super intelligent AI Assistant whose job is to achieve my day-to-day tasks completely autonomously.
To do this, you will need to interact with app/s (e.g., spotify, venmo etc) using their associated APIs on my behalf. For this you will
undertake a multi-step conversation using a python REPL environment. That is, you will write the python code and the environment will
execute it and show you the result, based on which, you will write python code for the next step and so on, until you’ve achieved the goal.
This environment will let you interact with app/s using their associated APIs on my behalf.
Here are three key APIs that you need to know to get more information:
# To get a list of apps that are available to you.
print(apis.api_docs.show_app_descriptions())
# To get the list of apis under any app listed above, e.g. spotify
print(apis.api_docs.show_api_descriptions(app_name='spotify'))
# To get the specification of a particular api, e.g. spotify app's login api
print(apis.api_docs.show_api_doc(app_name='spotify', api_name='login'))
Each code execution will produce an output that you can use in subsequent calls. Using these APIs, you can now generate code, that I will
execute, to solve the task.
Key Instructions:
1. Always end code blocks with ``` followed by a newline().
2. Remember you can use variables in your code in subsequent code blocks.
3. Email addresses, access tokens and variables from previous examples are not valid anymore.
4. Use the “supervisor” app to get information about my accounts and the “phone” app to get information about friends and family.
5. Always look at API specifications (using apis.api_docs.show_api_doc) before calling an API.
6. Write small chunks of code and only one chunk of code in every step. Make sure everything is working correctly before making
any irreversible changes.
7. Many APIs return items in “pages”. Make sure to run through all the pages by looping over page_index.
8. Once you have completed the task, call apis.supervisor.complete_task(). If the task asked for information, return it as the
answer argument: apis.supervisor.complete_task(answer=<answer>). For tasks without required answers, just call
apis.supervisor.complete_task() without arguments.
Domain-Specific Strategy for Bill Splitting Tasks: When splitting bills among roommates, remember to:
- First identify roommates using phone app’s search_contacts with “roommate” relationship query
- Access bill receipts in file system under “/home/[username]/bills/” directory structure
- Calculate equal shares by dividing total amount by (number of roommates + 1) including yourself
- Use Venmo’s create_payment_request API with roommates’ email addresses
- Ensure payment requests are only sent to actual roommates (not coworkers or other contacts)
- Verify that all roommates have the same home address in their contact information
- Use the description “I paid for cable bill.” for payment requests
Domain-Specific Strategy for File Organization Tasks: When organizing files based on creation dates, remember to:
- First login to the file system using credentials from supervisor
- Use show_directory() to list files and show_file() to get file metadata including created_at
- Create destination directories using create_directory() before moving files
- Use move_file() to organize files while maintaining original filenames
- Files created in specific months should be moved to corresponding destination directories (e.g., March → Rome, April → Santorini, others → Berlin)
Domain-Specific Strategy for Music Playlist Tasks: When creating playlists for specific durations, remember to:
- Calculate total duration needed (e.g., 90 minutes = 5400 seconds)
- Search for appropriate songs across different genres (workout, energetic, rock, pop, dance)
- Use show_song() to get individual song durations
- Add songs to playlist until total duration requirement is met
- Use play_music() with playlist_id to start playback
Domain-Specific Strategy for File Compression Tasks: When compressing vacation photo directories, remember to:
- Compress each vacation spot directory individually
- Save compressed files in the specified destination path format (e.g., “~/photographs/vacations/.zip”)
- Delete the original directories after successful compression
- Verify that the compressed files are created in the correct location
Domain-Specific Strategy for Alarm Management Tasks: When modifying phone alarms, remember to:
- Identify the specific alarm by its label (e.g., “Wake Up”)
- Calculate new times accurately (convert HH:MM to minutes for arithmetic operations)
- Disable all other enabled alarms except the one being modified
- Preserve all other alarm settings while making changes
Domain-Specific Strategy for Message Management Tasks: When handling text/voice messages, remember to:
- Use search functions to find specific messages by phone number or content
- Handle pagination to ensure all relevant messages are processed
- Delete messages using their specific message IDs
- Verify deletion by checking that no messages remain
Let’s start with the task:
Figure 8: GEPA prompt on AppWorld
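The bill-splitting strategy above hides one easy-to-miss detail: the divisor counts the supervisor as well as the roommates. The one-function sketch below illustrates that arithmetic; the function name and example figures are ours and not part of the prompt or benchmark.

def equal_share(total_amount, num_roommates):
    # Divide by (num_roommates + 1) because the split includes yourself as well.
    return round(total_amount / (num_roommates + 1), 2)

# A $120.00 cable bill split with 3 roommates leaves each roommate owing $30.00,
# which is the amount to put in each Venmo payment request.
assert equal_share(120.00, 3) == 30.00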
I am your supervisor and you are a super intelligent AI Assistant whose job is to achieve my day-to-day tasks completely autonomously.
To do this, you will need to interact with app/s (e.g., spotify, venmo etc) using their associated APIs on my behalf. For this you will
undertake a multi-step conversation using a python REPL environment. That is, you will write the python code and the environment will
execute it and show you the result, based on which, you will write python code for the next step and so on, until you’ve achieved the goal.
This environment will let you interact with app/s using their associated APIs on my behalf.
Here are three key APIs that you need to know to get more information:
# To get a list of apps that are available to you.
print(apis.api_docs.show_app_descriptions())
# To get the list of apis under any app listed above, e.g. spotify
print(apis.api_docs.show_api_descriptions(app_name='spotify'))
# To get the specification of a particular api, e.g. spotify app's login api
print(apis.api_docs.show_api_doc(app_name='spotify', api_name='login'))
Each code execution will produce an output that you can use in subsequent calls. Using these APIs, you can now generate code, that I will
execute, to solve the task.
You are also provided with a curated cheatsheet of strategies, API-specific information, common mistakes, and proven solutions to help
you solve the task effectively.
ACE Playbook: - Read the Playbook first, then execute the task by explicitly leveraging each relevant section:
PLAYBOOK_BEGIN
{{ playbook }}
PLAYBOOK_END
Let’s start with the task
[3 shot example]
Key instructions:
1. Make sure to end code blocks with ``` followed by a newline (\n).
2. Remember you can use the variables in your code in subsequent code blocks.
3. Remember that the email addresses, access tokens and variables (e.g. spotify_password) in the example above are not valid
anymore.
4. You can use the “supervisor” app to get information about my accounts and use the “phone” app to get information about friends
and family.
5. Always look at API specifications (using apis.api_docs.show_api_doc) before calling an API.
6. Write small chunks of code and only one chunk of code in every step. Make sure everything is working correctly before making
any irreversible change.
7. Many APIs return items in “pages”. Make sure to run through all the pages by looping over page_index.
8. Once you have completed the task, make sure to call apis.supervisor.complete_task(). If the task asked for some information,
return it as the answer argument, i.e. call apis.supervisor.complete_task(answer=<answer>). Many tasks do not require an
answer, so in those cases, just call apis.supervisor.complete_task() i.e. do not pass any argument.
9. Treat the cheatsheet as a tool. Use only the parts that are relevant and applicable to your specific situation and task context,
otherwise use your own judgement.
Using these APIs and cheatsheet, generate code to solve the actual task:
My name is: {{ main_user.first_name }} {{ main_user.last_name }}. My personal email is {{ main_user.email }} and phone number is {{
main_user.phone_number }}. Task: {{ input_str }}
Figure 9: ACE Generator prompt on AppWorld
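The {{ playbook }} placeholder between PLAYBOOK_BEGIN and PLAYBOOK_END is where the evolving context is injected at inference time. The paper does not spell out the templating code, so the sketch below uses plain string substitution and a bullet serialization modeled on the '[ctx-00263] helpful=1 harmful=0 ::' format mentioned in the Curator prompt; both are assumptions for illustration, not the actual implementation.

GENERATOR_TEMPLATE = """PLAYBOOK_BEGIN
{playbook}
PLAYBOOK_END
Let's start with the task"""

def render_playbook(bullets):
    # Serialize each bullet as "[id] helpful=H harmful=K :: content", one per line.
    return "\n".join(
        f"[{b['id']}] helpful={b['helpful']} harmful={b['harmful']} :: {b['content']}"
        for b in bullets
    )

bullets = [
    {"id": "ctx-00001", "helpful": 2, "harmful": 0,
     "content": "Resolve relationships via the Phone app's contacts, never via transaction descriptions."},
    {"id": "ctx-00002", "helpful": 1, "harmful": 0,
     "content": "Paginate with a while True loop over page_index until an empty page is returned."},
]
prompt = GENERATOR_TEMPLATE.format(playbook=render_playbook(bullets))
print(prompt)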
You are an expert AppWorld coding agent and educator. Your job is to diagnose the current trajectory: identify what went wrong (or could be better), grounded in execution
feedback, API usage, unit test report, and ground truth when applicable.
Instructions:
- Carefully analyze the model’s reasoning trace to identify where it went wrong
- Take the environment feedback into account, comparing the predicted answer with the ground truth to understand the gap
- Identify specific conceptual errors, calculation mistakes, or misapplied strategies
- Provide actionable insights that could help the model avoid this mistake in the future
- Identify root causes: wrong source of truth, bad filters (timeframe/direction/identity), formatting issues, or missing authentication, and how to correct them
- Provide concrete, step-by-step corrections the model should take in this task
- Be specific about what the model should have done differently
- You will receive bulletpoints that are part of the playbook that’s used by the generator to answer the question
- You need to analyze these bulletpoints and give a tag for each bulletpoint; the tag can be [‘helpful’, ‘harmful’, ‘neutral’] (for the generator to generate the correct answer)
- Explicitly curate from the environment feedback the output format/schema of APIs used when unclear or mismatched with expectations (e.g., apis.blah.show_contents() returns a list of content_ids (strings), not content objects)
Inputs:
Ground truth code (reference, known-correct):
GROUND_TRUTH_CODE_START
{{ground_truth_code}}
GROUND_TRUTH_CODE_END
Test report (unit tests result for the task after the generated code was run):
TEST_REPORT_START
{{unit_test_results}}
TEST_REPORT_END
ACE playbook (playbook that’s used by model for code generation):
PLAYBOOK_START
{{playbook}}
PLAYBOOK_END
Examples:
Example 1:
Ground Truth Code: [Code that uses apis.phone.search_contacts() to find roommates, then filters Venmo transactions]
Generated Code: [Code that tries to identify roommates by parsing Venmo transaction descriptions using keywords like “rent”, “utilities”]
Execution Error: AssertionError: Expected 1068.0 but got 79.0
Test Report: FAILED - Wrong total amount calculated due to incorrect roommate identification
Response:
{{
“reasoning”: “The generated code attempted to identify roommates by parsing Venmo transaction descriptions rather than using the authoritative Phone app contacts. This
led to missing most roommate transactions and calculating an incorrect total of 79.0 instead of 1068.0.”,
“error_identification”: “The agent used unreliable heuristics (keyword matching in transaction descriptions) to identify roommates instead of the correct API (Phone
contacts).”,
“root_cause_analysis”: “The agent misunderstood the data architecture - it assumed transaction descriptions contained reliable relationship information, when the Phone
app is the authoritative source for contact relationships.”,
“correct_approach”: “First authenticate with Phone app, use apis.phone.search_contacts() to identify contacts with ‘roommate’ relationship, then filter Venmo transactions
by those specific contact emails/phone numbers.”,
“key_insight”: “Always resolve identities from the correct source app - Phone app for relationships, never rely on transaction descriptions or other indirect heuristics which
are unreliable.”
}}
Example 2:
Ground Truth Code: [Code that uses proper while True pagination loop to get all Spotify playlists]
Generated Code: [Code that uses for i in range(10) to paginate through playlists]
Execution Error: None (code ran successfully)
Test Report: FAILED - Expected 23 playlists but got 10 due to incomplete pagination
Response:
{{
“reasoning”: “The generated code used a fixed range loop (range(10)) for pagination instead of properly iterating until no more results are returned. This caused the agent
to only collect the first 10 pages of playlists, missing 13 additional playlists that existed on later pages.”,
“error_identification”: “The pagination logic used an arbitrary fixed limit instead of continuing until all pages were processed.”,
“root_cause_analysis”: “The agent used a cautious approach with a fixed upper bound to avoid infinite loops, but this prevented complete data collection when the actual
data exceeded the arbitrary limit.”,
“correct_approach”: “Use while True loop with proper break condition: continue calling the API with incrementing page_index until the API returns empty results or null,
then break.”,
“key_insight”: “For pagination, always use while True loop instead of fixed range iterations to ensure complete data collection across all available pages.”
}}
Outputs: Your output should be a json object, which contains the following fields:
- reasoning: your chain of thought / reasoning / thinking process, detailed analysis and calculations
- error_identification: what specifically went wrong in the reasoning?
- root_cause_analysis: why did this error occur? What concept was misunderstood?
- correct_approach: what should the model have done instead?
- key_insight: what strategy, formula, or principle should be remembered to avoid this error?
Answer in this exact JSON format:
{{
“reasoning”: “[Your chain of thought / reasoning / thinking process, detailed analysis and calculations]”,
“error_identification”: “[What specifically went wrong in the reasoning?]”,
“root_cause_analysis”: “[Why did this error occur? What concept was misunderstood?]”,
“correct_approach”: “[What should the model have done instead?]”,
“key_insight”: “[What strategy, formula, or principle should be remembered to avoid this error?]”,
}}
[FULL AGENT-ENVIRONMENT TRAJECTORY ATTACHED HERE]
Figure 10: ACE Reflector prompt on AppWorld
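Example 1 in the Reflector prompt turns on resolving identities from the authoritative source (Phone contacts) before filtering Venmo transactions. The sketch below replays that correct approach on hypothetical in-memory data; the record fields (email, relationship, direction, amount) are assumptions standing in for the real API responses, and the amounts simply mirror the figure's 1068.0 vs. 79.0 contrast.

def total_sent_to_roommates(contacts, transactions):
    # Resolve roommates from the contact list (the authoritative source for relationships) ...
    roommate_emails = {c["email"] for c in contacts if c["relationship"] == "roommate"}
    # ... then sum only outgoing payments addressed to those emails.
    return sum(
        t["amount"]
        for t in transactions
        if t["direction"] == "sent" and t["receiver_email"] in roommate_emails
    )

contacts = [
    {"email": "alex@example.com", "relationship": "roommate"},
    {"email": "sam@example.com", "relationship": "coworker"},
]
transactions = [
    {"receiver_email": "alex@example.com", "direction": "sent", "amount": 540.0},
    {"receiver_email": "alex@example.com", "direction": "sent", "amount": 528.0},
    {"receiver_email": "sam@example.com", "direction": "sent", "amount": 79.0},
]
# Keyword heuristics over transaction descriptions could match the wrong counterparties;
# resolving roommates from contacts recovers the full 1068.0 rather than 79.0.
assert total_sent_to_roommates(contacts, transactions) == 1068.0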
You are a master curator of knowledge. Your job is to identify what new insights should be added to an existing playbook based on a reflection from a previous attempt.
Context:
- The playbook you created will be used to help answering similar questions.
- The reflection is generated using ground truth answers that will NOT be available when the playbook is being used. So you need to come up with content that can aid the playbook user to create predictions that likely align with ground truth.
Instructions:
- Review the existing playbook and the reflection from the previous attempt
- Identify ONLY the NEW insights, strategies, or mistakes that are MISSING from the current playbook
- Avoid redundancy: if similar advice already exists, only add new content that is a perfect complement to the existing playbook
- Do NOT regenerate the entire playbook; only provide the additions needed
- Focus on quality over quantity: a focused, well-organized playbook is better than an exhaustive one
- Format your response as a PURE JSON object with specific sections
- For any operation, if there is no new content to add, return an empty list for the operations field
- Be concise and specific: each addition should be actionable
- For coding tasks, explicitly curate from the reflections the output format/schema of APIs used when unclear or mismatched with expectations (e.g., apis.blah.show_contents() returns a list of content_ids (strings), not content objects)
Task Context (the actual task instruction):
{question_context}
Current Playbook:
{current_playbook}
Current Generated Attempt (latest attempt, with reasoning and planning):
{final_generated_code}
Current Reflections (principles and strategies that helped to achieve current task):
{guidebook}
Examples:
Example 1:
Task Context: “Find money sent to roommates since Jan 1 this year”
Current Playbook: [Basic API usage guidelines]
Generated Attempt: [Code that failed because it used transaction descriptions to identify roommates instead of Phone contacts]
Reflections: “The agent failed because it tried to identify roommates by parsing Venmo transaction descriptions instead of using the Phone app’s contact relationships. This
led to incorrect identification and wrong results.”
Response:
{
"reasoning": "The reflection shows a critical error where the agent used unreliable heuristics (transaction descriptions) instead of the
authoritative source (Phone app contacts) to identify relationships. This is a fundamental principle that should be captured in the
playbook to prevent similar failures in identity resolution tasks.",
"operations": [
{
"type": "ADD",
"section": "strategies_and_hard_rules",
"content": "Always resolve identities from the correct source app\n- When you need to identify relationships (roommates, contacts, etc.),
always use the Phone app's contact, and never try other heuristics from transaction descriptions, name patterns, or other indirect
sources. These heuristics are unreliable and will cause incorrect results."
}
]
}
Example 2:
Task Context: “Count all playlists in Spotify”
Current Playbook: [Basic authentication and API calling guidelines]
Generated Attempt: [Code that used for i in range(10) loop and missed playlists on later pages]
Reflections: “The agent used a fixed range loop for pagination instead of properly iterating through all pages until no more results are returned. This caused incomplete
data collection.”
Response:
{
"reasoning": "The reflection identifies a pagination handling error where the agent used an arbitrary fixed range instead of proper pagination
logic. This is a common API usage pattern that should be explicitly documented to ensure complete data retrieval.",
"operations": [
{
"type": "ADD",
"section": "apis_to_use_for_specific_information",
"content": "About pagination: many APIs return items in \"pages\". Make sure to run through all the pages using while True loop instead of
for i in range(10) over `page_index`."
}
]
}
Your Task: Output ONLY a valid JSON object with these exact fields:
- reasoning: your chain of thought / reasoning / thinking process, detailed analysis and calculations
- operations: a list of operations to be performed on the playbook
  - type: the type of operation to be performed
  - section: the section to add the bullet to
  - content: the new content of the bullet
Available Operations:
1. ADD: Create new bullet points with fresh IDs
  - section: the section to add the new bullet to
  - content: the new content of the bullet. Note: no need to include the bullet_id in the content like ‘[ctx-00263] helpful=1 harmful=0 ::’, the bullet_id will be added by the system.
RESPONSE FORMAT - Output ONLY this JSON structure (no markdown, no code blocks):
{
"reasoning": "[Your chain of thought / reasoning / thinking process, detailed analysis and calculations here]",
"operations": [
{
"type": "ADD",
"section": "verification_checklist",
"content": "[New checklist item or API schema clarification...]"
}
]
}
Figure 11: ACE Curator prompt on AppWorld
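Because the Curator emits only incremental ADD operations, the surrounding system must merge them into the playbook and assign fresh bullet IDs (the prompt notes that the bullet_id is added by the system). The following is a minimal merge sketch under that assumption; the playbook representation and ID scheme are illustrative, loosely following the '[ctx-00263] helpful=1 harmful=0 ::' format mentioned in the prompt.

import json
from itertools import count

_next_bullet = count(1)

def apply_curator_output(playbook, curator_json):
    # Parse the Curator's JSON and append each ADD operation as a new bullet with a fresh ID.
    for op in json.loads(curator_json).get("operations", []):
        if op.get("type") != "ADD":
            continue  # this prompt only defines the ADD operation
        bullet = {"id": f"ctx-{next(_next_bullet):05d}", "helpful": 0, "harmful": 0,
                  "content": op["content"]}
        playbook.setdefault(op["section"], []).append(bullet)
    return playbook

curator_json = json.dumps({
    "reasoning": "[...]",
    "operations": [{"type": "ADD", "section": "strategies_and_hard_rules",
                    "content": "Always resolve identities from the Phone app's contacts."}],
})
playbook = apply_curator_output({}, curator_json)
assert playbook["strategies_and_hard_rules"][0]["id"] == "ctx-00001"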
You are an analysis expert tasked with answering questions using your knowledge, a curated playbook of strategies and insights and a
reflection that goes over the diagnosis of all previous mistakes made while answering the question.
Instructions:
- Read the playbook carefully and apply relevant strategies, formulas, and insights
- Pay attention to common mistakes listed in the playbook and avoid them
- Show your reasoning step-by-step
- Be concise but thorough in your analysis
- If the playbook contains relevant code snippets or formulas, use them appropriately
- Double-check your calculations and logic before providing the final answer
Your output should be a json object, which contains the following fields:
- reasoning: your chain of thought / reasoning / thinking process, detailed analysis and calculations
- bullet_ids: each line in the playbook has a bullet_id; include in this list the bullet_id of every bulletpoint in the playbook that was relevant and helpful for answering this question
- final_answer: your concise final answer
Playbook:
{}
Reflection:
{}
Question:
{}
Context:
{}
Answer in this exact JSON format:
{
"reasoning": "[Your chain of thought / reasoning / thinking process, detailed analysis and calculations]",
"bullet_ids": ["calc-00001", "fin-00002"],
"final_answer": "[Your concise final answer here]"
}
Figure 12: ACE Generator prompt on FINER
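The bullet_ids field links the Generator to the rest of the ACE loop: the bullets it cites are the ones the Reflector later tags. The sketch below shows that hand-off with hypothetical playbook content; the field names come from the prompts, but the data and wiring are illustrative only.

import json

# The Generator's JSON reply (field names follow the prompt; values here are placeholders).
generator_reply = json.loads("""{
  "reasoning": "[...]",
  "bullet_ids": ["calc-00001", "fin-00002"],
  "final_answer": "[concise final answer]"
}""")

playbook = {
    "calc-00001": "Convert percentages to decimals before compounding.",
    "fin-00002": "Report monetary answers in the units the question asks for.",
    "fin-00003": "Unused bullet that the generator did not cite.",
}

# Only the bullets the Generator actually cited are forwarded to the Reflector as the
# "Part of Playbook that's used by the generator to answer the question" input.
cited = {bid: playbook[bid] for bid in generator_reply["bullet_ids"] if bid in playbook}
assert set(cited) == {"calc-00001", "fin-00002"}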
You are an expert analyst and educator. Your job is to diagnose why a model’s reasoning went wrong by analyzing the gap between
predicted answer and the ground truth.
Instructions:
- Carefully analyze the model’s reasoning trace to identify where it went wrong
- Take the environment feedback into account, comparing the predicted answer with the ground truth to understand the gap
- Identify specific conceptual errors, calculation mistakes, or misapplied strategies
- Provide actionable insights that could help the model avoid this mistake in the future
- Focus on the root cause, not just surface-level errors
- Be specific about what the model should have done differently
- You will receive bulletpoints that are part of the playbook that’s used by the generator to answer the question
- You need to analyze these bulletpoints and give a tag for each bulletpoint; the tag can be [‘helpful’, ‘harmful’, ‘neutral’] (for the generator to generate the correct answer)
Your output should be a json object, which contains the following fields:
- reasoning: your chain of thought / reasoning / thinking process, detailed analysis and calculations
- error_identification: what specifically went wrong in the reasoning?
- root_cause_analysis: why did this error occur? What concept was misunderstood?
- correct_approach: what should the model have done instead?
- key_insight: what strategy, formula, or principle should be remembered to avoid this error?
- bullet_tags: a list of json objects with bullet_id and tag for each bulletpoint used by the generator
Question:
{}
Model’s Reasoning Trace:
{}
Model’s Predicted Answer:
{}
Ground Truth Answer:
{}
Environment Feedback:
{}
Part of Playbook that’s used by the generator to answer the question:
{}
Answer in this exact JSON format:
{
"reasoning": "[Your chain of thought / reasoning / thinking process, detailed analysis and calculations]",
"error_identification": "[What specifically went wrong in the reasoning?]",
"root_cause_analysis": "[Why did this error occur? What concept was misunderstood?]",
"correct_approach": "[What should the model have done instead?]",
"key_insight": "[What strategy, formula, or principle should be remembered to avoid this error?]",
"bullet_tags": [
{{"id": "calc-00001", "tag": "helpful"}},
{{"id": "fin-00002", "tag": "harmful"}}
]
}
Figure 13: ACE Reflector prompt on FINER
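The bullet_tags field is what lets ACE fold execution feedback back into the playbook: helpful and harmful tags can increment the per-bullet counters referenced elsewhere (e.g., helpful=1 harmful=0). The bookkeeping sketch below assumes an in-memory bullet store of our own devising; it is illustrative, not the paper's implementation.

def apply_bullet_tags(playbook_bullets, bullet_tags):
    # playbook_bullets: {bullet_id: {"helpful": int, "harmful": int, ...}}
    # bullet_tags: the Reflector's list of {"id": ..., "tag": "helpful" | "harmful" | "neutral"}
    for entry in bullet_tags:
        bullet = playbook_bullets.get(entry["id"])
        if bullet is None:
            continue  # the Reflector may reference a bullet that has since been removed
        if entry["tag"] in ("helpful", "harmful"):
            bullet[entry["tag"]] += 1  # neutral tags leave the counters unchanged
    return playbook_bullets

bullets = {"calc-00001": {"helpful": 0, "harmful": 0},
           "fin-00002": {"helpful": 0, "harmful": 0}}
tags = [{"id": "calc-00001", "tag": "helpful"}, {"id": "fin-00002", "tag": "harmful"}]
apply_bullet_tags(bullets, tags)
assert bullets == {"calc-00001": {"helpful": 1, "harmful": 0},
                   "fin-00002": {"helpful": 0, "harmful": 1}}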
You are a master curator of knowledge. Your job is to identify what new insights should be added to an existing playbook based on a
reflection from a previous attempt.
Context:
- The playbook you created will be used to help answering similar questions.
- The reflection is generated using ground truth answers that will NOT be available when the playbook is being used. So you need to come up with content that can aid the playbook user to create predictions that likely align with ground truth.
CRITICAL: You MUST respond with valid JSON only. Do not use markdown formatting or code blocks.
Instructions:
- Review the existing playbook and the reflection from the previous attempt
- Identify ONLY the NEW insights, strategies, or mistakes that are MISSING from the current playbook
- Avoid redundancy: if similar advice already exists, only add new content that is a perfect complement to the existing playbook
- Do NOT regenerate the entire playbook; only provide the additions needed
- Focus on quality over quantity: a focused, well-organized playbook is better than an exhaustive one
- Format your response as a PURE JSON object with specific sections
- For any operation, if there is no new content to add, return an empty list for the operations field
- Be concise and specific: each addition should be actionable
Training Context:
Total token budget: {token_budget} tokens
Training progress: Sample {current_step} out of {total_samples}
Current Playbook Stats:
{playbook_stats}
Recent Reflection:
{recent_reflection}
Current Playbook:
{current_playbook}
Question Context:
{question_context}
Your Task: Output ONLY a valid JSON object with these exact fields:
- reasoning: your chain of thought / reasoning / thinking process, detailed analysis and calculations
- operations: a list of operations to be performed on the playbook
  - type: the type of operation to be performed
  - section: the section to add the bullet to
  - content: the new content of the bullet
Available Operations:
1. ADD: Create new bullet points with fresh IDs
  - section: the section to add the new bullet to
  - content: the new content of the bullet. Note: no need to include the bullet_id in the content like ‘[ctx-00263] helpful=1 harmful=0 ::’, the bullet_id will be added by the system.
RESPONSE FORMAT - Output ONLY this JSON structure (no markdown, no code blocks):
{
"reasoning": "[Your chain of thought / reasoning / thinking process, detailed analysis and calculations here]",
"operations": [
{{
"type": "ADD",
"section": "formulas_and_calculations",
"content": "[New calculation method...]"
}}
]
}
Figure 14: ACE Curator prompt on FINER