Weaviate Context Engineering
Context Engineering
Designing the systems that control what information reaches the model and how it maintains coherence.
Table of Contents

Introduction
  What Is Context Engineering?
Agents
  What Are Agents?
  The Context Window Challenge
  Strategies and Tasks for Agents
  Where Agents Fit in Context Engineering
Query Augmentation
  Query Rewriting
  Query Expansion
  Query Decomposition
  Query Agents
Retrieval
  A Guide to Chunking Strategies
  Simple Chunking Strategies
  Advanced Chunking Strategies
  Pre-Chunking vs. Post-Chunking
  Summary
Prompting Techniques
  Classic Prompting Techniques
  Advanced Prompting Techniques
  Prompting for Tool Usage
  Using Prompt Frameworks
Memory
  The Architecture of Agent Memory
  Key Principles for Effective Memory Management
Tools
  The Evolution: From Prompts to Actions
  The Orchestration Challenge
  The Next Frontier of Tool Use
Summary
  The Future of AI Engineering
Introduction
Every developer who builds with Large Language Models (LLMs) eventually hits the same
wall. You start with a powerful model that can write, summarize, and reason with
stunning capability. But when you try to apply it to a real-world problem, the cracks start
to appear. It can't answer questions about your private documents. It has no knowledge
of events that happened yesterday. It confidently makes things up when it doesn't know
an answer.
[Figure: A context engineering system. The user's input reaches an agent that handles coordination and decision making, drawing on vector search (retrieval), RAG over databases, tools (including MCP and action tools), short-term memory (the chat history stored in the prompt), and long-term memory, before the prompt is updated with the answer and returned to the user.]

The problem isn't the model's intelligence. The problem is that it's fundamentally disconnected. It's a powerful but isolated brain, with no access to your specific data, the live internet, or even a memory of your last conversation. This isolation is a direct result of its core architectural limit: the context window. The context window is the model's active working memory, the finite space where it holds the instructions and information for the current task. Every word, number, and piece of punctuation consumes space in this window. Just like a whiteboard, once it's full, older information gets erased to make room for new instructions, and important details can be lost.

You can't fix this fundamental limitation by just writing better prompts. You have to build a system around the model. That is Context Engineering.

Context Engineering is the discipline of designing the architecture that feeds an LLM the right information at the right time. It's not about changing the model itself, but about building the bridges that connect it to the outside world: retrieving external data, connecting it to live tools, and giving it a memory to ground its responses in facts, not just its training data.
This ebook is the blueprint for that system. We will cover the core components required to turn a brilliant but isolated model into a reliable, production-ready application.

Mastering these components is the difference between a reasonable demo and a truly intelligent system. Let's get to work.
Context engineering is made up of six core components:

Agents: The decision-making brain that orchestrates how and when to use information.
Query Augmentation: The art of translating messy, ambiguous user requests into precise, machine-readable intent.
Retrieval: The bridge connecting the LLM to your specific documents and knowledge bases.
Prompting Techniques: The skill of giving clear, effective instructions to guide the model's reasoning.
Memory: The system that gives your application a sense of history and the ability to learn from interactions.
Tools: The hands that allow your application to take direct action and interact with live data sources.
Agents

As soon as you start building real systems with large language models, you run into the limits of static pipelines. A fixed recipe of "retrieve, then generate" works fine for simple Retrieval Augmented Generation (RAG) setups, but it falls apart once the task requires judgment, adaptation, or multi-step reasoning.

This is where agents come in. In the context of context engineering, agents manage how (and how well) information flows through a system. Instead of blindly following a script, agents can evaluate what they know, decide what they still need, select the right tools, and adjust their strategy when things go wrong.

What Are Agents?
The term "agent" gets used broadly, so let's define it in the context of building with large language models (LLMs). An AI agent is a system that can:

1. Make dynamic decisions about information flow. Rather than following a predetermined path, agents decide what to do next based on what they've learned.
2. Maintain state across multiple interactions. Unlike simple Q&A systems, agents remember what they've done and use that history to inform future decisions.
3. Use tools adaptively. They can select from available tools and combine them in ways that weren't explicitly programmed.
4. Modify their approach based on results. When one strategy isn't working, they can try different approaches.

[Figure: A typical agent flow. A user query is first checked against memory ("Answered before?"). If not, the agent decides whether additional information is required, decomposes the query into sub-queries, routes and processes each sub-query with search and memory tools, checks whether the retrieved information is relevant, and then generates a response. Legend: reasoning, tool use, memory.]

Agents are both the architects of their contexts and the users of those contexts. However, they need good practices and systems to guide them, because managing context well is difficult, and getting it wrong quickly sabotages everything else the agent can do.

Single-Agent Architecture: A single agent attempts to handle all tasks itself, which works well for moderately complex workflows.
Multi-Agent Architecture: Work is distributed across specialized agents per task. This allows for complex workflows but introduces coordination challenges.
The Context Window Challenge

LLMs have limited information capacity because the context window can only hold so much information at once. This fundamental constraint shapes what agents and agentic systems are currently capable of.

Every time an agent is processing information, it needs to make decisions about:

What information should remain active in the context window
What should be stored externally and retrieved when needed
What can be summarized or compressed to save space
How much space to reserve for reasoning and planning

It's tempting to assume that bigger context windows solve this problem, but this is simply not the case. Longer contexts (hundreds of thousands or even ~1M tokens) actually introduce new failure modes. Performance often begins to degrade far before the model reaches maximum token capacity, where agents become confused, hallucinate at higher rates, or simply stop performing at the level they're normally capable of. This isn't just a technical limitation; it's a core design challenge of any AI app.

Context Hygiene
This is one of the most critical parts of managing agentic systems. Agents don't just need memory and tools; they also need to monitor and manage the quality of their own context. That means avoiding overload, detecting irrelevant or conflicting information, pruning or compressing as needed, and keeping their in-context memory clean enough to reason effectively.

Here are some common types of errors that begin to happen or increase as context window size grows:

Context Poisoning: Incorrect or hallucinated information enters the context. Because agents reuse and build upon that context, these errors persist and compound.
Context Distraction: The agent becomes burdened by too much past information (history, tool outputs, summaries) and over-relies on repeating past behavior rather than reasoning fresh.
Context Confusion: Irrelevant tools or documents crowd the context, distracting the model and causing it to use the wrong tool or instructions.
Context Clash: Contradictory information within the context misleads the agent, leaving it stuck between conflicting assumptions.
Strategies and Tasks for Agents

Agents are able to effectively orchestrate context systems because of their ability to reason and make decisions in a dynamic way. Here are some of the most common tasks agents are built for and employ to manage contexts.

Context Summarization: Periodically compressing accumulated history into summaries to reduce burden while preserving key knowledge.
Context Pruning: Actively removing irrelevant or outdated context, either with specialized pruning models or a dedicated LLM tool.
Context Offloading: Storing details externally and retrieving them only when needed, instead of keeping everything in active context.
Adaptive Retrieval Strategies: Reformulating queries, switching knowledge bases, or changing chunking strategies when initial attempts fail.
Dynamic Tool Selection: Instead of dumping every possible tool into the prompt, agents filter and load only those relevant to the task. This reduces confusion and improves accuracy.
Multi-Source Synthesis: Combining information from multiple sources, resolving conflicts, and producing coherent answers.
Quality Validation: Checking whether retrieved information is consistent and useful.

Where Agents Fit in Context Engineering
Agents serve as coordinators in your context engineering system. They don't replace the techniques covered in other sections; instead, they orchestrate them intelligently. An agent might apply query rewriting when initial searches are unsuccessful, choose different chunking strategies based on the type of content it encounters, or decide when conversation history should be compressed to make room for new information. They provide the orchestration layer needed to make dynamic, context-appropriate decisions about information management.

Different types of agents and functions within a context engineering system:

[Figure: A supervisor agent plans, routes requests to specialized agents (query rewriter, retriever, answer synthesizer, data collection selector, tool router, compressor), changes strategy on failure, and sends context for synthesis of the final response. These specialized agents draw on memory (short-term working memory, long-term episodic and factual stores in a vector database) and external knowledge sources (vector database knowledge collections, tools and APIs, web and search APIs).]
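To make a couple of these tasks concrete, here is a minimal sketch of context summarization and offloading under a token budget. It assumes a hypothetical `complete()` helper that calls whatever LLM you use, and it estimates tokens as characters divided by four rather than using a real tokenizer.

```python
# Minimal sketch of context summarization + offloading under a token budget.
# `complete` is a hypothetical LLM helper; token counts are estimated as len(text) // 4.

def estimate_tokens(text: str) -> int:
    return len(text) // 4                      # rough heuristic, not a real tokenizer

class ContextManager:
    def __init__(self, complete, budget_tokens: int = 4000, keep_recent: int = 6):
        self.complete = complete
        self.budget = budget_tokens
        self.keep_recent = keep_recent
        self.summary = ""                      # compressed view of older turns
        self.messages: list[str] = []          # active, verbatim history
        self.archive: list[str] = []           # offloaded turns (could live in a vector DB)

    def add(self, message: str) -> None:
        self.messages.append(message)
        if estimate_tokens(self.summary + "".join(self.messages)) > self.budget:
            self._compress()

    def _compress(self) -> None:
        old = self.messages[:-self.keep_recent]
        if not old:                            # nothing old enough to compress yet
            return
        self.messages = self.messages[-self.keep_recent:]
        self.archive.extend(old)               # offload full text for later retrieval
        self.summary = self.complete(          # summarize old turns to reclaim space
            "Summarize the key facts and decisions so far:\n"
            + self.summary + "\n" + "\n".join(old)
        )

    def context(self) -> str:
        return ("Summary of earlier conversation:\n" + self.summary
                + "\n\n" + "\n".join(self.messages))
```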
Query Augmentation

One of the most important steps of context engineering is how you prepare and present the user's query. Without knowing exactly what the user is asking, the LLM cannot provide an accurate response.

Though this sounds simple, it's actually quite complex. There are two main issues to think about:

1. Users often don't interact with chatbots or inputs in the ideal way. Many product builders develop and test chatbots with queries that state the request and all the additional information the LLM would need to understand the question in a succinct, perfectly punctuated, clear way. Unfortunately, in the real world, user interactions with chatbots can be unclear, messy, and incomplete. To build robust systems, it's important to implement solutions that deal with all types of interactions, not just ideal ones.

2. Different parts of the pipeline need to deal with the query in different ways. A question that an LLM could understand well might not be the best format to search a vector database with. Or, a query term that works best for a vector database could be incomplete for an LLM to answer. Therefore, we need a way to augment the query that suits different tools and steps within the pipeline.

Remember that query augmentation addresses the "garbage in, garbage out" problem at the very start of your pipeline. No amount of sophisticated retrieval algorithms, advanced reranking models, or clever prompt engineering can fully compensate for misunderstood user intent.

Query Rewriting

Query rewriting transforms the original user query into a more effective version for information retrieval. Instead of just doing retrieve-then-read, applications now follow a rewrite-retrieve-read approach. This technique restructures oddly written questions so they can be better understood by the system, removes irrelevant context, introduces common keywords that improve matching with correct context, and can split complex questions into simpler sub-questions.

[Figure: The raw query "How do i make this work when my api call keeps failing?" is passed to a query re-writer (LLM), which produces the rewritten query "API call failure, troubleshooting authentication headers, rate limiting, network timeout, 500 error".]

RAG applications are sensitive to the phrasing and specific keywords of the query, so this technique works by:

Restructuring Unclear Questions: Transforms vague or poorly formed user input into precise, information-dense terms.
Context Removal: Eliminates irrelevant information that could confuse the retrieval process.
Keyword Enhancement: Introduces common terminology that increases the likelihood of matching relevant documents.
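As a rough illustration, here is a minimal sketch of the rewrite-retrieve-read pattern. The `complete` and `search` helpers are hypothetical stand-ins for your LLM call and your retrieval system, and the prompt wording is illustrative.

```python
# Minimal sketch of rewrite-retrieve-read (assumes a hypothetical `complete()`
# helper that sends a prompt to whatever LLM you use and returns its text).

REWRITE_PROMPT = """You rewrite user questions for a document search engine.
Remove filler, fix typos, and add likely keywords.
Return only the rewritten query.

User question: {question}
Rewritten query:"""

def rewrite_query(question: str, complete) -> str:
    """Turn a messy user question into a retrieval-friendly query."""
    return complete(REWRITE_PROMPT.format(question=question)).strip()

def rewrite_retrieve_read(question: str, complete, search) -> str:
    """Rewrite -> retrieve -> read, instead of retrieving on the raw question."""
    query = rewrite_query(question, complete)          # rewrite
    docs = search(query, top_k=5)                      # retrieve with the cleaner query
    context = "\n\n".join(docs)
    answer_prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return complete(answer_prompt)                     # read / generate
```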
Query Expansion

Query expansion enhances retrieval by generating multiple related queries from a single user input. This approach improves results when user queries are vague, poorly formed, or when you need broader coverage, such as with keyword-based retrieval systems.

[Figure: The raw query "Open source NLP tools" is passed to a query re-writer (LLM), which expands it into related queries such as "Natural language processing tools", "Free nlp libraries", "Open source language processing platforms", and "NLP software with open source code", all of which feed the context window.]

However, query expansion comes with challenges that need careful management:

Query Drift: Expanded queries may diverge from the user's original intent, leading to irrelevant or off-topic results.
Over-Expansion: Adding too many terms can reduce precision and retrieve excessive irrelevant documents.
Computational Overhead: Processing multiple queries increases system latency and resource usage.
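Here is a minimal sketch of query expansion under the same assumptions (hypothetical `complete` and `search` helpers). It keeps the original query, caps the number of variants to limit the overhead described above, and de-duplicates the merged results.

```python
# Minimal sketch of query expansion (assumes hypothetical `complete()` and `search()`
# helpers; de-duplication is naive on purpose).

EXPANSION_PROMPT = """Generate {n} alternative search queries for the question below.
Use synonyms and related terminology. One query per line.

Question: {question}"""

def expand_query(question: str, complete, n: int = 4) -> list[str]:
    lines = complete(EXPANSION_PROMPT.format(question=question, n=n)).splitlines()
    queries = [line.strip("-• ").strip() for line in lines if line.strip()]
    return [question] + queries[:n]              # always keep the original query

def expanded_search(question: str, complete, search, top_k: int = 3) -> list[str]:
    seen, results = set(), []
    for query in expand_query(question, complete):
        for doc in search(query, top_k=top_k):   # run every variant against the index
            if doc not in seen:                  # merge results, drop duplicates
                seen.add(doc)
                results.append(doc)
    return results
```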
Query Decomposition

Query decomposition breaks down complex, multi-faceted questions into simpler, focused sub-queries that can be processed independently. This technique is especially good for questions that require information from multiple sources or involve several related concepts.

The process typically involves two main stages:

Decomposition Phase: An LLM analyzes the original complex query and breaks it into smaller, focused sub-queries. Each sub-query targets a specific aspect of the original question.
Processing Phase: Each sub-query is processed independently through the retrieval pipeline, allowing for more precise matching with relevant documents.

After retrieval, the context engineering system must aggregate and synthesize results from all sub-queries to generate a coherent, comprehensive answer to the original complex query.
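Here is a minimal sketch of that decompose-retrieve-synthesize flow, again assuming hypothetical `complete` and `search` helpers.

```python
# Minimal sketch of query decomposition (same hypothetical `complete()` and
# `search()` helpers as above; synthesis is a single final LLM call).

DECOMPOSE_PROMPT = """Break the complex question below into 2-4 simpler,
self-contained sub-questions, one per line.

Question: {question}"""

def decompose(question: str, complete) -> list[str]:
    lines = complete(DECOMPOSE_PROMPT.format(question=question)).splitlines()
    return [line.strip("-• ").strip() for line in lines if line.strip()]

def answer_complex_question(question: str, complete, search) -> str:
    findings = []
    for sub in decompose(question, complete):        # decomposition phase
        docs = search(sub, top_k=3)                  # processing phase, per sub-query
        findings.append(f"Sub-question: {sub}\n" + "\n".join(docs))
    synthesis_prompt = (
        "Using the findings below, answer the original question.\n\n"
        + "\n\n".join(findings)
        + f"\n\nOriginal question: {question}"
    )
    return complete(synthesis_prompt)                # aggregate and synthesize
```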
Query Agents

Query agents are the most advanced form of query augmentation, using AI agents to intelligently handle the entire query processing pipeline and combining the techniques above. A query agent takes a user's prompt or question in natural language and decides the best way to structure the query based on its knowledge of the database and data structure, and it can iteratively decide to re-query and adjust based on the results returned.

Analysis: Use generative models (e.g. large language models) to analyze the task and the required queries. Determine the exact queries to perform.

Dynamic Query Construction: Rather than using predetermined query patterns, the agent constructs queries on demand based on understanding both the user intent and the data schema. This means it can add filters and adjust search terms automatically to find the most relevant results in the database, as well as choosing to run searches, aggregations, or even both at the same time for you.

Query Execution: Formulates and sends queries to the agent's chosen collection or collections.

Multi-Collection Routing: The agent understands the structure of all of your collections, so it can intelligently decide which data collections to query based on the user's question.

Evaluation: The agent can evaluate the retrieved information within the context of the original user query. If there is missing information, the agent can try a different knowledge source or a new query.

(Optional) Response Generation: Receive the results from the database, and use a generative model to generate the final response to the user's prompt or query.

Contextual Awareness: The context may also include previous conversation history and any other relevant information. The agent can maintain conversation context for follow-up questions.

[Figure: The query agent loop. A user query is analyzed, a query is constructed (search, aggregation, or both), a collection is chosen (e.g. Collection A or Collection B in the vector database), and the query is executed. If the retrieved information isn't relevant, the agent re-queries; once it is, the agent either finalizes the context or generates a text response.]
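The sketch below shows the shape of such a loop. It is not the Weaviate Query Agent API; `pick_collection`, `run_query`, `is_sufficient`, and `complete` are hypothetical stand-ins for the routing, execution, evaluation, and generation steps described above.

```python
# Minimal sketch of an iterative query-agent loop with hypothetical helpers.

def query_agent(question: str, complete, pick_collection, run_query,
                is_sufficient, max_rounds: int = 3) -> str:
    context, query = [], question
    for _ in range(max_rounds):
        collection = pick_collection(query)          # multi-collection routing
        results = run_query(collection, query)       # dynamic query construction + execution
        context.extend(results)
        if is_sufficient(question, context):         # evaluation against the original question
            break
        # not enough information yet: ask the LLM for a refined follow-up query
        query = complete(
            f"The question '{question}' is not fully answered by:\n"
            + "\n".join(context) + "\nWrite one follow-up search query."
        ).strip()
    return complete(                                 # optional response generation
        "Answer the question using the retrieved context.\n\n"
        + "\n".join(context) + f"\n\nQuestion: {question}"
    )
```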
Retrieval

A Large Language Model is only as good as the information it can access. While LLMs are trained on massive datasets, they lack knowledge of your specific, private documents and any information created after their training was completed. To build truly intelligent applications, you need to feed them the right external information at the right time. This process is called Retrieval. Pre-Retrieval and Retrieval steps make up the first parts of many AI architectures that rely on context engineering, such as Retrieval Augmented Generation (RAG).

[Figure: A RAG pipeline. In the pre-retrieval stage, documents are split into chunks and embedded into a vector database. At query time, the query is embedded, relevant chunks (e.g. Doc 1, Chunk 1; Doc 2, Chunks 1-2; Doc 3, Chunk 1) are retrieved and placed into the context window alongside the system prompt ("You are a helpful AI assistant...") and the user query ("What is a vector database?"), and the LLM generates the response from the assembled prompt.]

A Guide to Chunking Techniques

The challenge is simple in concept but tricky in practice: a raw dataset of documents is almost always too large to fit into an LLM's limited context window (the inputs given to an AI model). We can't just hand the model an entire set of user manuals or research papers. Instead, we must find the perfect piece of those documents, the single paragraph or section that contains the answer to a user's query.

To make our vast knowledge bases searchable and find that perfect piece, we must first break our documents down into smaller, manageable parts. This foundational process, known as chunking, is the key to successful retrieval.

Chunking is the most important decision you will make for your retrieval system's performance. It is the process of breaking down large documents into smaller, manageable pieces. Get it right, and your system will be able to pinpoint relevant facts with surgical precision. Get it wrong, and even the most advanced LLM will fail.

Learn how chunking strategies can help improve your RAG performance and explore different chunking methods. Read the complete blog post here: weaviate.io/blog/chunking-strategies-for-rag
When designing your chunking strategy, you must balance two competing priorities:

Retrieval Precision: Chunks need to be small and focused on a single idea. This creates a distinct, precise embedding, making it easier for a vector search system to find an exact match for a user's query. Large chunks that mix multiple topics create "averaged," noisy embeddings that are hard to retrieve accurately.
Contextual Richness: Chunks must be large and self-contained enough to be understood. After a chunk is retrieved, it is passed to the LLM. If the chunk is just an isolated sentence without context, even a powerful model will struggle to generate a meaningful response.

The Chunking Strategy Matrix

[Figure: A two-by-two matrix of retrieval precision against contextual richness.]
Precise but Incomplete: Overly small chunks (e.g., single sentences) that are easy to find but lack the context for the LLM to generate a good response.
The Sweet Spot (Optimal Chunks): Semantically complete paragraphs that are focused enough to be found and rich enough to be understood.
Rich but Unfindable: Oversized chunks that contain the answer but have "noisy" embeddings, making them impossible for the retrieval system to find accurately.
The Failure Zone: Poorly constructed, random chunks that are neither findable nor useful, the worst of both worlds.

The goal is to find the "chunking sweet spot": creating chunks that are small enough for precise retrieval but complete enough to give the LLM the full context it needs. Your choice of strategy will depend on the nature of your documents and the needs of your application.

Simple Chunking Techniques

Fixed-Size Chunking: The simplest method. The text is split into chunks of a predetermined size (e.g., 512 tokens). It's fast and easy but can awkwardly cut sentences in half. Using an overlap (e.g., 50 tokens) between chunks helps mitigate this.

Recursive Chunking: A more intelligent approach that splits text using a prioritized list of separators (like paragraphs, then sentences, then words). It respects the document's natural structure and is a solid default choice for unstructured text.
1. Define a hierarchy of separators (e.g., paragraphs → sentences → words)
2. Split the text using the highest-level separator
3. If chunks are too big, split again using the next separator
4. Repeat until all chunks fit within the desired size while preserving meaning

Document-Based Chunking: This method uses the document's inherent structure. For example, it splits a Markdown file by its headings (#, ##), an HTML file by its tags (<p>, <div>), or source code by its functions.
1. Identify logical document boundaries (e.g., chapters, sections, headings)
2. Group content under each boundary into cohesive units
3. Vectorize each unit as a standalone chunk
4. Store chunks with metadata linking them to their source document and section

A minimal code sketch of the fixed-size and recursive approaches follows below.
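Here is that sketch of the two simplest approaches: fixed-size chunking with overlap and recursive, separator-based splitting. Sizes are counted in characters for simplicity (real systems usually count tokens), and the separator hierarchy is an assumption you would tune per document type.

```python
# Minimal sketch: fixed-size chunking with overlap, and recursive splitting.
# Sizes are in characters for simplicity; real systems usually count tokens.

def fixed_size_chunks(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Slice text into windows of `size` characters, overlapping by `overlap`."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap                     # step back by `overlap` to soften hard cuts
    return chunks

def recursive_chunks(text: str, size: int = 512,
                     separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the highest-level separator present; recurse on pieces still too big."""
    if len(text) <= size:
        return [text]
    for sep in separators:
        if sep not in text:
            continue
        pieces = text.split(sep)
        chunks, current = [], ""
        for piece in pieces:
            candidate = (current + sep + piece) if current else piece
            if len(candidate) <= size:
                current = candidate                 # keep packing pieces into this chunk
                continue
            if current:
                chunks.append(current)
                current = ""
            if len(piece) <= size:
                current = piece
            else:                                   # piece itself too big: use finer separators
                chunks.extend(recursive_chunks(piece, size, separators))
        if current:
            chunks.append(current)
        return chunks
    return fixed_size_chunks(text, size)            # no separator found: fall back
```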
Advanced Chunking Techniques

Semantic Chunking: Instead of using separators, this technique splits text based on meaning. It groups semantically related sentences together and creates a new chunk only when the topic shifts, resulting in highly coherent, self-contained chunks.
1. Split text into sentences or paragraphs
2. Vectorize windows of sentences
3. Calculate the cosine distance between all pairs
4. Merge until a breakpoint is reached

Hierarchical Chunking: Creates multiple layers of chunks at different levels of detail (e.g., top-level summaries, mid-level sections, and granular paragraphs). This allows a retrieval system to start with a broad overview and then drill down into specifics as needed.

LLM-Based Chunking: Uses a Large Language Model to intelligently process a document and generate semantically coherent chunks. Instead of relying on fixed rules, the LLM can identify logical propositions or summarize sections to create meaning-preserving pieces.

Agentic Chunking: This takes the concept a step further than LLM-Based Chunking. An AI agent dynamically analyzes a document's structure and content to select the best chunking strategy (or combination of strategies) to apply for that specific document.

[Figure: An agent analyzes each document's format and content (HTML, PDF, Docx, Markdown) and selects document-based, fixed-size, semantic, or hybrid chunking to produce optimized chunks.]

Late Chunking: An architectural pattern that inverts the standard process. It embeds the entire document first to create token-level embeddings with full context. Only then is the document split into chunks, with each chunk's embedding derived from these pre-computed, context-rich tokens.
1. Embed the entire document using a long-context model to generate token-level embeddings.
2. Chunk the token embeddings (instead of the raw text).
3. Preserve context: because the embeddings were created with full document context, each token preserves its relationship to tokens in neighboring chunks.
4. Pool strategically: instead of pooling all tokens into one vector, late chunking pools tokens according to your chunking strategy to get multiple contextually-aware embeddings per document.

A minimal code sketch of the semantic approach follows below.
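Here is the promised sketch of semantic chunking. It compares adjacent sentences rather than windowed pairs, which is a common simplification; the `embed` helper and the 0.35 breakpoint threshold are assumptions you would replace with your own embedding model and tuning.

```python
# Minimal sketch of semantic chunking (the `embed()` helper is a hypothetical
# sentence-embedding function returning a list of floats; thresholds are illustrative).
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def semantic_chunks(sentences: list[str], embed, breakpoint: float = 0.35) -> list[str]:
    """Group consecutive sentences; start a new chunk when the topic shifts."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]              # vectorize each sentence
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # a large distance between neighbouring sentences marks a topic boundary
        if cosine_distance(vectors[i - 1], vectors[i]) > breakpoint:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```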
Pre-Chunking vs. Post-Chunking

Beyond how you chunk, a key system design choice is when you chunk. This decision leads to two primary architectural patterns.

Pre-Chunking
The most common method, where all data processing happens up front and offline, before any user queries come in.
Workflow: Clean Data -> Chunk Documents -> Embed & Store Chunks
[Figure: Documents are cleaned (headers, footers, special characters removed), split into smaller chunks (e.g., 500 tokens per chunk, semantic chunks, hierarchical chunks), embedded, and stored in a vector database. At query time, chunks are retrieved by semantic similarity and passed to the LLM via an augmented prompt template for generation.]
PRO: Retrieval is extremely fast at query time because all the work has already been done. The system only needs to perform a quick similarity search.
CON: The chunking strategy is fixed. If you decide to change your chunk size or method, you must re-process your entire dataset.

Post-Chunking
An advanced, real-time alternative where chunking happens after a document has been retrieved, in direct response to a user's query.
Workflow: Store Documents -> Retrieve Relevant Document -> Chunk Dynamically
[Figure: Full documents are cleaned, embedded, and stored. At query time, full documents are retrieved by semantic similarity, then chunked and reranked before being added to the LLM context window for generation.]
PRO: It's highly flexible. You can create dynamic chunking strategies that are specific to the context of the user's query, potentially leading to more relevant results.
CON: It adds latency. The chunking process happens in real time, making the first response slower for the end user. It also requires more complex infrastructure to manage.

We built a post-chunking strategy into Elysia, our open source agentic RAG framework. You can read more about that here: https://weaviate.io/blog/elysia-agentic-rag#chunk-on-demand-smarter-document-processing
Guide to Choosing Your Chunking Strategy

Fixed-Size
How it works: Splits by token or character count.
Best for: Small or simple docs, or when speed matters most.
Examples: Meeting notes, short blog posts, emails, simple FAQs.

Recursive
How it works: Splits text by repeatedly dividing it until it fits the desired chunk size, often preserving some structure.
Best for: Documents where some structure should be maintained but speed is still important.
Examples: Research articles, product guides, short reports.

Document-Based
How it works: Splits only at document boundaries or by structural elements like headers.
Best for: Collections of short, standalone documents or highly structured files.
Examples: News articles, customer support tickets, Markdown files.

Semantic
How it works: Splits text at natural meaning boundaries (topics, ideas).
Best for: Technical, academic, or narrative documents where topics shift without clear separators.
Examples: Scientific papers, textbooks, novels, whitepapers.

LLM-Based
How it works: Uses a language model to decide chunk boundaries based on context and meaning.
Best for: Complex text where meaning-aware chunking improves downstream tasks like Q&A.
Examples: Long reports, legal opinions, medical records.

Agentic
How it works: Lets an AI agent decide how to split based on meaning and structure.
Best for: Complex, nuanced documents that require custom strategies.
Examples: Regulatory filings, multi-section contracts, corporate policies.

Late Chunking
How it works: Embeds the whole document first, then derives chunk embeddings from it.
Best for: Use cases where chunks need awareness of the full document's context.
Examples: Case studies, comprehensive manuals, long-form analysis reports.

Hierarchical
How it works: Breaks text into multiple levels (sections → paragraphs → sentences).
Best for: Large, structured documents where both summary and detail are needed.
Examples: Employee handbooks, government regulations, software documentation.

Summary

The effectiveness of your Retrieval Augmentation system is not determined by a single "magic bullet," but by a series of deliberate engineering choices. The quality of the context you provide to an LLM is a direct result of two key decisions:

The Chunking Strategy (the "how"): The method you choose to break down your documents.
The Architectural Pattern (the "when"): The point at which you perform the chunking.

Mastering these two elements is fundamental to context engineering. A well-designed retrieval system is the difference between an LLM that guesses and one that provides fact-based, reliable, and contextually relevant answers.
Prompting Techniques

Prompt engineering is the practice of designing, refining, and optimizing inputs (prompts) given to Large Language Models (LLMs) to get your desired output. The quality and effectiveness of LLMs are heavily influenced by the prompts they receive, and the way you phrase a prompt can directly affect the accuracy, usefulness, and clarity of the response.

It's essentially about interacting with AI efficiently: giving it instructions, examples, or questions that guide the model toward the output you need.

In this section, we'll go over prompting techniques that are essential for improving Retrieval-Augmented Generation (RAG) applications and overall LLM performance.

Important Note: Prompt engineering focuses on how you phrase instructions for the LLM. Context engineering, on the other hand, is about structuring the information and knowledge you provide to the model, such as retrieved documents, user history, or domain-specific data, to maximize the model's understanding and relevance. Many of the techniques below (CoT, Few-shot, ToT, ReAct) are most effective when combined with well-engineered context.

Classic Prompting Techniques

Chain of Thought
This technique involves asking the model to "think step-by-step" and break down complex reasoning into intermediate steps. This is especially helpful when retrieved documents are dense or contain conflicting information that requires careful analysis. By verbalizing its reasoning process, the LLM can arrive at more accurate and logical conclusions.

Few-Shot Prompting
This approach provides the LLM with a few examples in the context window that demonstrate the type of output or "golden" answers you want. Showing examples helps the model understand the desired format, style, or reasoning approach, improving response accuracy and relevance, especially for specialized or technical domains.

Combining CoT and Few-shot examples is a powerful way to guide both the model's reasoning process and its output format for maximum efficiency.

Pro Tip #1: Make the model's reasoning in Chain of Thought very specific to your use case. For example, you might ask the model to:
Evaluate the environment
Repeat any relevant information
Explain the importance of this information to the current request

Pro Tip #2: Maximize efficiency and reduce token count by asking the model to reason in a "draft" form, using no more than 5 words per sentence. This makes sure that the model's thought process is visible while reducing output token count.
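To make the combination concrete, here is a minimal sketch of a prompt that pairs few-shot examples with a chain-of-thought instruction. The support-ticket domain, the example answers, and the `complete` helper are invented for illustration.

```python
# Minimal sketch: a few-shot prompt with an explicit chain-of-thought instruction.
# The support-ticket domain and example answers are illustrative, not from a real system.

FEW_SHOT_COT_PROMPT = """You classify customer support tickets as 'billing', 'bug', or 'other'.
Think step by step before answering, then give the label on its own line.

Ticket: "I was charged twice for my subscription this month."
Reasoning: The user mentions a charge and a subscription, which concerns payment.
Label: billing

Ticket: "The export button does nothing when I click it."
Reasoning: A feature is not behaving as expected, which is a product defect.
Label: bug

Ticket: "{ticket}"
Reasoning:"""

def classify_ticket(ticket: str, complete) -> str:
    """`complete` is a hypothetical helper that sends the prompt to your LLM."""
    output = complete(FEW_SHOT_COT_PROMPT.format(ticket=ticket))
    # the label is the last non-empty line, after the model's visible reasoning
    return output.strip().splitlines()[-1].replace("Label:", "").strip()
```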
Advanced Prompting Strategies

Building on classic techniques, advanced strategies guide LLMs in more sophisticated ways:

Tree of Thoughts (ToT): ToT builds on CoT by instructing the model to explore and evaluate multiple reasoning paths in parallel, much like a decision tree. The model can generate several different solutions to a problem and choose the best result. This is especially useful in RAG when there are many potential pieces of evidence and the model needs to weigh different possible answers based on multiple retrieved documents.

ReAct Prompting: This framework combines CoT with agents, enabling the model to "Reason" (think) and "Act" dynamically. The model generates both reasoning traces and actions in an interleaved manner, allowing it to interact with external tools or data sources and adjust its reasoning iteratively. ReAct can improve RAG pipelines by enabling LLMs to interact with retrieved documents in real time, updating reasoning and actions based on external knowledge to give more accurate and relevant responses.

Prompting for Tool Usage

When your LLM interacts with external tools, clear prompting ensures correct tool selection and usage.

Defining Parameters and Execution Conditions
LLMs can sometimes make incorrect tool selections or use tools in suboptimal ways. To prevent this, prompts should clearly define:

When to use a tool: Specify scenarios or conditions that trigger a particular tool.
How to use a tool: Provide expected inputs, parameters, and desired outputs.
Examples: Include few-shot examples showcasing correct tool selection and usage for various queries. For instance:
User Query: "What's the weather like in Paris?" -> Use Weather_API with city="Paris"
User Query: "Find me a restaurant near the Eiffel Tower." -> Use Restaurant_Search_Tool with location="Eiffel Tower"

This very precise guidance, which should be included as part of your overall tool description, helps the LLM understand the exact boundaries and functionality of each available tool, minimizing errors and improving overall system reliability.

Pro Tip: How to Write an Effective Tool Description
The LLM's decision to use your tool depends entirely on its description. Make it count:

Use an Active Verb: Start with a clear action. get_current_weather is better than weather_data.
Be Specific About Inputs: Clearly state what arguments the tool expects and their format (e.g., city (string), date (string, YYYY-MM-DD)).
Describe the Output: Tell the model what to expect in return (e.g., returns a JSON object with "high", "low", and "conditions").
Mention Limitations: If the tool only works for a specific region or time frame, say so (e.g., Note: Only works for cities in the USA).
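Putting those tips together, here is a minimal sketch of a tool definition in the JSON-schema style that most function-calling APIs accept. The exact wrapper fields vary by provider, so treat the structure as illustrative rather than any specific vendor's format.

```python
# Minimal sketch of a tool definition following the pro tips above.
# The JSON-schema layout is typical of function-calling APIs, but field names
# vary by provider; check your model's documentation for the exact format.

get_current_weather = {
    "name": "get_current_weather",                       # active verb, not "weather_data"
    "description": (
        "Get the current weather for a city. "
        "Returns a JSON object with 'high', 'low', and 'conditions'. "
        "Note: only works for cities in the USA."         # limitations stated up front
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Austin'"},
            "date": {"type": "string", "description": "Date in YYYY-MM-DD format"},
        },
        "required": ["city"],
    },
}
```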
Using Prompt Frameworks

If you are building a project that requires extensive prompting or want to systematically improve your LLM results, you could consider using frameworks like DSPy, Llama Prompt Ops, or Synalinks.

That said, you don't necessarily need to use a framework. Following the prompting guidelines outlined here (clear instructions, Chain of Thought, Few-shot Learning, and advanced strategies) can achieve highly effective results without additional frameworks. Think of these frameworks as optional helpers for complex projects, not a requirement for everyday prompt engineering.
Memory

When you're building agents, memory isn't just a bonus feature; it's the very thing that breathes life into them. Without it, an LLM is just a powerful but stateless text processor that responds to one query at a time with no sense of history. Memory transforms these models into something that feels more dynamic and, dare we say, more 'human': capable of holding onto context, learning from the past, and adapting on the fly.

Andrej Karpathy gave us the perfect analogy when he compared an LLM's context window to a computer's RAM and the model itself to the CPU. In this view, the context window is the agent's active consciousness, where all its "working thoughts" are held. But just like a laptop with too many browser tabs open, this RAM can fill up fast. Every message, every tool output, every piece of information consumes precious tokens.

[Figure: The LLM as a new kind of computer: the model is the CPU, the context window is the RAM, and peripherals include software 1.0 tools (calculator, Python interpreter, terminal), disk (file system plus embeddings), external storage (a vector database with episodic, semantic, and procedural memory), other LLMs, and audio/video I/O devices. Source: Andrej Karpathy, "Software Is Changing (Again)".]

This is where context engineering becomes an art. The goal isn't to shove more data into the prompt but to design systems that make the most of the active context window: keeping essential information within reach while gracefully offloading everything else into smarter, more persistent storage.

Context Offloading is the practice of storing information outside the LLM's active context window, often in external tools or vector databases. This frees up the limited token space so that only the most relevant info stays in context.

The Architecture of Agent Memory

Memory in an AI agent is all about retaining information to navigate changing tasks, remember what worked (or didn't), and think ahead. To build robust agents, we need to think in layers, often blending different types of memory for the best results.

Short-Term Memory
Short-term memory is the agent's immediate workspace. It's the "now," stuffed into the context window to fuel on-the-fly decisions and reasoning. This is powered by in-context learning, where you pack recent conversations, actions, or data directly into the prompt. Because it's constrained by the model's token limit, the main challenge is efficiency. The trick is to keep this streamlined to reduce costs and latency without missing any details that might be important for the next processing steps.

Long-Term Memory
Long-term memory moves past the immediate context window, storing information externally for quick retrieval when needed. This is what allows an agent to build a persistent understanding of its world and its users over time. It's commonly powered by Retrieval-Augmented Generation (RAG), where the agent queries an external knowledge base (like a vector database) to pull in relevant information. This memory can store different kinds of information: for example, episodic memory that stores specific events or past interactions, or semantic memory that holds general knowledge and facts. This could also be information from company documents, product manuals, or a curated domain-knowledge base, allowing the agent to answer questions with factual accuracy.
Hybrid Memory Setup

In reality, most modern systems use a hybrid approach, blending short-term memory for speed with long-term memory for depth. Some advanced architectures even introduce additional layers:

Working Memory: A temporary holding area for information related to a specific, multi-step task. For example, if an agent is booking a trip, its working memory might hold the destination, dates, and budget until the task is complete, without cluttering the long-term store.
Procedural Memory: This helps an agent learn and master routines. By observing successful workflows, the agent can internalize a sequence of steps for a recurring task, making it faster and more reliable over time.

[Figure: A flight-booking example. Short-term memory is the immediate reasoning space, bounded by the context limit: the query "Book me a flight to Tokyo in December" leads to a thought ("Need to check budget, dates, preferences"), a task-state step ("Store the task-specific state in the working memory"), retrieval ("User preferences, travel domain knowledge, booking routines, etc."), tool calls ("Flights API, Weather API, Calendar, etc."), a final thought ("Review the context & decide"), and a response ("Respond & then clear the state & update memory"). Four flows connect this space to storage:
1. Task State Storage: the agent sends specific, in-progress task details to a temporary scratchpad to keep the main context window from getting cluttered.
2. Task Context Recall: the agent pulls the task details back into its active reasoning space to continue a multi-step process.
3. Retrieval: the agent retrieves relevant knowledge to inform its current decision, like past travel preferences or travel domain knowledge (airlines, airport codes, visa rules, etc.).
4. Memory Storage: after an interaction or tool call, the agent saves important information, new user preferences, or successful outcomes and learned workflows to its permanent memory for future use.]

Working Memory is a temporary scratchpad or buffer. For the flight-booking task it might hold the task parameters (task_id "book_flight_001", task_type "travel_booking", task_status "in_progress"), the task context (destination Tokyo, origin San Francisco, departure 2025-12-15, return 2025-12-22, budget_max 1200, preferred_time morning, preferred_airlines JAL and ANA, and the tools available), intermediate results (12 flights found, top candidates such as JAL005 at 1150 and ANA106 at 1180), and next steps (compare_amenities, check_baggage_policy, confirm_selection).

Long-Term Memory is a persistent storage system that retains and recalls information across sessions: episodic memory (past events, interactions, and preferences), semantic memory (general and domain knowledge), and procedural memory (learned routines and decision workflows). Entries can be merged or pruned over time.

Key Principles for Effective Memory Management

Effective memory management can make or break an LLM agent. Poor memory practices lead to error propagation, where bad information gets retrieved and amplifies mistakes across future tasks.

Here are some of the starting principles for getting things right:

Prune and Refine Your Memories
Memory isn't a write-once system. It needs regular maintenance. Periodically scan your long-term storage to remove duplicate entries, merge related information, or discard outdated facts. A simple metric for this could be recency and retrieval frequency. If a memory is old and rarely accessed, it might be a candidate for deletion, especially in evolving environments where old information can become a liability.

For example, a customer support agent might automatically prune conversation logs that are over 90 days old and marked as resolved, closed, or no longer active in memory. It could retain just the summaries (for trend detection and analysis) rather than full word-for-word transcripts.
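As a rough illustration of that recency-and-frequency heuristic, here is a minimal pruning sketch. The Memory record and the 90-day and access-count thresholds are invented defaults, not a prescription.

```python
# Minimal sketch of memory pruning based on recency and retrieval frequency.
# The Memory record and the thresholds below are illustrative defaults.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Memory:
    text: str
    created_at: datetime
    last_accessed: datetime
    access_count: int
    status: str = "active"                  # e.g. "active", "resolved", "closed"

def prune(memories: list[Memory], max_age_days: int = 90,
          min_access_count: int = 2) -> list[Memory]:
    """Keep memories that are recent, frequently used, or still active."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    kept = []
    for m in memories:
        is_stale = m.last_accessed < cutoff and m.access_count < min_access_count
        is_closed_and_old = m.status in ("resolved", "closed") and m.created_at < cutoff
        if is_stale or is_closed_and_old:
            continue                        # candidate for deletion (or summarize first)
        kept.append(m)
    return kept
```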
Be Selective About What You Store
Not every interaction deserves a permanent spot in long-term storage. Implement some sort of filtering criteria to assess information for quality and relevance before saving it. A bad piece of retrieved information can often lead to context pollution, where the agent repeatedly makes the same mistakes. One way to prevent this is to have the LLM "reflect" on an interaction and assign an importance score before committing it to memory.

Master the Art of Retrieval
Effective memory is less about how much you can store and more about how well you can retrieve the right piece of information at the right time. A simple blind search is often not enough, so advanced techniques like reranking (using an LLM to re-order retrieved results for relevance) and iterative retrieval (refining or expanding a search query over multiple steps) can be used to improve the quality of retrieved information.

Tools like the Query Agent and Personalization Agent offer these capabilities out of the box, enabling searches across multiple collections and reranking based on user preferences and interaction history.

Tailor the Architecture to the Task
There is no one-size-fits-all memory solution. A customer service bot needs a strong episodic memory to recall user history, while an agent that analyzes financial reports needs a robust semantic memory filled with domain-specific knowledge. Always start with the simplest approach that works (like a basic conversational buffer with the last 'n' queries and responses) and gradually layer in more advanced mechanisms as the use case demands.

Ultimately, memory is what elevates LLM agents from simple responders to intelligent, context-aware systems. Effective memory isn't simply passive storage; it's an active, managed process. The goal is to build agents that don't just store memory, but can manage it: knowing what to remember, what to forget, and how to use the past to reason about the future.
Tools

If memory gives an agent a sense of self, then tools are what give it superpowers. By themselves, LLMs are brilliant conversationalists and text manipulators, but they live inside a bubble. They can't check the current weather, book a flight, or look up real-time stock prices. They are, by design, disconnected from the living, breathing world of data and action.

This is where tools come in. A "tool" is anything that connects an LLM agent to the outside world, allowing it to take direct "action" in the real world and retrieve information required to fulfill a task. Integrating tools elevates an agent from just being a knowledgeable consultant to something that can actually get things done.

Context engineering for tools isn't just giving an agent a list of APIs and instructions. It's about creating a cohesive workflow where the agent can understand what tools are available, decide correctly which one to use for a specific task, and interpret the results to move forward.

[Figure: Query → Thought → Action → Observation → Response, repeated until the goal is satisfied.]

The Evolution: From Prompts to Actions

The journey to modern tool use has been a rapid evolution. Initially, developers tried to get action out of LLMs with good old prompt engineering, tricking the model into generating text that looked like a command. It was clever but prone to errors.

The real breakthrough was function calling, also known as tool calling. This capability, now native to most models, allows an LLM to output structured JSON that contains the name of a function to call and the arguments to use.

With this, there are a bunch of possibilities:

A Simple Tool: A travel agent bot can use a search_flights tool, and when a user asks, "Find me a flight to Tokyo next Tuesday," the LLM doesn't guess the answer. It generates a call to the function you provided, which in turn queries a real airline API.
A Chain of Tools: For a complex request like "Plan a weekend trip to San Francisco for me," the agent might need to chain several tools together: find_flights, search_hotels, and get_local_events. This requires the agent to reason, plan, and execute a multi-step workflow.

The work of context engineering here is in how you present these tools. A well-written tool description is like a mini-prompt that guides the model, making it crystal clear what the tool does, what inputs it needs, and what it returns.
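To make the mechanics concrete, here is a minimal sketch of dispatching a function call emitted by a model. The search_flights tool, its arguments, and the bare JSON shape are illustrative; real providers wrap the call in their own response format.

```python
# Minimal sketch of dispatching a function/tool call emitted by an LLM.
# The model returns structured JSON naming a function and its arguments;
# the application looks the function up and executes it with those arguments.
import json

def search_flights(destination: str, date: str) -> list[dict]:
    """Illustrative stand-in for a real airline API call."""
    return [{"flight": "XX123", "destination": destination, "date": date}]

TOOLS = {"search_flights": search_flights}

# What the model might emit for "Find me a flight to Tokyo next Tuesday":
model_output = '{"name": "search_flights", "arguments": {"destination": "Tokyo", "date": "2025-12-16"}}'

call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])   # execute the requested tool
print(result)                                       # feed this back to the model as an observation
```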
The Orchestration Challenge

Giving an agent a tool is easy (mostly). Getting it to use that tool reliably, safely, and effectively is where the real work begins. The central task of context engineering is orchestration, i.e., managing the flow of information and decision-making as the agent reasons about which tool to use.

This involves a few key steps that happen in the context window. Let's break down these key orchestration steps using Glowe, a skincare domain knowledge app powered by our Elysia orchestration framework, as our running example.

Tool Discovery: The agent needs to know what tools it has at its disposal. This is usually done by providing a list of available tools and their descriptions in the system prompt. The quality of these descriptions is critical. They are the agent's only guide to understanding what each tool does, allowing the model to understand when to use a tool and, more importantly, when to avoid it.

In Glowe, we configure a set of specialized tools (Step 5) with precise descriptions when initializing every new chat tree.

Tool Selection and Planning (Thought): When faced with a user request, the agent must reason about whether a tool is needed. If so, which one? For complex tasks, it might even need to chain multiple tools together, forming a plan (e.g., "First, search the web for the weather; then, use the email tool to send a summary").

Here, the decision agent correctly analyzed the incoming request and selected the product_agent tool.

Argument Formulation (Action): Once a tool is selected, the agent must figure out what arguments to pass to it. If the tool is get_weather(city, date), the agent needs to extract "San Francisco" and "tomorrow" from the user's query and format them correctly. This could also be a structured request or API call with the necessary information to use the tool.

In this case, the product_agent required a text query for searching the products collection. Notice how the agent also corrected itself (self-healing) after generating an ill-formed argument that initially caused an error (another key piece of orchestration).
Reflection (Observation): After executing the tool, the output (the "observation") is
fed back into the context window. The agent then reflects on this output to decide its
next step. Was the tool successful? Did it produce the information needed to answer
the user's query? Or did it return an error that requires a different approach?
As you can see, orchestration happens through this powerful feedback loop, often called the Thought-Action-Observation cycle. The agent observes the outcome of its action and uses that new information to fuel its next "thought," deciding whether the task is complete, if it needs to use another tool, or if it should ask the user for clarification. This Thought-Action-Observation cycle forms the fundamental reasoning loop in modern agentic frameworks like Elysia.
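Here is a minimal sketch of that loop. It is not the Elysia implementation; `think`, `act`, and `respond` are hypothetical stand-ins for the LLM reasoning call, the tool dispatcher, and the final answer generation.

```python
# Minimal sketch of a Thought-Action-Observation loop.
# `think(context)` returns (thought, tool_name, args), with tool_name None when done;
# `act` executes the chosen tool; `respond` writes the final answer from the context.

def thought_action_observation(query: str, think, act, respond,
                               max_steps: int = 5) -> str:
    context = [f"User query: {query}"]
    for _ in range(max_steps):
        thought, tool, args = think(context)           # Thought: reason about the next step
        context.append(f"Thought: {thought}")
        if tool is None:                                # the model decided it can answer
            break
        observation = act(tool, args)                   # Action: execute the chosen tool
        context.append(f"Action: {tool}({args})")
        context.append(f"Observation: {observation}")   # Observation: result goes back in context
    return respond(context)                             # generate the final answer
```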
The Next Frontier of Tool Use

The evolution of tool use is moving more and more towards standardization. While function/tool calling works well, it creates a fragmented ecosystem where each AI application needs custom integrations with every external system. The Model Context Protocol (MCP), introduced by Anthropic in late 2024, addresses this by providing a universal standard for connecting AI applications to external data sources and tools. They call it "USB-C for AI": a single protocol that any MCP-compatible AI application can use to connect to any MCP server.

So, instead of building custom integrations for each tool, developers can just create individual MCP servers that expose their systems through this standardized interface. Any AI application that supports MCP can then easily connect to these servers using the JSON-RPC based protocol for client-server communication. This transforms the MxN integration problem (where M applications each need custom code for N tools) into a much simpler M+N problem.

[Figure: Traditional integration vs. the MCP approach. Traditionally, each model needs a custom integration with each data source (3 models x 3 sources such as a database, cloud storage, and Git = 9 total connections). With MCP, models and data sources each integrate once with MCP (3 + 3 = 6 total connections). Visual inspired by https://humanloop.com/blog/mcp]
This shift towards composable, standardized architectures, where frameworks enable
developers to build agents from modular, interoperable components, represents the
future of AI tooling. It changes the engineer's role from writing custom integrations to
orchestrating adaptive systems that can easily connect to any standardized external
system.
Summary

Context engineering is about more than just prompting large language models, building retrieval systems, or designing AI architectures. It's about building interconnected, dynamic systems that reliably work across a variety of uses and users. All the components described in this ebook will continue to evolve as new techniques, models, and discoveries are made, but the difference between truly functional systems and the AI apps that fail will be how well they engineer context across their entire architecture. We are no longer thinking in terms of just prompting a model; we're looking at how we architect entire context systems.

Context engineering is made up of the components described in this ebook:

Agents to act as the system's decision-making brain.
Query Augmentation to translate messy human requests into actionable intent.
Retrieval to connect the model to facts and knowledge bases.
Memory to give your system a sense of history and the power to learn.
Tools to give your application hands to interact with live data and APIs.

[Figure: Simple prompt engineering (a user, an LLM, and a single prompt in the context window) versus context engineering (an AI agent curating a context window from many possible sources: system prompt, documents, tools, memory files, message history, comprehensive instructions, domain knowledge, and a Weaviate vector database). Visual inspired by "Effective context engineering for AI agents", Anthropic.]

We are moving on from being prompters who talk to a model to becoming architects who build the world the model lives in. We, the builders, the engineers, and the creators, know the truth: the best AI systems aren't born from bigger models, but from better engineering.

We can't wait to see what you build.

Ready to build the next generation of AI applications? Start today with a 14-day free trial of Weaviate Cloud (WCD).