Weaviate Context Engineering
Context Engineering
Designing the systems that control what information reaches the model and how it maintains coherence.
Table of Contents

Introduction
  What Is Context Engineering?
Agents
  What Are Agents?
  The Context Window Challenge
  Strategies and Tasks for Agents
  Where Agents Fit in Context Engineering
Query Augmentation
  Query Rewriting
  Query Expansion
  Query Decomposition
  Query Agents
Retrieval
  A Guide to Chunking Strategies
  Simple Chunking Strategies
  Advanced Chunking Strategies
  Pre-Chunking vs. Post-Chunking
  Summary
Prompting Techniques
  Classic Prompting Techniques
  Advanced Prompting Techniques
  Prompting for Tool Usage
  Using Prompt Frameworks
Memory
  The Architecture of Agent Memory
  Key Principles for Effective Memory Management
Tools
  The Evolution: From Prompts to Actions
  The Orchestration Challenge
  The Next Frontier of Tool Use
Summary
  The Future of AI Engineering
Introduction
Every developer who builds with Large Language Models (LLMs) eventually hits the same
wall. You start with a powerful model that can write, summarize, and reason with
stunning capability. But when you try to apply it to a real-world problem, the cracks start
to appear. It can't answer questions about your private documents. It has no knowledge
of events that happened yesterday. It confidently makes things up when it doesn't know
an answer.
[Figure: A context engineering system. The user's input reaches an agent that handles coordination and decision making, drawing on vector search (retrieval), RAG over databases, tools (including MCP and action tools), short-term memory (the chat history stored in the prompt), and long-term memory, before the prompt is updated with the answer and returned to the user.]

The problem isn't the model's intelligence. The problem is that it's fundamentally disconnected. It's a powerful but isolated brain, with no access to your specific data, the live internet, or even a memory of your last conversation. This isolation is a direct result of its core architectural limit: the context window. The context window is the model's active working memory, the finite space where it holds the instructions and information for the current task. Every word, number, and piece of punctuation consumes space in this window. Just like a whiteboard, once it's full, older information gets erased to make room for new instructions, and important details can be lost.

You can't fix this fundamental limitation by just writing better prompts. You have to build a system around the model. That is Context Engineering.

Context Engineering is the discipline of designing the architecture that feeds an LLM the right information at the right time. It's not about changing the model itself, but about building the bridges that connect it to the outside world: retrieving external data, connecting it to live tools, and giving it a memory to ground its responses in facts, not just its training data.
This ebook is the blueprint for that system. We will cover the core components required to turn a brilliant but isolated model into a reliable, production-ready application.

Mastering these components is the difference between a reasonable demo and a truly intelligent system. Let's get to work.
Context engineering is made up of six core components:

Agents: The decision-making brain that orchestrates how and when to use information.
Query Augmentation: The art of translating messy, ambiguous user requests into precise, machine-readable intent.
Retrieval: The bridge connecting the LLM to your specific documents and knowledge bases.
Prompting Techniques: The skill of giving clear, effective instructions to guide the model's reasoning.
Memory: The system that gives your application a sense of history and the ability to learn from interactions.
Tools: The hands that allow your application to take direct action and interact with live data sources.
Agents

As soon as you start building real systems with large language models, you run into the limits of static pipelines. A fixed recipe of "retrieve, then generate" works fine for simple Retrieval Augmented Generation (RAG) setups, but it falls apart once the task requires judgment, adaptation, or multi-step reasoning.

This is where agents come in. In the context of context engineering, agents manage how (and how well) information flows through a system. Instead of blindly following a script, agents can evaluate what they know, decide what they still need, select the right tools, and adjust their strategy when things go wrong.

What Are Agents?
The term "agent" gets used broadly, so let's define it in the context of building with large language models (LLMs). An AI agent is a system that can:

1. Make dynamic decisions about information flow. Rather than following a predetermined path, agents decide what to do next based on what they've learned.
2. Maintain state across multiple interactions. Unlike simple Q&A systems, agents remember what they've done and use that history to inform future decisions.
3. Use tools adaptively. They can select from available tools and combine them in ways that weren't explicitly programmed.
4. Modify their approach based on results. When one strategy isn't working, they can try different approaches.

[Figure: A typical agent flow. A user query is first checked against memory ("Answered before?"). If not, the agent decides whether additional information is required, decomposes the query into sub-queries, routes and processes each sub-query with search and memory tools, checks whether the retrieved information is relevant, and then generates a response. Legend: reasoning, tool use, memory.]

Agents are both the architects of their contexts and the users of those contexts. However, they need good practices and systems to guide them, because managing context well is difficult, and getting it wrong quickly sabotages everything else the agent can do.

Single-Agent Architecture: A single agent attempts to handle all tasks itself, which works well for moderately complex workflows.
Multi-Agent Architecture: Work is distributed across specialized agents per task. This allows for complex workflows but introduces coordination challenges.
The Context Window Challenge

LLMs have limited information capacity because the context window can only hold so much information at once. This fundamental constraint shapes what agents and agentic systems are currently capable of.

Every time an agent is processing information, it needs to make decisions about:

What information should remain active in the context window
What should be stored externally and retrieved when needed
What can be summarized or compressed to save space
How much space to reserve for reasoning and planning

It's tempting to assume that bigger context windows solve this problem, but this is simply not the case. Longer contexts (hundreds of thousands or even ~1M tokens) actually introduce new failure modes. Performance often begins to degrade far before the model reaches maximum token capacity, where agents become confused, hallucinate at higher rates, or simply stop performing at the level they're normally capable of. This isn't just a technical limitation; it's a core design challenge of any AI app.

Context Hygiene
This is one of the most critical parts of managing agentic systems. Agents don't just need memory and tools; they also need to monitor and manage the quality of their own context. That means avoiding overload, detecting irrelevant or conflicting information, pruning or compressing as needed, and keeping their in-context memory clean enough to reason effectively.

Here are some common types of errors that begin to happen or increase as context window size grows:

Context Poisoning: Incorrect or hallucinated information enters the context. Because agents reuse and build upon that context, these errors persist and compound.
Context Distraction: The agent becomes burdened by too much past information (history, tool outputs, summaries) and over-relies on repeating past behavior rather than reasoning fresh.
Context Confusion: Irrelevant tools or documents crowd the context, distracting the model and causing it to use the wrong tool or instructions.
Context Clash: Contradictory information within the context misleads the agent, leaving it stuck between conflicting assumptions.
Strategies and Tasks for Agents

Agents are able to effectively orchestrate context systems because of their ability to reason and make decisions in a dynamic way. Here are some of the most common tasks agents are built for and employ to manage contexts.

Context Summarization: Periodically compressing accumulated history into summaries to reduce burden while preserving key knowledge.
Context Pruning: Actively removing irrelevant or outdated context, either with specialized pruning models or a dedicated LLM tool.
Context Offloading: Storing details externally and retrieving them only when needed, instead of keeping everything in active context.
Adaptive Retrieval Strategies: Reformulating queries, switching knowledge bases, or changing chunking strategies when initial attempts fail.
Dynamic Tool Selection: Instead of dumping every possible tool into the prompt, agents filter and load only those relevant to the task. This reduces confusion and improves accuracy.
Multi-Source Synthesis: Combining information from multiple sources, resolving conflicts, and producing coherent answers.
Quality Validation: Checking whether retrieved information is consistent and useful.

Where Agents Fit in Context Engineering
Agents serve as coordinators in your context engineering system. They don't replace the techniques covered in other sections; instead, they orchestrate them intelligently. An agent might apply query rewriting when initial searches are unsuccessful, choose different chunking strategies based on the type of content it encounters, or decide when conversation history should be compressed to make room for new information. They provide the orchestration layer needed to make dynamic, context-appropriate decisions about information management.

Different types of agents and functions within a context engineering system:

[Figure: A supervisor agent plans, routes requests to specialized agents (query rewriter, retriever, answer synthesizer, data collection selector, tool router, compressor), changes strategy on failure, and sends context for synthesis of the final response. These specialized agents draw on memory (short-term working memory, long-term episodic and factual stores in a vector database) and external knowledge sources (vector database knowledge collections, tools and APIs, web and search APIs).]
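To make a couple of these tasks concrete, here is a minimal sketch of context summarization and offloading under a token budget. It assumes a hypothetical `complete()` helper that calls whatever LLM you use, and it estimates tokens as characters divided by four rather than using a real tokenizer.

```python
# Minimal sketch of context summarization + offloading under a token budget.
# `complete` is a hypothetical LLM helper; token counts are estimated as len(text) // 4.

def estimate_tokens(text: str) -> int:
    return len(text) // 4                      # rough heuristic, not a real tokenizer

class ContextManager:
    def __init__(self, complete, budget_tokens: int = 4000, keep_recent: int = 6):
        self.complete = complete
        self.budget = budget_tokens
        self.keep_recent = keep_recent
        self.summary = ""                      # compressed view of older turns
        self.messages: list[str] = []          # active, verbatim history
        self.archive: list[str] = []           # offloaded turns (could live in a vector DB)

    def add(self, message: str) -> None:
        self.messages.append(message)
        if estimate_tokens(self.summary + "".join(self.messages)) > self.budget:
            self._compress()

    def _compress(self) -> None:
        old = self.messages[:-self.keep_recent]
        if not old:                            # nothing old enough to compress yet
            return
        self.messages = self.messages[-self.keep_recent:]
        self.archive.extend(old)               # offload full text for later retrieval
        self.summary = self.complete(          # summarize old turns to reclaim space
            "Summarize the key facts and decisions so far:\n"
            + self.summary + "\n" + "\n".join(old)
        )

    def context(self) -> str:
        return ("Summary of earlier conversation:\n" + self.summary
                + "\n\n" + "\n".join(self.messages))
```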
Query Augmentation

One of the most important steps of context engineering is how you prepare and present the user's query. Without knowing exactly what the user is asking, the LLM cannot provide an accurate response.

Though this sounds simple, it's actually quite complex. There are two main issues to think about:

1. Users often don't interact with chatbots or inputs in the ideal way. Many product builders develop and test chatbots with queries that state the request and all the additional information the LLM would need to understand the question in a succinct, perfectly punctuated, clear way. Unfortunately, in the real world, user interactions with chatbots can be unclear, messy, and incomplete. To build robust systems, it's important to implement solutions that deal with all types of interactions, not just ideal ones.

2. Different parts of the pipeline need to deal with the query in different ways. A question that an LLM could understand well might not be the best format to search a vector database with. Or, a query term that works best for a vector database could be incomplete for an LLM to answer. Therefore, we need a way to augment the query that suits different tools and steps within the pipeline.

Remember that query augmentation addresses the "garbage in, garbage out" problem at the very start of your pipeline. No amount of sophisticated retrieval algorithms, advanced reranking models, or clever prompt engineering can fully compensate for misunderstood user intent.

Query Rewriting

Query rewriting transforms the original user query into a more effective version for information retrieval. Instead of just doing retrieve-then-read, applications now follow a rewrite-retrieve-read approach. This technique restructures oddly written questions so they can be better understood by the system, removes irrelevant context, introduces common keywords that improve matching with correct context, and can split complex questions into simpler sub-questions.

[Figure: The raw query "How do i make this work when my api call keeps failing?" is passed to a query re-writer (LLM), which produces the rewritten query "API call failure, troubleshooting authentication headers, rate limiting, network timeout, 500 error".]

RAG applications are sensitive to the phrasing and specific keywords of the query, so this technique works by:

Restructuring Unclear Questions: Transforms vague or poorly formed user input into precise, information-dense terms.
Context Removal: Eliminates irrelevant information that could confuse the retrieval process.
Keyword Enhancement: Introduces common terminology that increases the likelihood of matching relevant documents.
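As a rough illustration, here is a minimal sketch of the rewrite-retrieve-read pattern. The `complete` and `search` helpers are hypothetical stand-ins for your LLM call and your retrieval system, and the prompt wording is illustrative.

```python
# Minimal sketch of rewrite-retrieve-read (assumes a hypothetical `complete()`
# helper that sends a prompt to whatever LLM you use and returns its text).

REWRITE_PROMPT = """You rewrite user questions for a document search engine.
Remove filler, fix typos, and add likely keywords.
Return only the rewritten query.

User question: {question}
Rewritten query:"""

def rewrite_query(question: str, complete) -> str:
    """Turn a messy user question into a retrieval-friendly query."""
    return complete(REWRITE_PROMPT.format(question=question)).strip()

def rewrite_retrieve_read(question: str, complete, search) -> str:
    """Rewrite -> retrieve -> read, instead of retrieving on the raw question."""
    query = rewrite_query(question, complete)          # rewrite
    docs = search(query, top_k=5)                      # retrieve with the cleaner query
    context = "\n\n".join(docs)
    answer_prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return complete(answer_prompt)                     # read / generate
```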
Query Expansion

Query expansion enhances retrieval by generating multiple related queries from a single user input. This approach improves results when user queries are vague, poorly formed, or when you need broader coverage, such as with keyword-based retrieval systems.

[Figure: The raw query "Open source NLP tools" is passed to a query re-writer (LLM), which expands it into related queries such as "Natural language processing tools", "Free nlp libraries", "Open source language processing platforms", and "NLP software with open source code", all of which feed the context window.]

However, query expansion comes with challenges that need careful management:

Query Drift: Expanded queries may diverge from the user's original intent, leading to irrelevant or off-topic results.
Over-Expansion: Adding too many terms can reduce precision and retrieve excessive irrelevant documents.
Computational Overhead: Processing multiple queries increases system latency and resource usage.
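Here is a minimal sketch of query expansion under the same assumptions (hypothetical `complete` and `search` helpers). It keeps the original query, caps the number of variants to limit the overhead described above, and de-duplicates the merged results.

```python
# Minimal sketch of query expansion (assumes hypothetical `complete()` and `search()`
# helpers; de-duplication is naive on purpose).

EXPANSION_PROMPT = """Generate {n} alternative search queries for the question below.
Use synonyms and related terminology. One query per line.

Question: {question}"""

def expand_query(question: str, complete, n: int = 4) -> list[str]:
    lines = complete(EXPANSION_PROMPT.format(question=question, n=n)).splitlines()
    queries = [line.strip("-• ").strip() for line in lines if line.strip()]
    return [question] + queries[:n]              # always keep the original query

def expanded_search(question: str, complete, search, top_k: int = 3) -> list[str]:
    seen, results = set(), []
    for query in expand_query(question, complete):
        for doc in search(query, top_k=top_k):   # run every variant against the index
            if doc not in seen:                  # merge results, drop duplicates
                seen.add(doc)
                results.append(doc)
    return results
```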
Query Decomposition

Query decomposition breaks down complex, multi-faceted questions into simpler, focused sub-queries that can be processed independently. This technique is especially good for questions that require information from multiple sources or involve several related concepts.

The process typically involves two main stages:

Decomposition Phase: An LLM analyzes the original complex query and breaks it into smaller, focused sub-queries. Each sub-query targets a specific aspect of the original question.
Processing Phase: Each sub-query is processed independently through the retrieval pipeline, allowing for more precise matching with relevant documents.

After retrieval, the context engineering system must aggregate and synthesize results from all sub-queries to generate a coherent, comprehensive answer to the original complex query.
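Here is a minimal sketch of that decompose-retrieve-synthesize flow, again assuming hypothetical `complete` and `search` helpers.

```python
# Minimal sketch of query decomposition (same hypothetical `complete()` and
# `search()` helpers as above; synthesis is a single final LLM call).

DECOMPOSE_PROMPT = """Break the complex question below into 2-4 simpler,
self-contained sub-questions, one per line.

Question: {question}"""

def decompose(question: str, complete) -> list[str]:
    lines = complete(DECOMPOSE_PROMPT.format(question=question)).splitlines()
    return [line.strip("-• ").strip() for line in lines if line.strip()]

def answer_complex_question(question: str, complete, search) -> str:
    findings = []
    for sub in decompose(question, complete):        # decomposition phase
        docs = search(sub, top_k=3)                  # processing phase, per sub-query
        findings.append(f"Sub-question: {sub}\n" + "\n".join(docs))
    synthesis_prompt = (
        "Using the findings below, answer the original question.\n\n"
        + "\n\n".join(findings)
        + f"\n\nOriginal question: {question}"
    )
    return complete(synthesis_prompt)                # aggregate and synthesize
```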
Query Agents

Query agents are the most advanced form of query augmentation, using AI agents to intelligently handle the entire query processing pipeline and combining the techniques above. A query agent takes a user's prompt or question in natural language and decides the best way to structure the query based on its knowledge of the database and data structure, and it can iteratively decide to re-query and adjust based on the results returned.

Analysis: Use generative models (e.g. large language models) to analyze the task and the required queries. Determine the exact queries to perform.

Dynamic Query Construction: Rather than using predetermined query patterns, the agent constructs queries on demand based on understanding both the user intent and the data schema. This means it can add filters and adjust search terms automatically to find the most relevant results in the database, as well as choosing to run searches, aggregations, or even both at the same time for you.

Query Execution: Formulates and sends queries to the agent's chosen collection or collections.

Multi-Collection Routing: The agent understands the structure of all of your collections, so it can intelligently decide which data collections to query based on the user's question.

Evaluation: The agent can evaluate the retrieved information within the context of the original user query. If there is missing information, the agent can try a different knowledge source or a new query.

(Optional) Response Generation: Receive the results from the database, and use a generative model to generate the final response to the user's prompt or query.

Contextual Awareness: The context may also include previous conversation history and any other relevant information. The agent can maintain conversation context for follow-up questions.

[Figure: The query agent loop. A user query is analyzed, a query is constructed (search, aggregation, or both), a collection is chosen (e.g. Collection A or Collection B in the vector database), and the query is executed. If the retrieved information isn't relevant, the agent re-queries; once it is, the agent either finalizes the context or generates a text response.]
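The sketch below shows the shape of such a loop. It is not the Weaviate Query Agent API; `pick_collection`, `run_query`, `is_sufficient`, and `complete` are hypothetical stand-ins for the routing, execution, evaluation, and generation steps described above.

```python
# Minimal sketch of an iterative query-agent loop with hypothetical helpers.

def query_agent(question: str, complete, pick_collection, run_query,
                is_sufficient, max_rounds: int = 3) -> str:
    context, query = [], question
    for _ in range(max_rounds):
        collection = pick_collection(query)          # multi-collection routing
        results = run_query(collection, query)       # dynamic query construction + execution
        context.extend(results)
        if is_sufficient(question, context):         # evaluation against the original question
            break
        # not enough information yet: ask the LLM for a refined follow-up query
        query = complete(
            f"The question '{question}' is not fully answered by:\n"
            + "\n".join(context) + "\nWrite one follow-up search query."
        ).strip()
    return complete(                                 # optional response generation
        "Answer the question using the retrieved context.\n\n"
        + "\n".join(context) + f"\n\nQuestion: {question}"
    )
```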
Retrieval

A Large Language Model is only as good as the information it can access. While LLMs are trained on massive datasets, they lack knowledge of your specific, private documents and any information created after their training was completed. To build truly intelligent applications, you need to feed them the right external information at the right time. This process is called Retrieval. Pre-Retrieval and Retrieval steps make up the first parts of many AI architectures that rely on context engineering, such as Retrieval Augmented Generation (RAG).

[Figure: A RAG pipeline. In the pre-retrieval stage, documents are split into chunks and embedded into a vector database. At query time, the query is embedded, relevant chunks (e.g. Doc 1, Chunk 1; Doc 2, Chunks 1-2; Doc 3, Chunk 1) are retrieved and placed into the context window alongside the system prompt ("You are a helpful AI assistant...") and the user query ("What is a vector database?"), and the LLM generates the response from the assembled prompt.]

A Guide to Chunking Techniques

The challenge is simple in concept but tricky in practice: a raw dataset of documents is almost always too large to fit into an LLM's limited context window (the inputs given to an AI model). We can't just hand the model an entire set of user manuals or research papers. Instead, we must find the perfect piece of those documents, the single paragraph or section that contains the answer to a user's query.

To make our vast knowledge bases searchable and find that perfect piece, we must first break our documents down into smaller, manageable parts. This foundational process, known as chunking, is the key to successful retrieval.

Chunking is the most important decision you will make for your retrieval system's performance. It is the process of breaking down large documents into smaller, manageable pieces. Get it right, and your system will be able to pinpoint relevant facts with surgical precision. Get it wrong, and even the most advanced LLM will fail.

Learn how chunking strategies can help improve your RAG performance and explore different chunking methods. Read the complete blog post here: weaviate.io/blog/chunking-strategies-for-rag
When designing your chunking strategy, you must balance two competing priorities:

Retrieval Precision: Chunks need to be small and focused on a single idea. This creates a distinct, precise embedding, making it easier for a vector search system to find an exact match for a user's query. Large chunks that mix multiple topics create "averaged," noisy embeddings that are hard to retrieve accurately.
Contextual Richness: Chunks must be large and self-contained enough to be understood. After a chunk is retrieved, it is passed to the LLM. If the chunk is just an isolated sentence without context, even a powerful model will struggle to generate a meaningful response.

The Chunking Strategy Matrix

[Figure: A two-by-two matrix of retrieval precision against contextual richness.]
Precise but Incomplete: Overly small chunks (e.g., single sentences) that are easy to find but lack the context for the LLM to generate a good response.
The Sweet Spot (Optimal Chunks): Semantically complete paragraphs that are focused enough to be found and rich enough to be understood.
Rich but Unfindable: Oversized chunks that contain the answer but have "noisy" embeddings, making them impossible for the retrieval system to find accurately.
The Failure Zone: Poorly constructed, random chunks that are neither findable nor useful, the worst of both worlds.

The goal is to find the "chunking sweet spot": creating chunks that are small enough for precise retrieval but complete enough to give the LLM the full context it needs. Your choice of strategy will depend on the nature of your documents and the needs of your application.

Simple Chunking Techniques

Fixed-Size Chunking: The simplest method. The text is split into chunks of a predetermined size (e.g., 512 tokens). It's fast and easy but can awkwardly cut sentences in half. Using an overlap (e.g., 50 tokens) between chunks helps mitigate this.

Recursive Chunking: A more intelligent approach that splits text using a prioritized list of separators (like paragraphs, then sentences, then words). It respects the document's natural structure and is a solid default choice for unstructured text.
1. Define a hierarchy of separators (e.g., paragraphs → sentences → words)
2. Split the text using the highest-level separator
3. If chunks are too big, split again using the next separator
4. Repeat until all chunks fit within the desired size while preserving meaning

Document-Based Chunking: This method uses the document's inherent structure. For example, it splits a Markdown file by its headings (#, ##), an HTML file by its tags (<p>, <div>), or source code by its functions.
1. Identify logical document boundaries (e.g., chapters, sections, headings)
2. Group content under each boundary into cohesive units
3. Vectorize each unit as a standalone chunk
4. Store chunks with metadata linking them to their source document and section

A minimal code sketch of the fixed-size and recursive approaches follows below.
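Here is that sketch of the two simplest approaches: fixed-size chunking with overlap and recursive, separator-based splitting. Sizes are counted in characters for simplicity (real systems usually count tokens), and the separator hierarchy is an assumption you would tune per document type.

```python
# Minimal sketch: fixed-size chunking with overlap, and recursive splitting.
# Sizes are in characters for simplicity; real systems usually count tokens.

def fixed_size_chunks(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Slice text into windows of `size` characters, overlapping by `overlap`."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap                     # step back by `overlap` to soften hard cuts
    return chunks

def recursive_chunks(text: str, size: int = 512,
                     separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the highest-level separator present; recurse on pieces still too big."""
    if len(text) <= size:
        return [text]
    for sep in separators:
        if sep not in text:
            continue
        pieces = text.split(sep)
        chunks, current = [], ""
        for piece in pieces:
            candidate = (current + sep + piece) if current else piece
            if len(candidate) <= size:
                current = candidate                 # keep packing pieces into this chunk
                continue
            if current:
                chunks.append(current)
                current = ""
            if len(piece) <= size:
                current = piece
            else:                                   # piece itself too big: use finer separators
                chunks.extend(recursive_chunks(piece, size, separators))
        if current:
            chunks.append(current)
        return chunks
    return fixed_size_chunks(text, size)            # no separator found: fall back
```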
Advanced Chunking Techniques

Semantic Chunking: Instead of using separators, this technique splits text based on meaning. It groups semantically related sentences together and creates a new chunk only when the topic shifts, resulting in highly coherent, self-contained chunks.
1. Split text into sentences or paragraphs
2. Vectorize windows of sentences
3. Calculate the cosine distance between all pairs
4. Merge until a breakpoint is reached

Hierarchical Chunking: Creates multiple layers of chunks at different levels of detail (e.g., top-level summaries, mid-level sections, and granular paragraphs). This allows a retrieval system to start with a broad overview and then drill down into specifics as needed.

LLM-Based Chunking: Uses a Large Language Model to intelligently process a document and generate semantically coherent chunks. Instead of relying on fixed rules, the LLM can identify logical propositions or summarize sections to create meaning-preserving pieces.

Agentic Chunking: This takes the concept a step further than LLM-Based Chunking. An AI agent dynamically analyzes a document's structure and content to select the best chunking strategy (or combination of strategies) to apply for that specific document.

[Figure: An agent analyzes each document's format and content (HTML, PDF, Docx, Markdown) and selects document-based, fixed-size, semantic, or hybrid chunking to produce optimized chunks.]

Late Chunking: An architectural pattern that inverts the standard process. It embeds the entire document first to create token-level embeddings with full context. Only then is the document split into chunks, with each chunk's embedding derived from these pre-computed, context-rich tokens.
1. Embed the entire document using a long-context model to generate token-level embeddings.
2. Chunk the token embeddings (instead of the raw text).
3. Preserve context: because the embeddings were created with full document context, each token preserves its relationship to tokens in neighboring chunks.
4. Pool strategically: instead of pooling all tokens into one vector, late chunking pools tokens according to your chunking strategy to get multiple contextually-aware embeddings per document.

A minimal code sketch of the semantic approach follows below.
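Here is the promised sketch of semantic chunking. It compares adjacent sentences rather than windowed pairs, which is a common simplification; the `embed` helper and the 0.35 breakpoint threshold are assumptions you would replace with your own embedding model and tuning.

```python
# Minimal sketch of semantic chunking (the `embed()` helper is a hypothetical
# sentence-embedding function returning a list of floats; thresholds are illustrative).
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def semantic_chunks(sentences: list[str], embed, breakpoint: float = 0.35) -> list[str]:
    """Group consecutive sentences; start a new chunk when the topic shifts."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]              # vectorize each sentence
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # a large distance between neighbouring sentences marks a topic boundary
        if cosine_distance(vectors[i - 1], vectors[i]) > breakpoint:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```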
Pre-Chunking vs. Post-Chunking

Beyond how you chunk, a key system design choice is when you chunk. This decision leads to two primary architectural patterns.

Pre-Chunking
The most common method, where all data processing happens up front and offline, before any user queries come in.
Workflow: Clean Data -> Chunk Documents -> Embed & Store Chunks
[Figure: Documents are cleaned (headers, footers, special characters removed), split into smaller chunks (e.g., 500 tokens per chunk, semantic chunks, hierarchical chunks), embedded, and stored in a vector database. At query time, chunks are retrieved by semantic similarity and passed to the LLM via an augmented prompt template for generation.]
PRO: Retrieval is extremely fast at query time because all the work has already been done. The system only needs to perform a quick similarity search.
CON: The chunking strategy is fixed. If you decide to change your chunk size or method, you must re-process your entire dataset.

Post-Chunking
An advanced, real-time alternative where chunking happens after a document has been retrieved, in direct response to a user's query.
Workflow: Store Documents -> Retrieve Relevant Document -> Chunk Dynamically
[Figure: Full documents are cleaned, embedded, and stored. At query time, full documents are retrieved by semantic similarity, then chunked and reranked before being added to the LLM context window for generation.]
PRO: It's highly flexible. You can create dynamic chunking strategies that are specific to the context of the user's query, potentially leading to more relevant results.
CON: It adds latency. The chunking process happens in real time, making the first response slower for the end user. It also requires more complex infrastructure to manage.

We built a post-chunking strategy into Elysia, our open source agentic RAG framework. You can read more about that here: https://weaviate.io/blog/elysia-agentic-rag#chunk-on-demand-smarter-document-processing
Guide to Choosing Your Chunking Strategy

Fixed-Size
How it works: Splits by token or character count.
Best for: Small or simple docs, or when speed matters most.
Examples: Meeting notes, short blog posts, emails, simple FAQs.

Recursive
How it works: Splits text by repeatedly dividing it until it fits the desired chunk size, often preserving some structure.
Best for: Documents where some structure should be maintained but speed is still important.
Examples: Research articles, product guides, short reports.

Document-Based
How it works: Splits only at document boundaries or by structural elements like headers.
Best for: Collections of short, standalone documents or highly structured files.
Examples: News articles, customer support tickets, Markdown files.

Semantic
How it works: Splits text at natural meaning boundaries (topics, ideas).
Best for: Technical, academic, or narrative documents where topics shift without clear separators.
Examples: Scientific papers, textbooks, novels, whitepapers.

LLM-Based
How it works: Uses a language model to decide chunk boundaries based on context and meaning.
Best for: Complex text where meaning-aware chunking improves downstream tasks like Q&A.
Examples: Long reports, legal opinions, medical records.

Agentic
How it works: Lets an AI agent decide how to split based on meaning and structure.
Best for: Complex, nuanced documents that require custom strategies.
Examples: Regulatory filings, multi-section contracts, corporate policies.

Late Chunking
How it works: Embeds the whole document first, then derives chunk embeddings from it.
Best for: Use cases where chunks need awareness of the full document's context.
Examples: Case studies, comprehensive manuals, long-form analysis reports.

Hierarchical
How it works: Breaks text into multiple levels (sections → paragraphs → sentences).
Best for: Large, structured documents where both summary and detail are needed.
Examples: Employee handbooks, government regulations, software documentation.

Summary

The effectiveness of your Retrieval Augmentation system is not determined by a single "magic bullet," but by a series of deliberate engineering choices. The quality of the context you provide to an LLM is a direct result of two key decisions:

The Chunking Strategy (the "how"): The method you choose to break down your documents.
The Architectural Pattern (the "when"): The point at which you perform the chunking.

Mastering these two elements is fundamental to context engineering. A well-designed retrieval system is the difference between an LLM that guesses and one that provides fact-based, reliable, and contextually relevant answers.
Prompting Techniques

Prompt engineering is the practice of designing, refining, and optimizing inputs (prompts) given to Large Language Models (LLMs) to get your desired output. The quality and effectiveness of LLMs are heavily influenced by the prompts they receive, and the way you phrase a prompt can directly affect the accuracy, usefulness, and clarity of the response.

It's essentially about interacting with AI efficiently: giving it instructions, examples, or questions that guide the model toward the output you need.

In this section, we'll go over prompting techniques that are essential for improving Retrieval-Augmented Generation (RAG) applications and overall LLM performance.

Important Note: Prompt engineering focuses on how you phrase instructions for the LLM. Context engineering, on the other hand, is about structuring the information and knowledge you provide to the model, such as retrieved documents, user history, or domain-specific data, to maximize the model's understanding and relevance. Many of the techniques below (CoT, Few-shot, ToT, ReAct) are most effective when combined with well-engineered context.

Classic Prompting Techniques

Chain of Thought
This technique involves asking the model to "think step-by-step" and break down complex reasoning into intermediate steps. This is especially helpful when retrieved documents are dense or contain conflicting information that requires careful analysis. By verbalizing its reasoning process, the LLM can arrive at more accurate and logical conclusions.

Few-Shot Prompting
This approach provides the LLM with a few examples in the context window that demonstrate the type of output or "golden" answers you want. Showing examples helps the model understand the desired format, style, or reasoning approach, improving response accuracy and relevance, especially for specialized or technical domains.

Combining CoT and Few-shot examples is a powerful way to guide both the model's reasoning process and its output format for maximum efficiency.

Pro Tip #1: Make the model's reasoning in Chain of Thought very specific to your use case. For example, you might ask the model to:
Evaluate the environment
Repeat any relevant information
Explain the importance of this information to the current request

Pro Tip #2: Maximize efficiency and reduce token count by asking the model to reason in a "draft" form, using no more than 5 words per sentence. This makes sure that the model's thought process is visible while reducing output token count.
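To make the combination concrete, here is a minimal sketch of a prompt that pairs few-shot examples with a chain-of-thought instruction. The support-ticket domain, the example answers, and the `complete` helper are invented for illustration.

```python
# Minimal sketch: a few-shot prompt with an explicit chain-of-thought instruction.
# The support-ticket domain and example answers are illustrative, not from a real system.

FEW_SHOT_COT_PROMPT = """You classify customer support tickets as 'billing', 'bug', or 'other'.
Think step by step before answering, then give the label on its own line.

Ticket: "I was charged twice for my subscription this month."
Reasoning: The user mentions a charge and a subscription, which concerns payment.
Label: billing

Ticket: "The export button does nothing when I click it."
Reasoning: A feature is not behaving as expected, which is a product defect.
Label: bug

Ticket: "{ticket}"
Reasoning:"""

def classify_ticket(ticket: str, complete) -> str:
    """`complete` is a hypothetical helper that sends the prompt to your LLM."""
    output = complete(FEW_SHOT_COT_PROMPT.format(ticket=ticket))
    # the label is the last non-empty line, after the model's visible reasoning
    return output.strip().splitlines()[-1].replace("Label:", "").strip()
```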
Advanced Prompting Strategies

Building on classic techniques, advanced strategies guide LLMs in more sophisticated ways:

Tree of Thoughts (ToT): ToT builds on CoT by instructing the model to explore and evaluate multiple reasoning paths in parallel, much like a decision tree. The model can generate several different solutions to a problem and choose the best result. This is especially useful in RAG when there are many potential pieces of evidence and the model needs to weigh different possible answers based on multiple retrieved documents.

ReAct Prompting: This framework combines CoT with agents, enabling the model to "Reason" (think) and "Act" dynamically. The model generates both reasoning traces and actions in an interleaved manner, allowing it to interact with external tools or data sources and adjust its reasoning iteratively. ReAct can improve RAG pipelines by enabling LLMs to interact with retrieved documents in real time, updating reasoning and actions based on external knowledge to give more accurate and relevant responses.

Prompting for Tool Usage

When your LLM interacts with external tools, clear prompting ensures correct tool selection and usage.

Defining Parameters and Execution Conditions
LLMs can sometimes make incorrect tool selections or use tools in suboptimal ways. To prevent this, prompts should clearly define:

When to use a tool: Specify scenarios or conditions that trigger a particular tool.
How to use a tool: Provide expected inputs, parameters, and desired outputs.
Examples: Include few-shot examples showcasing correct tool selection and usage for various queries. For instance:
User Query: "What's the weather like in Paris?" -> Use Weather_API with city="Paris"
User Query: "Find me a restaurant near the Eiffel Tower." -> Use Restaurant_Search_Tool with location="Eiffel Tower"

This very precise guidance, which should be included as part of your overall tool description, helps the LLM understand the exact boundaries and functionality of each available tool, minimizing errors and improving overall system reliability.

Pro Tip: How to Write an Effective Tool Description
The LLM's decision to use your tool depends entirely on its description. Make it count:

Use an Active Verb: Start with a clear action. get_current_weather is better than weather_data.
Be Specific About Inputs: Clearly state what arguments the tool expects and their format (e.g., city (string), date (string, YYYY-MM-DD)).
Describe the Output: Tell the model what to expect in return (e.g., returns a JSON object with "high", "low", and "conditions").
Mention Limitations: If the tool only works for a specific region or time frame, say so (e.g., Note: Only works for cities in the USA).
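Putting those tips together, here is a minimal sketch of a tool definition in the JSON-schema style that most function-calling APIs accept. The exact wrapper fields vary by provider, so treat the structure as illustrative rather than any specific vendor's format.

```python
# Minimal sketch of a tool definition following the pro tips above.
# The JSON-schema layout is typical of function-calling APIs, but field names
# vary by provider; check your model's documentation for the exact format.

get_current_weather = {
    "name": "get_current_weather",                       # active verb, not "weather_data"
    "description": (
        "Get the current weather for a city. "
        "Returns a JSON object with 'high', 'low', and 'conditions'. "
        "Note: only works for cities in the USA."         # limitations stated up front
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Austin'"},
            "date": {"type": "string", "description": "Date in YYYY-MM-DD format"},
        },
        "required": ["city"],
    },
}
```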
Using Prompt Frameworks

If you are building a project that requires extensive prompting or want to systematically improve your LLM results, you could consider using frameworks like DSPy, Llama Prompt Ops, or Synalinks.

That said, you don't necessarily need to use a framework. Following the prompting guidelines outlined here (clear instructions, Chain of Thought, Few-shot Learning, and advanced strategies) can achieve highly effective results without additional frameworks. Think of these frameworks as optional helpers for complex projects, not a requirement for everyday prompt engineering.
Memory

When you're building agents, memory isn't just a bonus feature; it's the very thing that breathes life into them. Without it, an LLM is just a powerful but stateless text processor that responds to one query at a time with no sense of history. Memory transforms these models into something that feels more dynamic and, dare we say, more 'human': capable of holding onto context, learning from the past, and adapting on the fly.

Andrej Karpathy gave us the perfect analogy when he compared an LLM's context window to a computer's RAM and the model itself to the CPU. In this view, the context window is the agent's active consciousness, where all its "working thoughts" are held. But just like a laptop with too many browser tabs open, this RAM can fill up fast. Every message, every tool output, every piece of information consumes precious tokens.

[Figure: The LLM as a new kind of computer: the model is the CPU, the context window is the RAM, and peripherals include software 1.0 tools (calculator, Python interpreter, terminal), disk (file system plus embeddings), external storage (a vector database with episodic, semantic, and procedural memory), other LLMs, and audio/video I/O devices. Source: Andrej Karpathy, "Software Is Changing (Again)".]

This is where context engineering becomes an art. The goal isn't to shove more data into the prompt but to design systems that make the most of the active context window: keeping essential information within reach while gracefully offloading everything else into smarter, more persistent storage.

Context Offloading is the practice of storing information outside the LLM's active context window, often in external tools or vector databases. This frees up the limited token space so that only the most relevant info stays in context.

The Architecture of Agent Memory

Memory in an AI agent is all about retaining information to navigate changing tasks, remember what worked (or didn't), and think ahead. To build robust agents, we need to think in layers, often blending different types of memory for the best results.

Short-Term Memory
Short-term memory is the agent's immediate workspace. It's the "now," stuffed into the context window to fuel on-the-fly decisions and reasoning. This is powered by in-context learning, where you pack recent conversations, actions, or data directly into the prompt. Because it's constrained by the model's token limit, the main challenge is efficiency. The trick is to keep this streamlined to reduce costs and latency without missing any details that might be important for the next processing steps.

Long-Term Memory
Long-term memory moves past the immediate context window, storing information externally for quick retrieval when needed. This is what allows an agent to build a persistent understanding of its world and its users over time. It's commonly powered by Retrieval-Augmented Generation (RAG), where the agent queries an external knowledge base (like a vector database) to pull in relevant information. This memory can store different kinds of information: for example, episodic memory that stores specific events or past interactions, or semantic memory that holds general knowledge and facts. This could also be information from company documents, product manuals, or a curated domain-knowledge base, allowing the agent to answer questions with factual accuracy.
Hybrid Memory Setup

In reality, most modern systems use a hybrid approach, blending short-term memory for speed with long-term memory for depth. Some advanced architectures even introduce additional layers:

Working Memory: A temporary holding area for information related to a specific, multi-step task. For example, if an agent is booking a trip, its working memory might hold the destination, dates, and budget until the task is complete, without cluttering the long-term store.
Procedural Memory: This helps an agent learn and master routines. By observing successful workflows, the agent can internalize a sequence of steps for a recurring task, making it faster and more reliable over time.

[Figure: A flight-booking example. Short-term memory is the immediate reasoning space, bounded by the context limit: the query "Book me a flight to Tokyo in December" leads to a thought ("Need to check budget, dates, preferences"), a task-state step ("Store the task-specific state in the working memory"), retrieval ("User preferences, travel domain knowledge, booking routines, etc."), tool calls ("Flights API, Weather API, Calendar, etc."), a final thought ("Review the context & decide"), and a response ("Respond & then clear the state & update memory"). Four flows connect this space to storage:
1. Task State Storage: the agent sends specific, in-progress task details to a temporary scratchpad to keep the main context window from getting cluttered.
2. Task Context Recall: the agent pulls the task details back into its active reasoning space to continue a multi-step process.
3. Retrieval: the agent retrieves relevant knowledge to inform its current decision, like past travel preferences or travel domain knowledge (airlines, airport codes, visa rules, etc.).
4. Memory Storage: after an interaction or tool call, the agent saves important information, new user preferences, or successful outcomes and learned workflows to its permanent memory for future use.]

Working Memory is a temporary scratchpad or buffer. For the flight-booking task it might hold the task parameters (task_id "book_flight_001", task_type "travel_booking", task_status "in_progress"), the task context (destination Tokyo, origin San Francisco, departure 2025-12-15, return 2025-12-22, budget_max 1200, preferred_time morning, preferred_airlines JAL and ANA, and the tools available), intermediate results (12 flights found, top candidates such as JAL005 at 1150 and ANA106 at 1180), and next steps (compare_amenities, check_baggage_policy, confirm_selection).

Long-Term Memory is a persistent storage system that retains and recalls information across sessions: episodic memory (past events, interactions, and preferences), semantic memory (general and domain knowledge), and procedural memory (learned routines and decision workflows). Entries can be merged or pruned over time.

Key Principles for Effective Memory Management

Effective memory management can make or break an LLM agent. Poor memory practices lead to error propagation, where bad information gets retrieved and amplifies mistakes across future tasks.

Here are some of the starting principles for getting things right:

Prune and Refine Your Memories
Memory isn't a write-once system. It needs regular maintenance. Periodically scan your long-term storage to remove duplicate entries, merge related information, or discard outdated facts. A simple metric for this could be recency and retrieval frequency. If a memory is old and rarely accessed, it might be a candidate for deletion, especially in evolving environments where old information can become a liability.

For example, a customer support agent might automatically prune conversation logs that are over 90 days old and marked as resolved, closed, or no longer active in memory. It could retain just the summaries (for trend detection and analysis) rather than full word-for-word transcripts.
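As a rough illustration of that recency-and-frequency heuristic, here is a minimal pruning sketch. The Memory record and the 90-day and access-count thresholds are invented defaults, not a prescription.

```python
# Minimal sketch of memory pruning based on recency and retrieval frequency.
# The Memory record and the thresholds below are illustrative defaults.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Memory:
    text: str
    created_at: datetime
    last_accessed: datetime
    access_count: int
    status: str = "active"                  # e.g. "active", "resolved", "closed"

def prune(memories: list[Memory], max_age_days: int = 90,
          min_access_count: int = 2) -> list[Memory]:
    """Keep memories that are recent, frequently used, or still active."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    kept = []
    for m in memories:
        is_stale = m.last_accessed < cutoff and m.access_count < min_access_count
        is_closed_and_old = m.status in ("resolved", "closed") and m.created_at < cutoff
        if is_stale or is_closed_and_old:
            continue                        # candidate for deletion (or summarize first)
        kept.append(m)
    return kept
```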
Be Selective About What You Store
Not every interaction deserves a permanent spot in long-term storage. Implement some sort of filtering criteria to assess information for quality and relevance before saving it. A bad piece of retrieved information can often lead to context pollution, where the agent repeatedly makes the same mistakes. One way to prevent this is to have the LLM "reflect" on an interaction and assign an importance score before committing it to memory.

Master the Art of Retrieval
Effective memory is less about how much you can store and more about how well you can retrieve the right piece of information at the right time. A simple blind search is often not enough, so advanced techniques like reranking (using an LLM to re-order retrieved results for relevance) and iterative retrieval (refining or expanding a search query over multiple steps) can be used to improve the quality of retrieved information.

Tools like the Query Agent and Personalization Agent offer these capabilities out of the box, enabling searches across multiple collections and reranking based on user preferences and interaction history.

Tailor the Architecture to the Task
There is no one-size-fits-all memory solution. A customer service bot needs a strong episodic memory to recall user history, while an agent that analyzes financial reports needs a robust semantic memory filled with domain-specific knowledge. Always start with the simplest approach that works (like a basic conversational buffer with the last 'n' queries and responses) and gradually layer in more advanced mechanisms as the use case demands.

Ultimately, memory is what elevates LLM agents from simple responders to intelligent, context-aware systems. Effective memory isn't simply passive storage; it's an active, managed process. The goal is to build agents that don't just store memory, but can manage it: knowing what to remember, what to forget, and how to use the past to reason about the future.
Tools

If memory gives an agent a sense of self, then tools are what give it superpowers. By themselves, LLMs are brilliant conversationalists and text manipulators, but they live inside a bubble. They can't check the current weather, book a flight, or look up real-time stock prices. They are, by design, disconnected from the living, breathing world of data and action.

This is where tools come in. A "tool" is anything that connects an LLM agent to the outside world, allowing it to take direct "action" in the real world and retrieve information required to fulfill a task. Integrating tools elevates an agent from just being a knowledgeable consultant to something that can actually get things done.

Context engineering for tools isn't just giving an agent a list of APIs and instructions. It's about creating a cohesive workflow where the agent can understand what tools are available, decide correctly which one to use for a specific task, and interpret the results to move forward.

[Figure: Query → Thought → Action → Observation → Response, repeated until the goal is satisfied.]

The Evolution: From Prompts to Actions

The journey to modern tool use has been a rapid evolution. Initially, developers tried to get action out of LLMs with good old prompt engineering, tricking the model into generating text that looked like a command. It was clever but prone to errors.

The real breakthrough was function calling, also known as tool calling. This capability, now native to most models, allows an LLM to output structured JSON that contains the name of a function to call and the arguments to use.

With this, there are a bunch of possibilities:

A Simple Tool: A travel agent bot can use a search_flights tool, and when a user asks, "Find me a flight to Tokyo next Tuesday," the LLM doesn't guess the answer. It generates a call to the function you provided, which in turn queries a real airline API.
A Chain of Tools: For a complex request like "Plan a weekend trip to San Francisco for me," the agent might need to chain several tools together: find_flights, search_hotels, and get_local_events. This requires the agent to reason, plan, and execute a multi-step workflow.

The work of context engineering here is in how you present these tools. A well-written tool description is like a mini-prompt that guides the model, making it crystal clear what the tool does, what inputs it needs, and what it returns.
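To make the mechanics concrete, here is a minimal sketch of dispatching a function call emitted by a model. The search_flights tool, its arguments, and the bare JSON shape are illustrative; real providers wrap the call in their own response format.

```python
# Minimal sketch of dispatching a function/tool call emitted by an LLM.
# The model returns structured JSON naming a function and its arguments;
# the application looks the function up and executes it with those arguments.
import json

def search_flights(destination: str, date: str) -> list[dict]:
    """Illustrative stand-in for a real airline API call."""
    return [{"flight": "XX123", "destination": destination, "date": date}]

TOOLS = {"search_flights": search_flights}

# What the model might emit for "Find me a flight to Tokyo next Tuesday":
model_output = '{"name": "search_flights", "arguments": {"destination": "Tokyo", "date": "2025-12-16"}}'

call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])   # execute the requested tool
print(result)                                       # feed this back to the model as an observation
```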
The Orchestration Challenge

Giving an agent a tool is easy (mostly). Getting it to use that tool reliably, safely, and effectively is where the real work begins. The central task of context engineering is orchestration, i.e., managing the flow of information and decision-making as the agent reasons about which tool to use.

This involves a few key steps that happen in the context window. Let's break down these key orchestration steps using Glowe, a skincare domain knowledge app powered by our Elysia orchestration framework, as our running example.

Tool Discovery: The agent needs to know what tools it has at its disposal. This is usually done by providing a list of available tools and their descriptions in the system prompt. The quality of these descriptions is critical. They are the agent's only guide to understanding what each tool does, allowing the model to understand when to use a tool and, more importantly, when to avoid it.

In Glowe, we configure a set of specialized tools (Step 5) with precise descriptions when initializing every new chat tree.

Tool Selection and Planning (Thought): When faced with a user request, the agent must reason about whether a tool is needed. If so, which one? For complex tasks, it might even need to chain multiple tools together, forming a plan (e.g., "First, search the web for the weather; then, use the email tool to send a summary").

Here, the decision agent correctly analyzed the incoming request and selected the product_agent tool.

Argument Formulation (Action): Once a tool is selected, the agent must figure out what arguments to pass to it. If the tool is get_weather(city, date), the agent needs to extract "San Francisco" and "tomorrow" from the user's query and format them correctly. This could also be a structured request or API call with the necessary information to use the tool.

In this case, the product_agent required a text query for searching the products collection. Notice how the agent also corrected itself (self-healing) after generating an ill-formed argument that initially caused an error (another key piece of orchestration).
Reflection (Observation): After executing the tool, the output (the "observation") is
fed back into the context window. The agent then reflects on this output to decide its
next step. Was the tool successful? Did it produce the information needed to answer
the user's query? Or did it return an error that requires a different approach?
As you can see, orchestration happens through this powerful feedback loop, often called the Thought-Action-Observation cycle. The agent observes the outcome of its action and uses that new information to fuel its next "thought," deciding whether the task is complete, if it needs to use another tool, or if it should ask the user for clarification. This Thought-Action-Observation cycle forms the fundamental reasoning loop in modern agentic frameworks like Elysia.
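Here is a minimal sketch of that loop. It is not the Elysia implementation; `think`, `act`, and `respond` are hypothetical stand-ins for the LLM reasoning call, the tool dispatcher, and the final answer generation.

```python
# Minimal sketch of a Thought-Action-Observation loop.
# `think(context)` returns (thought, tool_name, args), with tool_name None when done;
# `act` executes the chosen tool; `respond` writes the final answer from the context.

def thought_action_observation(query: str, think, act, respond,
                               max_steps: int = 5) -> str:
    context = [f"User query: {query}"]
    for _ in range(max_steps):
        thought, tool, args = think(context)           # Thought: reason about the next step
        context.append(f"Thought: {thought}")
        if tool is None:                                # the model decided it can answer
            break
        observation = act(tool, args)                   # Action: execute the chosen tool
        context.append(f"Action: {tool}({args})")
        context.append(f"Observation: {observation}")   # Observation: result goes back in context
    return respond(context)                             # generate the final answer
```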
The Next Frontier of Tool Use

The evolution of tool use is moving more and more towards standardization. While function/tool calling works well, it creates a fragmented ecosystem where each AI application needs custom integrations with every external system. The Model Context Protocol (MCP), introduced by Anthropic in late 2024, addresses this by providing a universal standard for connecting AI applications to external data sources and tools. They call it "USB-C for AI": a single protocol that any MCP-compatible AI application can use to connect to any MCP server.

So, instead of building custom integrations for each tool, developers can just create individual MCP servers that expose their systems through this standardized interface. Any AI application that supports MCP can then easily connect to these servers using the JSON-RPC based protocol for client-server communication. This transforms the MxN integration problem (where M applications each need custom code for N tools) into a much simpler M+N problem.

[Figure: Traditional integration vs. the MCP approach. Traditionally, each model needs a custom integration with each data source (3 models x 3 sources such as a database, cloud storage, and Git = 9 total connections). With MCP, models and data sources each integrate once with MCP (3 + 3 = 6 total connections). Visual inspired by https://humanloop.com/blog/mcp]
This shift towards composable, standardized architectures, where frameworks enable
developers to build agents from modular, interoperable components, represents the
future of AI tooling. It changes the engineer's role from writing custom integrations to
orchestrating adaptive systems that can easily connect to any standardized external
system.
Summary

Context engineering is about more than just prompting large language models, building retrieval systems, or designing AI architectures. It's about building interconnected, dynamic systems that reliably work across a variety of uses and users. All the components described in this ebook will continue to evolve as new techniques, models, and discoveries are made, but the difference between truly functional systems and the AI apps that fail will be how well they engineer context across their entire architecture. We are no longer thinking in terms of just prompting a model; we're looking at how we architect entire context systems.

Context engineering is made up of the components described in this ebook:

Agents to act as the system's decision-making brain.
Query Augmentation to translate messy human requests into actionable intent.
Retrieval to connect the model to facts and knowledge bases.
Memory to give your system a sense of history and the power to learn.
Tools to give your application hands to interact with live data and APIs.

[Figure: Simple prompt engineering (a user, an LLM, and a single prompt in the context window) versus context engineering (an AI agent curating a context window from many possible sources: system prompt, documents, tools, memory files, message history, comprehensive instructions, domain knowledge, and a Weaviate vector database). Visual inspired by "Effective context engineering for AI agents", Anthropic.]

We are moving on from being prompters who talk to a model to becoming architects who build the world the model lives in. We, the builders, the engineers, and the creators, know the truth: the best AI systems aren't born from bigger models, but from better engineering.

We can't wait to see what you build.

Ready to build the next generation of AI applications? Start today with a 14-day free trial of Weaviate Cloud (WCD).