Context Engineering_ Sessions & Memory
Context Engineering: Sessions, Memory
Authors: Kimberly Milam and Antonio Gulli
Acknowledgements
Content contributors
Kaitlin Ardiff
Shangjie Chen
Yanfei Chen
Derek Egan
Hangfei Lin
Ivan Nardini
Anant Nawalgaria
Kanchana Patlolla
Huang Xia
Jun Yan
Bo Yang
Michael Zimmermann
Curators and editors
Anant Nawalgaria
Kanchana Patlolla
Designer
Michael Lanning
November 2025
Table of contents
Introduction
Context Engineering
Sessions
Variance across frameworks and models
Sessions for multi-agent systems
Interoperability across multiple agent frameworks
Production Considerations for Sessions
Managing long context conversation: tradeoffs and optimizations
Memory
Types of memory
Types of information
Organization patterns
Storage architectures
Creation mechanisms
Memory scope
Multimodal memory
Memory Generation: Extraction and Consolidation
Deep-dive: Memory Extraction
Deep-dive: Memory Consolidation
Memory Provenance
Accounting for memory lineage during memory management
Accounting for memory lineage during inference
Triggering memory generation
Memory-as-a-Tool
Background vs. Blocking Operations
Memory Retrieval
Timing for retrieval
Inference with Memories
Memories in the System Instructions
Memories in the Conversation History
Procedural memories
Testing and Evaluation
Production considerations for Memory
Privacy and security risks
Conclusion
Endnotes
Stateful and personal AI begins
with Context Engineering.
Introduction
This whitepaper explores the critical role of Sessions and Memory in building stateful,
intelligent LLM agents to empower developers to create more powerful, personalized, and
persistent AI experiences. To enable Large Language Models (LLMs) to remember, learn, and
personalize interactions, developers must dynamically assemble and manage information
within their context window—a process known as Context Engineering.
These core concepts are summarized in the whitepaper below:
• Context Engineering: The process of dynamically assembling and managing information
within an LLM's context window to enable stateful, intelligent agents.
• Sessions: The container for an entire conversation with an agent, holding the
chronological history of the dialogue and the agent's working memory.
• Memory: The mechanism for long-term persistence, capturing and consolidating key
information across multiple sessions to provide a continuous and personalized experience
for LLM agents.
Context Engineering
LLMs are inherently stateless. Outside of their training data, their reasoning and awareness
are confined to the information provided within the "context window" of a single API call.
This presents a fundamental problem, as AI agents must be equipped with operating
instructions identifying what actions can be taken, the evidential and factual data to reason
over, and the immediate conversational information that defines the current task. To build
stateful, intelligent agents that can remember, learn, and personalize interactions, developers
must construct this context for every turn of a conversation. This dynamic assembly and
management of information for an LLM is known as Context Engineering.
Context Engineering represents an evolution from traditional Prompt Engineering. Prompt
engineering focuses on crafting optimal, often static, system instructions. Conversely,
Context Engineering addresses the entire payload, dynamically constructing a state-aware
prompt based on the user, conversation history, and external data. It involves strategically
selecting, summarizing, and injecting different types of information to maximize relevance
while minimizing noise. External systems—such as RAG databases, session stores, and
memory managers—manage much of this context. The agent framework must orchestrate
these systems to retrieve and assemble context into the final prompt.
Think of Context Engineering as the mise en place for an agent—the crucial step where a
chef gathers and prepares all their ingredients before cooking. If you only give a chef the
recipe (the prompt), they might produce an okay meal with whatever random ingredients they
have. However, if you first ensure they have all the right, high-quality ingredients, specialized
tools, and a clear understanding of the presentation style, they can reliably produce an
excellent, customized result. The goal of context engineering is to ensure the model has no
more and no less than the most relevant information to complete its task.
Context Engineering governs the assembly of a complex payload that can include a variety
of components:
• Context to guide reasoning defines the agent’s fundamental reasoning patterns and
available actions, dictating its behavior:
• System Instructions: High-level directives defining the agent's persona, capabilities,
and constraints.
• Tool Definitions: Schemas for APIs or functions the agent can use to interact with the
outside world.
• Few-Shot Examples: Curated examples that guide the model's reasoning process via
in-context learning.
• Evidential & Factual Data is the substantive data the agent reasons over, including pre-
existing knowledge and dynamically retrieved information for the specific task; it serves as
the 'evidence' for the agent's response:
• Long-Term Memory: Persisted knowledge about the user or topic, gathered across
multiple sessions.
• External Knowledge: Information retrieved from databases or documents, often using
Retrieval-Augmented Generation (RAG) 1 .
• Tool Outputs: The data or results returned by a tool.
• Sub-Agent Outputs: The conclusions or results returned by specialized agents that
have been delegated a specific sub-task.
• Artifacts: Non-textual data (e.g., files, images) associated with the user or session.
• Immediate conversational information grounds the agent in the current interaction,
defining the immediate task:
• Conversation History: The turn-by-turn record of the current interaction.
• State / Scratchpad: Temporary, in-progress information or calculations the agent uses
for its immediate reasoning process.
• User's Prompt: The immediate query to be addressed.
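To make the grouping above concrete, the following is a minimal sketch of how these components could be gathered into one payload. The `ContextPayload` class, its field names, and the section labels it renders are hypothetical and chosen only for illustration; real frameworks organize this differently.

```python
from dataclasses import dataclass, field


@dataclass
class ContextPayload:
    """Hypothetical container for the components of one LLM request."""
    # Context to guide reasoning
    system_instructions: str
    tool_definitions: list[dict] = field(default_factory=list)
    few_shot_examples: list[str] = field(default_factory=list)
    # Evidential and factual data
    memories: list[str] = field(default_factory=list)
    external_knowledge: list[str] = field(default_factory=list)
    # Immediate conversational information
    history: list[dict] = field(default_factory=list)
    user_prompt: str = ""

    def render_system_text(self) -> str:
        """Join instructions, examples, memories, and documents into one block."""
        sections = [self.system_instructions]
        if self.few_shot_examples:
            sections.append("Examples:\n" + "\n".join(self.few_shot_examples))
        if self.memories:
            sections.append("Known about the user:\n" + "\n".join(self.memories))
        if self.external_knowledge:
            sections.append("Reference material:\n" + "\n".join(self.external_knowledge))
        return "\n\n".join(sections)

    def render_messages(self) -> list[dict]:
        """Return prior turns followed by the new user prompt."""
        return self.history + [{"role": "user", "parts": [{"text": self.user_prompt}]}]


payload = ContextPayload(
    system_instructions="You are a travel assistant.",
    memories=["Prefers window seats."],
    history=[{"role": "user", "parts": [{"text": "Hi"}]}],
    user_prompt="Book my usual seat.",
)
print(payload.render_system_text())
print(len(payload.render_messages()))
```

Keeping the three groups as separate fields makes it easy to refresh one of them (for example, the retrieved memories) on every turn without touching the others.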
The dynamic construction of context is critical. Memories, for instance, are not static; they
must be selectively retrieved and updated as the user interacts with the agent or new data
is ingested. Additionally, effective reasoning often relies on in-context learning 2 (a process
where the LLM learns how to perform tasks from demonstrations in the prompt). In-context
learning can be more effective when the agent uses few-shot examples that are relevant
to the current task, rather than relying on hardcoded ones. Similarly, external knowledge is
retrieved by RAG tools based on the user's immediate query.
One of the most critical challenges in building a context-aware agent is managing an
ever-growing conversation history. In theory, models with large context windows can
handle extensive transcripts; in practice, as the context grows, cost and latency increase.
Additionally, models can suffer from "context rot," a phenomenon where their ability to
pay attention to critical information diminishes as context grows. Context Engineering
directly addresses this by employing strategies to dynamically mutate the history—such
as summarization, selective pruning, or other compaction techniques—to preserve vital
information while managing the overall token count, ultimately leading to more robust and
personalized AI experiences.
This practice manifests as a continuous cycle within the agent's operational loop for each
turn of a conversation:
Figure 1. Flow of context management for agents
1. Fetch Context: The agent begins by retrieving context—such as user memories, RAG
documents, and recent conversation events. For dynamic context retrieval, the agent will
use the user query and other metadata to identify what information to retrieve.
2. Prepare Context: The agent framework dynamically constructs the full prompt for the
LLM call. Although individual API calls may be asynchronous, preparing the context is a
blocking, "hot-path" process. The agent cannot proceed until the context is ready.
3. Invoke LLM and Tools: The agent iteratively calls the LLM and any necessary tools
until a final response for the user is generated. Tool and model output is appended to
the context.
4. Upload Context: New information gathered during the turn is uploaded to persistent
storage. This is often a "background" process, allowing the agent to complete execution
while memory consolidation or other post-processing occurs asynchronously.
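The four steps above can be sketched as a single function. This is an illustrative outline only: `run_turn` and its three callable parameters are hypothetical names, and the stand-in lambdas replace what would be real retrieval, model, and storage calls.

```python
from typing import Callable


def run_turn(
    user_query: str,
    fetch_context: Callable[[str], list[str]],
    call_llm: Callable[[str], str],
    upload_context: Callable[[str, str], None],
) -> str:
    """Run one conversational turn through the four steps."""
    # 1. Fetch Context: retrieve memories, documents, and recent events.
    retrieved = fetch_context(user_query)
    # 2. Prepare Context: build the full prompt (blocking, on the hot path).
    prompt = "\n".join(retrieved + [f"User: {user_query}"])
    # 3. Invoke LLM and Tools: a single call here; real agents may loop.
    response = call_llm(prompt)
    # 4. Upload Context: persist new information. A production system would
    #    often schedule this in the background instead of calling it inline.
    upload_context(user_query, response)
    return response


# Stand-in implementations so the sketch runs without any external service.
stored: list[tuple[str, str]] = []
reply = run_turn(
    "What seat do I like?",
    fetch_context=lambda q: ["Memory: user prefers window seats."],
    call_llm=lambda prompt: f"(model reply to {len(prompt)} chars of context)",
    upload_context=lambda q, r: stored.append((q, r)),
)
```

Passing the three dependencies in as callables keeps the loop itself independent of any particular retrieval system, model provider, or storage backend.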
At the heart of this lifecycle are two fundamental components: sessions and memory. A
session manages the turn-by-turn state of a single conversation. Memory, in contrast,
provides the mechanism for long-term persistence, capturing and consolidating key
information across multiple sessions.
You can think of a session as the workbench or desk you're using for a specific project.
While you're working, it's covered in all the necessary tools, notes, and reference materials.
Everything is immediately accessible but also temporary and specific to the task at hand.
Once the project is finished, you don't just shove the entire messy desk into storage. Instead,
you begin the process of creating memory, which is like an organized filing cabinet. You
review the materials on the desk, discard the rough drafts and redundant notes, and file
away only the most critical, finalized documents into labeled folders. This ensures the filing
cabinet remains a clean, reliable, and efficient source of truth for all future projects, without
being cluttered by the transient chaos of the workbench. This analogy directly mirrors how
an effective agent operates: the session serves as the temporary workbench for a single
conversation, while the agent's memory is the meticulously organized filing cabinet, allowing
it to recall key information during future interactions.
Building on this high-level overview of context engineering, we can now explore two core
components: sessions and memory, beginning with sessions.
Sessions
A foundational element of Context Engineering is the session, which encapsulates the
immediate dialogue history and working memory for a single, continuous conversation.
Each session is a self-contained record that is tied to a specific user. The session allows
the agent to maintain context and provide coherent responses within the bounds of a
single conversation. A user can have multiple sessions, but each one functions as a distinct,
disconnected log of a specific interaction. Every session contains two key components: the
chronological history (events) and the agent's working memory (state).
Events are the building blocks of the conversation. Common types of events include: user
input (a message from the user, such as text, audio, or an image), agent response (the agent's
reply to the user), tool call (the agent's decision to use an external tool or API), and tool
output (the data returned from a tool call, which the agent uses to continue its reasoning).
Beyond the chat history, a Session often includes a state—a structured "working memory" or
scratchpad. This holds temporary, structured data relevant to the current conversation, like
what items are in a shopping cart.
As the conversation progresses, the agent will append additional events to the session.
Additionally, it may mutate the state based on logic in the agent.
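As a rough illustration of the two components, a session could be modeled as follows. The `Event` and `Session` classes here are hypothetical, simplified shapes and do not reproduce any specific framework's objects.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Event:
    """One entry in the conversation log (hypothetical shape)."""
    kind: str      # "user_input", "agent_response", "tool_call", or "tool_output"
    content: Any


@dataclass
class Session:
    """A single conversation: chronological events plus mutable working state."""
    session_id: str
    user_id: str
    events: list[Event] = field(default_factory=list)
    state: dict[str, Any] = field(default_factory=dict)

    def append(self, kind: str, content: Any) -> None:
        # Events are only ever appended, which preserves chronological order.
        self.events.append(Event(kind, content))


session = Session(session_id="s1", user_id="u1")
session.append("user_input", "Add a blue mug to my cart.")
session.append("tool_call", {"name": "add_to_cart", "args": {"item": "blue mug"}})
session.state["cart"] = ["blue mug"]   # state is mutated by agent logic
session.append("agent_response", "Added a blue mug to your cart.")
```

The event list grows by appending, while the state dictionary is overwritten in place, which mirrors the distinction between the history and the scratchpad described above.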
The structure of the events is analogous to the list of Content objects passed to the
Gemini API, where each item with a role and parts represents one turn—or one Event—in
the conversation.
Python
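from google import genai

# Assumes the google-genai SDK is installed and an API key is configured
# in the environment.
client = genai.Client()

# Each item is one turn (one Event) in the conversation.
contents = [
    {
        "role": "user",
        "parts": [{"text": "What is the capital of France?"}],
    },
    {
        "role": "model",
        "parts": [{"text": "The capital of France is Paris."}],
    },
]

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=contents,
)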
Snippet 1: Example multi-turn call to Gemini
A production agent's execution environment is typically stateless, meaning it retains no
information after a request is completed. Consequently, its conversation history must be
saved to persistent storage to maintain a continuous user experience. While in-memory
storage is suitable for development, production applications should leverage robust
databases to reliably store and manage sessions. For example, you can store conversation
history in managed solutions like Agent Engine Sessions 3 .
Variance across frameworks and models
While the core ideas are similar, different agent frameworks implement sessions, events, and
state in distinct ways. Agent frameworks are responsible for maintaining the conversation
history and state for LLMs, building LLM requests using this context, and parsing and storing
the LLM response.
Agent frameworks act as a universal translator between your code and an LLM. While you,
the developer, work with the framework's consistent, internal data structures for each
conversational turn, the framework handles the critical task of converting those structures
into the precise format the LLM requires. This abstraction is powerful because it decouples
your agent's logic from the specific LLM you're using, preventing vendor lock-in.
Figure 2: Flow of context management for agents
Ultimately, the goal is to produce a "request" that the LLM can understand. For Google's
Gemini models, this is a List[Content]. Each Content object is a simple dictionary-like
structure containing two keys: role which defines who is speaking ("user" or "model") and
parts which defines the actual content of the message (text, images, tool calls, etc.).
The framework automatically handles mapping the data from its internal object (e.g., an ADK
Event) to the corresponding role and parts in the Content object before making the API call.
In essence, the framework provides a stable, internal API for the developer, while managing
the complex and varied external APIs of the different LLMs behind the scenes.
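The mapping step described above might look like the following sketch. `InternalEvent` is a hypothetical stand-in for a framework's internal object (it is not ADK's actual `Event` class), and the rule that every non-user author maps to the "model" role is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class InternalEvent:
    """Hypothetical framework-internal event; not ADK's actual Event class."""
    author: str   # "user" or the name of an agent
    text: str


def to_contents(events: list[InternalEvent]) -> list[dict]:
    """Map internal events to role/parts dictionaries."""
    contents = []
    for event in events:
        # Anything not authored by the user is treated as model output.
        role = "user" if event.author == "user" else "model"
        contents.append({"role": role, "parts": [{"text": event.text}]})
    return contents


contents = to_contents([
    InternalEvent("user", "What is the capital of France?"),
    InternalEvent("travel_agent", "The capital of France is Paris."),
])
```

Supporting a different model provider would then mean writing another mapping function of this kind, while the agent logic that produces `InternalEvent` objects stays unchanged.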
ADK uses an explicit Session object that contains a list of Event objects and a separate
state object. The Session is like a filing cabinet, with one folder for the conversation history
(events) and another for working memory (state).
LangGraph doesn't have a formal "session" object. Instead, the state is the session. This all-
encompassing state object holds the conversation history (as a list of Message objects) and
all other working data. Unlike the append-only log of a traditional session, LangGraph's state
is mutable. It can be transformed, and strategies like history compaction can alter the record.
This is useful for managing long conversations and token limits.
Sessions for multi-agent systems
In a multi-agent system, multiple agents collaborate. Each agent focuses on a smaller,
specialized task. For these agents to work together effectively, they must share information.
As shown in the diagram below, the system's architecture defines the communication
patterns they use to share information. A central component of this architecture is how the
system handles session history—the persistent log of all interactions.
Figure 3: Different multi-agent architectural patterns 30
Before exploring the architectural patterns for managing this history, it's crucial to distinguish
it from the context sent to an LLM. Think of the session history as the permanent, unabridged
transcript of the entire conversation. The context, on the other hand, is the carefully crafted
information payload sent to the LLM for a single turn. An agent might construct this context
by selecting only a relevant excerpt from the history or by adding special formatting, like a
guiding preamble, to steer the model's response. This section focuses on what information is
passed across agents, not necessarily what context is sent to the LLM.
Agent frameworks handle session history for multi-agent systems using one of two primary
approaches: a shared, unified history where all agents contribute to a single log, or separate,
individual histories where each agent maintains its own perspective 4 . The choice between
these two patterns depends on the nature of the task and the desired collaboration style
between the agents.
For the shared, unified history model, all agents in the system read from and write all
events to the same, single conversation history. Every agent's message, tool call, and
observation is appended to one central log in chronological order. This approach is best for
tightly coupled, collaborative tasks requiring a single source of truth, such as a multi-step
problem-solving process where one agent's output is the direct input for the next. Even
with a shared history, a sub-agent might process the log before passing it to the LLM. For
instance, it could filter for a subset of relevant events or add labels to identify which agent
generated each event.
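A framework-independent sketch of the shared history model is shown below. The event dictionary keys (`author`, `kind`, `text`) and the helper names are hypothetical; the point is only that every agent appends to one log, and that an agent may filter and label that log before building its own LLM context.

```python
shared_log: list[dict] = []   # one central, chronological log for all agents


def record(author: str, kind: str, text: str) -> None:
    """Append an event to the shared history."""
    shared_log.append({"author": author, "kind": kind, "text": text})


def view_for_llm(log: list[dict], keep_kinds: set[str]) -> list[str]:
    """Filter the log to relevant event kinds and label each line by author."""
    return [f"[{e['author']}] {e['text']}" for e in log if e["kind"] in keep_kinds]


record("user", "message", "Plan a trip to Lisbon.")
record("researcher", "tool_output", "raw search results ...")
record("researcher", "message", "Lisbon is warm in May.")
record("planner", "message", "Suggest visiting in May.")

view = view_for_llm(shared_log, keep_kinds={"message"})
```

The shared log itself is never altered by `view_for_llm`; only the view passed to the model is reduced.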
If you use ADK’s LLM-driven delegation to hand off to sub-agents, all of the intermediary
events of the sub-agent would be written to the same session as the root agent 5 :
Python
from google.adk.agents import LlmAgent

# The sub-agent has access to the Session and writes events to it.
sub_agent_1 = LlmAgent(...)

# Optionally, the sub-agent can save the final response text (or structured
# output) to the specified state key.
sub_agent_2 = LlmAgent(
    ...,
    output_key="..."
)

# Parent agent.
root_agent = LlmAgent(
    ...,
    sub_agents=[sub_agent_1, sub_agent_2]
)
Snippet 2: LLM-driven delegation to sub-agents that share the root agent's session with ADK
In the separate, individual histories model, each agent maintains its own private
conversation history and functions like a black box to other agents. All internal processes—
such as intermediary thoughts, tool use, and reasoning steps—are kept within the agent's
private log and are not visible to others. Communication occurs only through explicit
messages, where an agent shares its final output, not its process.
This interaction is typically implemented either with Agent-as-a-Tool or with the
Agent-to-Agent (A2A) Protocol. With Agent-as-a-Tool, one agent invokes another as if it were
a standard tool, passing inputs and receiving a final, self-contained output 6 . With the
Agent-to-Agent (A2A) Protocol, agents use a structured protocol for direct messaging 7 .
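The Agent-as-a-Tool pattern with separate histories can be sketched in plain Python. Both classes below are hypothetical and contain no model calls; they only illustrate that intermediate steps stay in the callee's private log while the caller records nothing but the final output.

```python
class PrivateAgent:
    """Hypothetical agent that keeps its own history and exposes only a result."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._history: list[str] = []   # private log, never shared

    def run(self, task: str) -> str:
        # Intermediate steps are recorded privately.
        self._history.append(f"received task: {task}")
        self._history.append("thought: break the task into steps")
        result = f"{self.name} finished: {task}"
        self._history.append(f"final: {result}")
        return result   # only the final output crosses the boundary


class CallerAgent:
    """Hypothetical agent that invokes another agent as if it were a tool."""

    def __init__(self, tool: PrivateAgent) -> None:
        self.tool = tool
        self.history: list[str] = []

    def delegate(self, task: str) -> str:
        # The other agent is invoked like an ordinary function-style tool.
        output = self.tool.run(task)
        self.history.append(f"tool result: {output}")
        return output


researcher = PrivateAgent("researcher")
caller = CallerAgent(researcher)
answer = caller.delegate("summarize flight options")
```

After the call, the caller's history holds one entry, while the researcher's private log holds three.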
We’ll explore the A2A protocol in more detail in the next section.
Interoperability across multiple agent frameworks
Figure 4: A2A communication across multiple agents that use different frameworks
A framework's use of an internal data representation introduces a critical architectural
trade-off for multi-agent systems: the very abstraction that decouples an agent from an
LLM also isolates it from agents using other agent frameworks. This isolation is solidified
at the persistence layer. The storage model for a Session typically couples the database
schema directly to the framework's internal objects, creating a rigid, relatively non-portable
conversation record. Therefore, an agent built with LangGraph cannot natively interpret the
distinct Session and Event objects persisted by an ADK-based agent, making seamless task
handoffs impossible.
One emerging architectural pattern for coordinating collaboration
between these isolated agents is Agent-to-Agent (A2A) communication 8 . While this pattern
enables agents to exchange messages, it fails to address the core problem of sharing rich,
contextual state. Each agent's conversation history is encoded in its framework's internal
schema. As a result, any A2A message containing session events requires a translation layer
to be useful.
A more robust architectural pattern for interoperability involves abstracting shared
knowledge into a framework-agnostic data layer, such as Memory. Unlike a Session
store, which preserves raw, framework-specific objects like Events and Messages, a
memory layer is designed to hold processed, canonical information. Key information—like
summaries, extracted entities, and facts—is extracted from the conversation and is typically
stored as strings or dictionaries. The memory layer’s data structures are not coupled to
any single framework's internal data representation, which allows it to serve as a universal,
common data layer. This pattern allows heterogeneous agents to achieve true collaborative
intelligence by sharing a common cognitive resource without requiring custom translators.
Production Considerations for Sessions
When moving an agent to a production environment, its session management system must
evolve from a simple log to a robust, enterprise-grade service. The key considerations
fall into three critical areas: security and privacy, data integrity, and performance. A
managed session store, like Agent Engine Sessions, is specifically designed to address these
production requirements.
Security and Privacy
Protecting the sensitive information contained within a session is a non-negotiable
requirement. Strict Isolation is the most critical security principle. A session is owned by
a single user, and the system must enforce strict isolation to ensure one user can never
access another user's session data (e.g., via ACLs). Every request to the session store must be
authenticated and authorized against the session's owner.
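A minimal sketch of the ownership check is shown below. The `SessionStore` class is hypothetical and in-memory, and it assumes `requester_id` has already been authenticated by an upstream layer.

```python
class SessionStore:
    """Hypothetical in-memory store that enforces per-user ownership."""

    def __init__(self) -> None:
        self._sessions: dict[str, dict] = {}

    def create(self, session_id: str, owner_id: str) -> None:
        self._sessions[session_id] = {"owner_id": owner_id, "events": []}

    def get(self, session_id: str, requester_id: str) -> dict:
        session = self._sessions.get(session_id)
        # Treat "missing" and "not yours" identically so callers cannot
        # probe for the existence of other users' sessions.
        if session is None or session["owner_id"] != requester_id:
            raise PermissionError("session not found or access denied")
        return session


store = SessionStore()
store.create("s1", owner_id="alice")
own = store.get("s1", requester_id="alice")
try:
    store.get("s1", requester_id="bob")
    denied = False
except PermissionError:
    denied = True
```

Placing the check inside the store, rather than in each caller, means no code path can read a session without passing through it.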
A best practice for handling Personally Identifiable Information (PII) is to redact it before the
session data is ever written to storage. This is a fundamental security measure that drastically
reduces the risk and "blast radius" of a potential data breach. By using tools like
Model Armor 9 to ensure sensitive data is never persisted, you simplify compliance with
privacy regulations like GDPR and CCPA and build user trust.
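The redact-before-write ordering can be illustrated with a toy example. The regular expressions below are deliberately simplistic assumptions for demonstration and are not a substitute for a dedicated service such as Model Armor; the function names are hypothetical.

```python
import re

# Deliberately simple patterns for illustration; they will miss many formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text: str) -> str:
    """Replace matched emails and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def persist_event(storage: list[str], text: str) -> None:
    # Redaction happens before the write, so raw PII never reaches storage.
    storage.append(redact(text))


storage: list[str] = []
persist_event(storage, "Reach me at jane@example.com or 555-123-4567.")
```

The important design choice is the position of `redact` in the write path, not the quality of the patterns themselves.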
Data Integrity and Lifecycle Management
A production system requires clear rules for how session data is stored and maintained
over time. Sessions should not live forever. You can implement a Time-to-Live (TTL) policy
to automatically delete inactive sessions, which manages storage costs and reduces data
management overhead. This requires a clear data retention policy that defines how long
sessions should be kept before being archived or permanently deleted.
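A TTL policy reduces to comparing each session's last-activity timestamp against a cutoff. The sketch below is a hypothetical in-memory version with an assumed 30-day TTL; many databases offer native expiry features that would replace this logic in practice.

```python
from datetime import datetime, timedelta, timezone


def expire_sessions(
    sessions: dict[str, datetime], ttl: timedelta, now: datetime
) -> list[str]:
    """Delete sessions whose last activity is older than the TTL.

    `sessions` maps a session ID to its last-activity timestamp.
    Returns the IDs that were removed.
    """
    expired = [sid for sid, last in sessions.items() if now - last > ttl]
    for sid in expired:
        del sessions[sid]
    return expired


now = datetime(2025, 11, 1, tzinfo=timezone.utc)
sessions = {
    "active": now - timedelta(days=2),
    "stale": now - timedelta(days=45),
}
removed = expire_sessions(sessions, ttl=timedelta(days=30), now=now)
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the policy easy to test.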
Additionally, the system must guarantee that operations are appended to the session history
in a deterministic order. Maintaining the correct chronological sequence of events is
fundamental to the integrity of the conversation log.
Performance and Scalability
Session data is on the "hot path" of every user interaction, making its performance a
primary concern. Reading and writing the session history must be extremely fast to ensure
a responsive user experience. Agent runtimes are typically stateless, so the entire session
history is retrieved from a central database at the start of every turn, incurring network
transfer latency.
To mitigate latency, it is crucial to reduce the size of the data transferred. A key optimization
is to filter or compact the session history before sending it to the agent. For example, you
can remove old, irrelevant function call outputs that are no longer needed for the current
state of the conversation. The following section details several strategies for compacting
history to effectively manage long-context conversations.
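As one possible form of the filtering described above, the sketch below drops all but the most recent tool outputs. The event shape and the `drop_stale_tool_outputs` helper are hypothetical. Note that some model APIs expect each tool call to stay paired with its output, so a real filter may need to remove or rewrite the matching call as well.

```python
def drop_stale_tool_outputs(events: list[dict], keep_recent: int) -> list[dict]:
    """Keep all non-tool events, but only the last `keep_recent` tool outputs."""
    tool_indices = [i for i, e in enumerate(events) if e["kind"] == "tool_output"]
    # Guard against keep_recent == 0, because a [-0:] slice would keep everything.
    keep = set(tool_indices[-keep_recent:]) if keep_recent > 0 else set()
    return [
        e for i, e in enumerate(events)
        if e["kind"] != "tool_output" or i in keep
    ]


events = [
    {"kind": "user_input", "text": "Weather in Paris?"},
    {"kind": "tool_output", "text": "Paris forecast payload"},
    {"kind": "agent_response", "text": "Sunny in Paris."},
    {"kind": "user_input", "text": "And in Rome?"},
    {"kind": "tool_output", "text": "Rome forecast payload"},
    {"kind": "agent_response", "text": "Rainy in Rome."},
]
filtered = drop_stale_tool_outputs(events, keep_recent=1)
```

The function returns a new list, so the stored session history is left intact.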
Managing long context conversation: tradeoffs
and optimizations
In a simplistic architecture, a session is an immutable log of the conversation between
the user and agent. However, as the conversation scales, the conversation’s token usage
increases. Modern LLMs can handle long contexts, but limitations exist, especially for
latency-sensitive applications 10 :
1. Context Window Limits: Every LLM has a maximum amount of text (context window) it
can process at once. If the conversation history exceeds this limit, the API call will fail.
2. API Costs ($): Most LLM providers charge based on the number of tokens you send and
receive. Shorter histories mean fewer tokens and lower costs per turn.
3. Latency (Speed): Sending more text to the model takes longer to process, resulting
in a slower response time for the user. Compaction keeps the agent feeling quick
and responsive.
4. Quality: As the number of tokens increases, performance can get worse due to additional
noise in the context and autoregressive errors.
Managing a long conversation with an agent can be compared to a savvy traveler packing a
suitcase for a long trip. The suitcase represents the agent’s limited context window, and the
clothes and items are the pieces of information from the conversation. If you simply try to
stuff everything in, the suitcase becomes too heavy and disorganized, making it difficult to
find what you need quickly—like how an overloaded context window increases processing
costs and slows down response times. On the other hand, if you pack too little, you risk
leaving behind essential items like a passport or a warm coat, compromising the entire trip—
like how an agent could lose critical context, leading to irrelevant or incorrect answers. Both
the traveler and the agent operate under a similar constraint: success hinges not on how
much you can carry, but on carrying only what you need.
Compaction strategies shrink long conversation histories, condensing dialogue to fit
within the model's context window, reducing API costs and latency. As a conversation gets
longer, the history sent to the model with each turn can become too large. Compaction
strategies solve this by intelligently trimming the history while trying to preserve the most
important context.
So, how do you know what content to throw out of a Session without losing valuable
information? Strategies range from simple truncation to sophisticated compaction:
• Keep the last N turns: This is the simplest strategy. The agent only keeps the most recent
N turns of the conversation (a “sliding window”) and discards everything older.
• Token-Based Truncation: Before sending the history to the model, the agent counts the
tokens in the messages, starting with the most recent and working backward. It includes
as many messages as possible without exceeding a predefined token limit (e.g., 4000
tokens). Everything older is simply cut off.
• Recursive Summarization: Older parts of the conversation are replaced by an AI-
generated summary. As the conversation grows, the agent periodically uses another LLM
call to summarize the oldest messages. This summary is then used as a condensed form
of the history, often prefixed to the more recent, verbatim messages.
For example, you can keep the last N turns with ADK by using a built-in plug-in for your ADK
app to limit the context sent to the model. This does not modify the historical events stored in
your session storage:
Python
from google.adk.apps import App
from google.adk.plugins.context_filter_plugin import ContextFilterPlugin

# `agent` is assumed to be an agent object defined elsewhere.
app = App(
    name='hello_world_app',
    root_agent=agent,
    plugins=[
        # Keep the last 10 turns and the most recent user query.
        ContextFilterPlugin(num_invocations_to_keep=10),
    ],
)
Snippet 3: Session truncation to only use the last N turns with ADK
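The token-based truncation strategy described earlier can also be sketched without any framework. The `approx_tokens` function is a crude assumption (about four characters per token) standing in for a real tokenizer, and both function names are hypothetical.

```python
def approx_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: about four characters per token."""
    return max(1, len(text) // 4)


def truncate_by_tokens(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the newest messages that fit within the token budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):      # newest first
        cost = approx_tokens(message)
        if used + cost > max_tokens:
            break                           # everything older is cut off
        kept.append(message)
        used += cost
    kept.reverse()                          # restore chronological order
    return kept


messages = ["a" * 40, "b" * 40, "c" * 40]   # 10 approximate tokens each
window = truncate_by_tokens(messages, max_tokens=25)
```

With a budget of 25 approximate tokens, only the two newest messages fit, so the oldest one is dropped.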
Given that sophisticated compaction strategies aim to reduce cost and latency, it is
critical to perform expensive operations (like recursive summarization) asynchronously
in the background and persist the results. “In the background” ensures the client is not
kept waiting, and “persistence” ensures that expensive computations are not excessively
repeated. Frequently, the agent's memory manager is responsible for both generating and
persisting these recursive summaries. The agent must also keep a record of which events are
included in the compacted summary; this prevents the original, more verbose events from
being needlessly sent to the LLM.
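One way to keep that record is an index marking how much of the log the summary already covers. The sketch below is a hypothetical design: `CompactedHistory` takes an injected `summarize` callable (in practice a wrapper around an LLM call, here a trivial stand-in) and runs synchronously for simplicity, whereas a production system would run compaction in the background.

```python
from typing import Callable


class CompactedHistory:
    """Hypothetical history that folds old messages into a running summary."""

    def __init__(self, summarize: Callable[[str, list[str]], str], keep_recent: int):
        self.summarize = summarize      # e.g. a wrapper around an LLM call
        self.keep_recent = keep_recent
        self.messages: list[str] = []   # full verbatim log, never modified
        self.summary = ""
        self.summarized_upto = 0        # messages[:summarized_upto] are covered

    def add(self, message: str) -> None:
        self.messages.append(message)

    def compact(self) -> None:
        """Fold everything except the most recent messages into the summary."""
        cutoff = max(len(self.messages) - self.keep_recent, self.summarized_upto)
        pending = self.messages[self.summarized_upto:cutoff]
        if pending:
            # Recursive step: the previous summary is an input to the new one.
            self.summary = self.summarize(self.summary, pending)
            self.summarized_upto = cutoff

    def context(self) -> list[str]:
        """Summary (if any) followed by only the messages it does not cover."""
        prefix = [f"Summary so far: {self.summary}"] if self.summary else []
        return prefix + self.messages[self.summarized_upto:]


# Stand-in summarizer so the sketch runs offline; a real one would call an LLM.
def fake_summarize(previous: str, new_messages: list[str]) -> str:
    return (previous + " " + " ".join(m[:6] for m in new_messages)).strip()


history = CompactedHistory(fake_summarize, keep_recent=2)
for i in range(5):
    history.add(f"turn-{i} details")
history.compact()
```

Because `summarized_upto` is persisted alongside the summary, a second call to `compact` with no new messages does no work, and `context` never resends messages that the summary already covers.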
Additionally, the agent must decide when compaction is necessary. The trigger mechanism
generally falls into a few distinct categories:
• Count-Based Triggers (i.e., token size or turn count threshold): The conversation is
compacted once it exceeds a predefined threshold. This approach
is often “good enough” for managing context length.
• Time-Based Triggers: Compaction is triggered not by the size of the conversation, but
by a lack of activity. If a user stops interacting for a set period (e.g., 15 or 30 minutes), the
system can run a compaction job in the background.
• Event-Based Triggers (i.e. Semantic/Task Completion): The agent decides to trigger
compaction when it detects that a specific task, sub-goal, or topic of conversation
has concluded.
For example, you can use ADK’s EventsCompactionConfig to trigger LLM-based
summarization after a configured number of turns:
Python
from google.adk.apps import App
from google.adk.apps.app import EventsCompactionConfig

# `agent` is assumed to be an agent object defined elsewhere.
app = App(
    name='hello_world_app',
    root_agent=agent,
    events_compaction_config=EventsCompactionConfig(
        compaction_interval=5,
        overlap_size=1,
    ),
)
Snippet 4: Session compaction using summarization with ADK
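Independently of ADK, the three trigger categories described above can be combined into one decision function. This is a hypothetical sketch: the function name, the default thresholds (20 turns, 30 minutes), and the order in which triggers are checked are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_compact(
    turn_count: int,
    last_activity: datetime,
    now: datetime,
    max_turns: int = 20,
    idle_after: timedelta = timedelta(minutes=30),
    task_completed: bool = False,
) -> Optional[str]:
    """Return the name of the trigger that fired, or None."""
    if turn_count > max_turns:
        return "count"       # count-based: size threshold exceeded
    if now - last_activity >= idle_after:
        return "time"        # time-based: the user has gone idle
    if task_completed:
        return "event"       # event-based; detection is up to the agent
    return None


now = datetime(2025, 11, 1, 12, 0, tzinfo=timezone.utc)
recent = now - timedelta(minutes=1)
by_count = should_compact(turn_count=25, last_activity=recent, now=now)
by_time = should_compact(turn_count=3, last_activity=now - timedelta(hours=1), now=now)
by_event = should_compact(turn_count=3, last_activity=recent, now=now, task_completed=True)
no_trigger = should_compact(turn_count=3, last_activity=recent, now=now)
```

Returning the trigger name rather than a boolean makes it possible to log why a compaction job ran.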
Memory generation is the broad capability of extracting persistent knowledge from a
verbose and noisy data source. In this section, we covered a primary example of extracting
information from conversation history: session compaction. Compaction distills the verbatim
transcript of an entire conversation, extracting key facts and summaries while discarding
conversational filler.
Building on compaction, the next section will explore memory generation and management
more broadly. We will discuss the various ways memories can be created, stored, and
retrieved to build an agent's long-term knowledge.
Memory
Memory and Sessions share a deeply symbiotic relationship: sessions are the primary data
source for generating memories, and memories are a key strategy for managing the size of
a session. A memory is a snapshot of extracted, meaningful information from a conversation
or data source. It’s a condensed representation that preserves important context, making it
useful for future interactions. Generally, memories are persisted across sessions to provide a
continuous and personalized experience.
As a specialized, decoupled service, a “memory manager” provides the foundation for multi-
agent interoperability. Memory managers frequently use framework-agnostic data structures,
like simple strings and dictionaries. This allows agents built on different frameworks to
connect to a single memory store, enabling the creation of a shared knowledge base that any
connected agent can utilize.
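To illustrate why plain data structures help here, the sketch below stores memories as ordinary dictionaries keyed by user. The `MemoryStore` class, its method names, and the `fact` and `source` keys are hypothetical, and the keyword search is a naive placeholder for whatever retrieval a real memory manager provides.

```python
import json


class MemoryStore:
    """Hypothetical shared store holding memories as plain dictionaries."""

    def __init__(self) -> None:
        self._by_user: dict[str, list[dict]] = {}

    def add(self, user_id: str, fact: str, source: str) -> None:
        self._by_user.setdefault(user_id, []).append({"fact": fact, "source": source})

    def search(self, user_id: str, keyword: str) -> list[dict]:
        """Naive keyword match over one user's memories."""
        return [
            m for m in self._by_user.get(user_id, [])
            if keyword.lower() in m["fact"].lower()
        ]

    def export(self, user_id: str) -> str:
        # Plain dicts serialize directly, with no framework-specific classes.
        return json.dumps(self._by_user.get(user_id, []))


store = MemoryStore()
# Two agents, possibly built on different frameworks, write to the same store.
store.add("u1", "Prefers window seats.", source="booking_agent")
store.add("u1", "Supports the local football club.", source="chat_agent")
# Any connected agent can read what another agent wrote.
hits = store.search("u1", "window")
```

Since each memory is only strings inside a dictionary, any agent that can read JSON can consume it, regardless of how that agent represents its own sessions.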
Note: some frameworks may also refer to Sessions or verbatim conversation as “short-term
memory.” For this whitepaper, memories are defined as extracted information, not the raw
dialogue of turn-by-turn conversation.
Storing and retrieving memories is crucial for building sophisticated and intelligent agents. A
robust memory system transforms a basic chatbot into a truly intelligent agent by unlocking
several key capabilities:
• Personalization: The most common use case is to remember user preferences, facts,
and past interactions to tailor future responses. For example, remembering a user's
favorite sports team or their preferred seat on an airplane creates a more helpful and
personal experience.
• Context Window Management: As conversations become longer, the full history can
exceed an LLM's context window. Memory systems can compact this history by creating
summaries or extracting key facts, preserving context without sending thousands of
tokens in every turn. This reduces both cost and latency.
• Data Mining and Insight: By analyzing stored memories across many users (in an
aggregated, privacy-preserving way), you can extract insights from the noise. For
example, a retail chatbot might identify that many users are asking about the return policy
for a specific product, flagging a potential issue.
• Agent Self-Improvement and Adaptation: The agent learns from previous runs by
creating procedural memories about its own performance—recording which strategies,
tools, or reasoning paths led to successful outcomes. This enables the agent to build
a playbook of effective solutions, allowing it to adapt and improve its problem-solving
over time.
Creating, storing, and utilizing memory in an AI system is a collaborative process. Each
component in the stack—from the end-user to the developer's code—has a distinct role
to play.
1. The User: Provides the raw source data for memories. In some systems, users may
provide memories directly (e.g., via a form).
2. The Agent (Developer Logic): Decides what to remember and when, and orchestrates
calls to the memory manager. In simple architectures, the developer can implement the
logic so that memory is *always* retrieved and memory generation is *always* triggered.
In more advanced architectures, the developer may implement memory-as-a-tool, where
the agent (via the LLM) decides when memory should be retrieved or generated.
3. The Agent Framework (e.g., ADK, LangGraph): Provides the structure and tools for
memory interaction. The framework acts as the plumbing. It defines how the developer's
logic can access conversation history and interact with the memory manager, but
it doesn't manage the long-term storage itself. It also defines how to stuff retrieved
memories into the context window.
4. The Session Storage (e.g., Agent Engine Sessions, Spanner, Redis): Stores the
turn-by-turn conversation of the Session. The raw dialogue will be ingested into the memory
manager in order to generate memories.
5. The Memory Manager (e.g. Agent Engine Memory Bank, Mem0, Zep): Handles the
storage, retrieval, and compaction of memories. The mechanisms to store and retrieve
memories depend on what provider is used. This is the specialized service or component
that takes the potential memory identified by the agent and handles its entire lifecycle.
• Extraction distills the key information from the source data.
• Consolidation curates memories to merge duplicative entities.
• Storage persists the memory to persistent databases.
• Retrieval fetches relevant memories to provide context for new interactions.
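The four lifecycle steps handled by the memory manager can be sketched as a small in-memory class. This is an illustrative sketch only, not the API of any product named above: `ToyMemoryManager`, its method names, and the keyword test that stands in for an LLM-based extractor are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field


def _words(text: str) -> set[str]:
    # Lowercased word set, ignoring punctuation.
    return set(re.findall(r"\w+", text.lower()))


@dataclass
class ToyMemoryManager:
    """Hypothetical manager: extraction, consolidation, storage, retrieval."""

    # user_id -> memories (stand-in for a persistent database)
    store: dict[str, list[str]] = field(default_factory=dict)

    def extract(self, transcript: list[str]) -> list[str]:
        # Stand-in for an LLM call: keep only turns that state a preference.
        return [turn for turn in transcript if "prefer" in turn.lower()]

    def consolidate(self, user_id: str, candidates: list[str]) -> list[str]:
        # Drop candidates that duplicate an existing memory (case-insensitive).
        seen = {m.lower() for m in self.store.get(user_id, [])}
        fresh = []
        for candidate in candidates:
            if candidate.lower() not in seen:
                seen.add(candidate.lower())
                fresh.append(candidate)
        return fresh

    def generate(self, user_id: str, transcript: list[str]) -> None:
        # Extraction -> consolidation -> storage.
        new_memories = self.consolidate(user_id, self.extract(transcript))
        self.store.setdefault(user_id, []).extend(new_memories)

    def retrieve(self, user_id: str, query: str) -> list[str]:
        # Naive word overlap; a real manager would use similarity search.
        query_words = _words(query)
        return [m for m in self.store.get(user_id, []) if query_words & _words(m)]


manager = ToyMemoryManager()
session = ["Hi there!", "I prefer a window seat.", "Thanks, bye."]
manager.generate("user-123", session)
manager.generate("user-123", session)  # a second run adds no duplicates
print(manager.retrieve("user-123", "Which seat?"))
```

In this sketch the agent code only calls `generate` and `retrieve`; everything else stays inside the manager, which mirrors the division of responsibilities described in the list.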
Figure 5: The flow of information between sessions, memory, and external knowledge
The division of responsibilities ensures that the developer can focus on the agent's unique
logic without having to build the complex underlying infrastructure for memory persistence
and management. It is important to recognize that a memory manager is an active system,
not just a passive vector database. While it uses similarity search for retrieval, its core value
lies in its ability to intelligently extract, consolidate, and curate memories over time. Managed
memory services, like Agent Engine Memory Bank, handle the entire lifecycle of memory
generation and storage, freeing you to focus on your agent's core logic.
This retrieval capability is also why memory is frequently compared to another key
architectural pattern: Retrieval-Augmented Generation (RAG). However, they are built on
different architectural principles, as RAG handles static, external data while Memory curates
dynamic, user-specific context. They fulfill two distinct and complementary roles: RAG makes
an agent an expert on facts, while memory makes it an expert on the user. The following
chart breaks down their high-level differences:
RAG Engines vs. Memory Managers

Primary goal
• RAG engines: To inject external, factual knowledge into the context.
• Memory managers: To create a personalized and stateful experience. The agent remembers
facts, adapts to the user over time, and maintains long-running context.

Data source
• RAG engines: A static, pre-indexed external knowledge base (e.g., PDFs, wikis,
documents, APIs).
• Memory managers: The dialogue between the user and agent.

Isolation level
• RAG engines: Generally shared. The knowledge base is typically a global, read-only
resource accessible by all users to ensure consistent, factual answers.
• Memory managers: Highly isolated. Memory is almost always scoped per-user to prevent
data leaks.

Information type
• RAG engines: Static, factual, and authoritative. Often contains domain-specific data,
product details, or technical documentation.
• Memory managers: Dynamic and (generally) user-specific. Memories are derived from
conversation, so there’s an inherent level of uncertainty.

Write patterns
• RAG engines: Batch processing. Triggered via an offline, administrative action.
• Memory managers: Event-based processing. Triggered at some cadence (e.g., every turn
or at the end of a session) or via memory-as-a-tool (the agent decides to generate memories).

Read patterns
• RAG engines: RAG data is almost always retrieved “as-a-tool”. It’s retrieved when the
agent decides that the user’s query requires external information.
• Memory managers: There are two common read patterns: (1) memory-as-a-tool, where
memory is retrieved when the user’s query requires additional information about the user
(or some other identity); and (2) static retrieval, where memory is always retrieved at the
start of each turn.

Data format
• RAG engines: A natural-language “chunk”.
• Memory managers: A natural-language snippet or a structured profile.

Data preparation
• RAG engines: Chunking and indexing. Source documents are broken into smaller chunks,
which are then converted to embeddings and stored for fast lookup.
• Memory managers: Extraction and consolidation. Extract key details from the
conversation, ensuring content is not duplicative or contradictory.
Table 1: Comparison of RAG engines and memory managers
A helpful way to understand the difference is to think of RAG as the agent's research librarian
and a memory manager as its personal assistant.
The research librarian (RAG) works in a vast public library filled with encyclopedias,
textbooks, and official documents. When the agent needs an established fact—like a
product's technical specifications or a historical date—it consults the librarian. The librarian
retrieves information from this static, shared, and authoritative knowledge base to provide
consistent, factual answers. The librarian is an expert on the world's facts, but they don't
know anything personal about the user asking the question.
In contrast, the personal assistant (memory) follows the agent and carries a private
notebook, recording the details of every interaction with a specific user. This notebook
is dynamic and highly isolated, containing personal preferences, past conversations, and
evolving goals. When the agent needs to recall a user's favorite sports team or the context
of last week's project discussion, it turns to the assistant. The assistant's expertise is not in
global facts, but in the user themselves.
Ultimately, a truly intelligent agent needs both. RAG provides it with expert knowledge of the
world, while memory provides it with an expert understanding of the user it's serving.
The next section deconstructs the concept of memory by examining its core components:
the types of information it stores, the patterns for its organization, the mechanisms for its
storage and creation, the strategic definition of its scope, and its handling of multimodal
versus textual data.
Types of memory
An agent's memory can be categorized by how the information is stored and how it was
captured. These different types of memory work together to create a rich, contextual
understanding of a user and their needs. Across all types of memories, the rule stands that
memories are descriptive, not predictive.
A “memory” is an atomic piece of context that is returned by the memory manager and
used by the agent as context. While the exact schema can vary, a single memory generally
consists of two main components: content and metadata.
Content is the substance of the memory that was extracted from the source data (i.e. the
raw dialogue of the session). Crucially, the content is designed to be framework-agnostic,
using simple data structures that any agent can easily ingest. The content can either be
structured or unstructured data. Structured memories include information typically
stored in universal formats like a dictionary or JSON. Their schema is typically defined by the
developer, not a specific framework. For example, {"seat_preference": "Window"}.
Unstructured memories are natural language descriptions that capture the essence of a
longer interaction, event, or topic. For example, “The user prefers a window seat.”
Metadata provides context about the memory, typically stored as a simple string. This can
include a unique identifier for the memory, identifiers for the “owner” of the memory, and
labels describing the content or data source of the memory.
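As a minimal sketch of this two-part structure, a memory could be modeled as a small dataclass. The class name `Memory` and the field names `memory_id`, `owner_id`, and `labels` are hypothetical, chosen only to mirror the content and metadata described above; real memory managers define their own schemas.

```python
from dataclasses import dataclass, field
from typing import Any, Union


@dataclass
class Memory:
    # Content: a natural-language string (unstructured) or a
    # JSON-like dictionary (structured).
    content: Union[str, dict[str, Any]]
    # Metadata: simple string fields that describe the memory.
    memory_id: str
    owner_id: str
    labels: list[str] = field(default_factory=list)

    @property
    def is_structured(self) -> bool:
        return isinstance(self.content, dict)


structured = Memory(
    content={"seat_preference": "Window"},
    memory_id="m-1",
    owner_id="user-123",
    labels=["travel"],
)
unstructured = Memory(
    content="The user prefers a window seat.",
    memory_id="m-2",
    owner_id="user-123",
)
print(structured.is_structured, unstructured.is_structured)
```

Because both variants use only strings, lists, and dictionaries, either one can be serialized to JSON and handed to an agent built on any framework.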
Types of information
Beyond their basic structure, memories can be classified by the fundamental type of
knowledge they represent. This distinction, crucial for understanding how an agent
uses memories, separates memory into two primary functional categories derived from
cognitive science 11 : declarative memories (“knowing what”) and procedural memories
(“knowing how”).
Declarative memory is the agent's knowledge of facts, figures, and events. It's all the
information that the agent can explicitly state or "declare." If the memory is an answer to a
"what" question, it's declarative. This category encompasses both general world knowledge
(Semantic) and specific user facts (Entity/Episodic).
Procedural memory is the agent's knowledge of skills and workflows. It guides the
agent's actions by demonstrating implicitly how to perform a task correctly. If the
memory helps answer a "how" question—like the correct sequence of tool calls to book a
trip—it's procedural.
Organization patterns
Once a memory is created, the next question is how to organize it. Memory managers
typically employ one or more of the following patterns to organize memories: Collections 12 ,
Structured User Profile, or “Rolling Summary”. The patterns define how individual
memories relate to each other and to the user.
The collections 13 pattern organizes content into multiple self-contained, natural language
memories for a single user. Each memory is a distinct event, summary, or observation,
although there may be multiple memories in the collection for a single high-level topic.
Collections allow for storing and searching through a larger, less structured pool of
information related to specific goals or topics.
The structured user profile pattern organizes memories as a set of core facts about a user,
like a contact card that is continuously updated with new, stable information. It’s designed for
quick lookups of essential, factual information like names, preferences, and account details.
Unlike a structured user profile, the “rolling” summary pattern consolidates all information
into a single, evolving memory that represents a natural-language summary of the entire
user-agent relationship. Instead of creating new, individual memories, the manager
continuously updates this one master document. This pattern is frequently used to compact
long Sessions, preserving vital information while managing the overall token count.
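The three organization patterns differ mainly in the shape of what is stored per user. The sketch below contrasts them using plain Python containers. All function names are hypothetical, and the rolling-summary update simply appends text where a real memory manager would ask an LLM to rewrite the summary.

```python
# Collections: many self-contained, natural-language memories.
collection: list[str] = []


def add_to_collection(memory: str) -> None:
    collection.append(memory)


# Structured user profile: core facts keyed by name; new values overwrite old ones.
profile: dict[str, str] = {}


def update_profile(key: str, value: str) -> None:
    profile[key] = value


# Rolling summary: a single evolving document.
def update_rolling_summary(summary: str, new_info: str) -> str:
    # Stand-in for an LLM rewrite of the whole summary.
    return f"{summary} {new_info}".strip()


add_to_collection("The user asked about flights to Paris.")
add_to_collection("The user mentioned traveling with a child.")

update_profile("seat_preference", "window")
update_profile("seat_preference", "aisle")  # replaces the earlier value

summary = ""
summary = update_rolling_summary(summary, "The user is planning a trip to Paris.")
summary = update_rolling_summary(summary, "They prefer an aisle seat.")
print(len(collection), profile, summary)
```

The collection grows with every observation, the profile stays the same size as facts are overwritten, and the rolling summary remains a single document whose length the manager has to keep in check.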
Storage architectures
Additionally, the storage architecture is a critical decision that determines how quickly and
intelligently an agent can retrieve memories. The choice of architecture defines whether the
agent excels at finding conceptually similar ideas, understanding structured relationships,
or both.
Memories are generally stored in vector databases and/or knowledge graphs. Vector
databases help find memories that are conceptually similar to the query. Knowledge graphs
store memories as a network of entities and their relationships.
Vector databases are the most common approach, enabling retrieval based on semantic
similarity rather than exact keywords. Memories are converted into embedding vectors,
and the database finds the closest conceptual matches to a user's query. This excels at
retrieving unstructured, natural language memories where context and meaning are key (i.e.
“atomic facts” 14 ).
Knowledge graphs are used to store memories as a network of entities (nodes) and their
relationships (edges). Retrieval involves traversing this graph to find direct and indirect
connections, allowing the agent to reason about how different facts are linked. It is ideal for
structured, relational queries and understanding complex connections within the data (i.e.
“knowledge triples” 15 ).
You can also combine both methods into a hybrid approach by enriching a knowledge
graph's structured entities with vector embeddings. This enables the system to perform both
relational and semantic searches simultaneously. This provides the structured reasoning
of a graph and the nuanced, conceptual search of a vector database, offering the best of
both worlds.
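To make the two retrieval styles concrete, here is a toy sketch of each. It rests on loud assumptions: the `embed` function is a bag-of-words counter rather than a learned embedding model, so it captures word overlap and not true semantic similarity, and the triples and helper names are invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy "embedding": word counts. A real system would call an embedding model.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vector_search(query: str, memories: list[str], top_k: int = 1) -> list[str]:
    # Rank memories by similarity to the query and keep the closest ones.
    query_vec = embed(query)
    ranked = sorted(memories, key=lambda m: cosine(query_vec, embed(m)), reverse=True)
    return ranked[:top_k]


# Knowledge graph stored as (subject, relation, object) triples.
triples = [
    ("user", "works_at", "Acme"),
    ("Acme", "located_in", "Paris"),
]


def neighbors(entity: str) -> list[tuple[str, str]]:
    return [(rel, obj) for subj, rel, obj in triples if subj == entity]


def two_hop(entity: str) -> list[str]:
    # Follow edges twice to surface indirect connections.
    return [obj2 for _, obj1 in neighbors(entity) for _, obj2 in neighbors(obj1)]


memories = [
    "The user prefers a window seat.",
    "The user is allergic to peanuts.",
]
print(vector_search("Which seat does the user prefer?", memories))
print(two_hop("user"))
```

The vector search answers "which memory is closest to this question?", while the graph traversal answers "what is connected to this entity?". A hybrid store would attach an embedding to each node so that both kinds of lookup run against the same data.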
Creation mechanisms
We can also classify memories by how they were created, including how the information was
derived. Explicit memories are created when the user gives a direct command to the agent
to remember something (e.g., "Remember my anniversary is October 26th"). On the other
hand, implicit memories are created when the agent infers and extracts information from
the conversation without a direct command (e.g., “My anniversary is next week. Can you help
me find a gift for my partner?”).
Memories can also be distinguished by whether the memory extraction logic is located
internally or externally to the agent framework. Internal memory refers to memory
management that is built directly into the agent framework. It’s convenient for getting
started but often lacks advanced features. Internal memory can use external storage, but the
mechanism for generating memories is internal to the agent.
External Memory involves using a separate, specialized service dedicated to memory
management (e.g., Agent Engine Memory Bank, Mem0, Zep). The agent framework makes
API calls to this external service to store, retrieve, and process memories. This approach
provides more sophisticated features like semantic search, entity extraction, and automatic
summarization, offloading the complex task of memory management to a purpose-built tool.
Memory scope
You also need to consider who or what a memory describes. This has implications on what
entity (i.e. a user, session, or application) you use to aggregate and retrieve memories.
User-Level scope is the most common implementation, designed to create a continuous,
personalized experience for each individual; for example, “the User prefers the middle seat.”
Memories are tied to a specific user ID and persist across all their sessions, allowing the
agent to build a long-term understanding of their preferences and history.
Session-Level scope is designed for the compaction of long conversations; for example,
“the User is shopping for tickets between New York and Paris between November 7, 2025 and
November 14, 2025. They prefer direct flights and the middle seat”. It creates a persistent
record of insights extracted from a single session, allowing an agent to replace the verbose,
token-heavy transcript with a concise set of key facts. Crucially, this memory is distinct
from the raw session log; it contains only the processed insights from the dialogue, not the
dialogue itself, and its context is isolated to that specific session.
Application-level scope (or global context) covers memories accessible by all users of an
application; for example, “The codename XYZ refers to the project….” This scope is used
to provide shared context, broadcast system-wide information, or establish a baseline of
common knowledge. A common use case for application-level memories is procedural
memories, which provide "how-to" instructions for the agent; the memories are generally
intended to help with the agent’s reasoning for all users. It is critical that these memories are
sanitized of all sensitive content to prevent data leaks between users.
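One simple way to picture scope is as the key under which memories are aggregated and looked up. The sketch below is an assumption-heavy illustration: it treats a scope as a set of key/value pairs and retrieves by exact match, and the names `scope_key`, `add_memory`, and `get_memories` are hypothetical. Real memory managers may match scopes differently.

```python
from collections import defaultdict

# Scope (as a frozenset of key/value pairs) -> memories stored under it.
_store: dict[frozenset, list[str]] = defaultdict(list)


def scope_key(**scope: str) -> frozenset:
    return frozenset(scope.items())


def add_memory(fact: str, **scope: str) -> None:
    _store[scope_key(**scope)].append(fact)


def get_memories(**scope: str) -> list[str]:
    # Exact-match lookup: a different scope never sees these memories.
    return list(_store.get(scope_key(**scope), []))


# User-level: persists across all of the user's sessions.
add_memory("The user prefers the middle seat.", user_id="123")
# Session-level: isolated to one conversation.
add_memory(
    "The user is shopping for direct flights from New York to Paris.",
    user_id="123",
    session_id="abc",
)
# Application-level: shared by every user, so it must hold no personal data.
add_memory("Always confirm the travel dates before booking.", app_name="travel_agent")

print(get_memories(user_id="123"))
print(get_memories(user_id="456"))
```

With exact matching, the user-level lookup does not return the session-level memory, and a different user ID returns nothing, which is the isolation property that prevents leaks between users.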
Multimodal memory
"Multimodal memory" is a crucial concept that describes how an agent handles non-textual
information, like images, videos, and audio. The key is to distinguish between the data the
memory is derived from (its source) and the data the memory is stored as (its content).
Memory from a multimodal source is the most common implementation. The agent can
process various data types—text, images, audio—but the memory it creates is a textual
insight derived from that source. For example, an agent can process a user's voice memo
to create memories. It doesn't store the audio file itself; instead, it transcribes the audio and
creates a textual memory like, "User expressed frustration about the recent shipping delay."
Memory with Multimodal Content is a more advanced approach where the memory
itself contains non-textual media. The agent doesn't just describe the content; it stores the
content directly. For example, a user can upload an image and say "Remember this design
for our logo." The agent creates a memory that directly contains the image file, linked to the
user's request.
Most contemporary memory managers focus on handling multimodal sources while
producing textual content. This is because generating and retrieving unstructured binary
data like images or audio for a specific memory requires specialized models, algorithms, and
infrastructure. It is far simpler to convert all inputs into a common, searchable format: text.
For example, you can generate memories from multimodal input 16 using Agent Engine
Memory Bank. The output memories will be textual insights extracted from the content:
Python
import vertexai
from google.genai import types

client = vertexai.Client(project=..., location=...)

# agent_engine_name, user_id, CONTENT_AS_BYTES, and MIME_TYPE are placeholders
# that the caller must define.
response = client.agent_engines.memories.generate(
    name=agent_engine_name,
    direct_contents_source={
        "events": [
            {
                "content": types.Content(
                    role="user",
                    parts=[
                        # Text that gives context about the multimodal input.
                        types.Part.from_text(
                            text="This is context about the multimodal input."
                        ),
                        # Inline binary content.
                        types.Part.from_bytes(
                            data=CONTENT_AS_BYTES,
                            mime_type=MIME_TYPE,
                        ),
                        # Content referenced by URI.
                        types.Part.from_uri(
                            file_uri="file/path/to/content",
                            mime_type=MIME_TYPE,
                        ),
                    ],
                )
            }
        ]
    },
    scope={"user_id": user_id},
)
Snippet 5: Example memory generation API call for Agent Engine Memory Bank
The next section examines the mechanics of memory generation, detailing the two
core stages: the extraction of new information from source data, and the subsequent
consolidation of that information with the existing memory corpus.
Memory Generation: Extraction and Consolidation
Memory generation autonomously transforms raw conversational data into structured,
meaningful insights. Think of it as an LLM-driven ETL (Extract, Transform,
Load) pipeline designed to extract and condense memories. Memory generation’s ETL
pipeline distinguishes memory managers from RAG engines and traditional databases.
Rather than requiring developers to manually specify database operations, a memory
manager uses an LLM to intelligently decide when to add, update, or merge memories.
This automation is a memory manager’s core strength; it abstracts away the complexity of
managing the database contents, chaining together LLM calls, and deploying background
services for data processing.
Figure 6: High-level algorithm of memory generation which extracts memories from new data sources and
consolidates them with existing memories
While the specific algorithms vary by platform (e.g., Agent Engine Memory Bank, Mem0, Zep),
the high-level process of memory generation generally follows these four stages:
1. Ingestion: The process begins when the client provides a source of raw data, typically a
conversation history, to the memory manager.
2. Extraction & Filtering: The memory manager uses an LLM to extract meaningful content
from the source data. The key is that this LLM doesn't extract everything; it only captures
information that fits a predefined topic definition. If the ingested data contains no
information that matches these topics, no memory is created.
3. Consolidation: This is the most sophisticated stage, where the memory manager handles
conflict resolution and deduplication. It performs a "self-editing" process, using an
LLM to compare the newly extracted information with existing memories. To ensure the
user's knowledge base remains coherent, accurate, and evolves over time based on new
information, the manager can decide to:
• Merge the new insight into an existing memory.
• Delete an existing memory if it’s now invalidated.
• Create an entirely new memory if the topic is novel.
4. Storage: Finally, the new or updated memory is persisted to a durable storage layer (such
as a vector database or knowledge graph) so it can be retrieved in future interactions.
A managed memory manager, like Agent Engine Memory Bank, fully automates this pipeline.
It provides a single, coherent system for turning conversational noise into structured
knowledge, allowing developers to focus on agent logic rather than building and maintaining
the underlying data infrastructure themselves. For example, triggering memory generation
with Memory Bank only requires a simple API call 17 :
Python
import vertexai

client = vertexai.Client(project=..., location=...)

client.agent_engines.memories.generate(
    name="projects/.../locations/.../reasoningEngines/...",
    scope={"user_id": "123"},
    direct_contents_source={
        "events": [...]
    },
    config={
        # Run memory generation in the background.
        "wait_for_completion": False
    },
)
Snippet 6: Generate memories with Agent Engine Memory Bank
The process of memory generation can be compared to the work of a diligent gardener
tending to a garden. Extraction is like receiving new seeds and saplings (new information
from a conversation). The gardener doesn't just throw them randomly onto the plot. Instead,
they perform Consolidation by pulling out weeds (deleting redundant or conflicting data),
pruning back overgrown branches to improve the health of existing plants (refining and
summarizing existing memories), and then carefully planting the new saplings in the optimal
location. This constant, thoughtful curation ensures the garden remains healthy, organized,
and continues to flourish over time, rather than becoming an overgrown, unusable mess. This
asynchronous process happens in the background, ensuring the garden is always ready for
the next visit.
Now, let’s dive into the two key steps of memory generation: extraction and consolidation.
Deep-dive: Memory Extraction
The goal of memory extraction is to answer the fundamental question: "What information
in this conversation is meaningful enough to become a memory?" This is not simple
summarization; it is a targeted, intelligent filtering process designed to separate the signal
(important facts, preferences, goals) from the noise (pleasantries, filler text).
"Meaningful" is not a universal concept; it is defined entirely by the agent's purpose and use
case. What a customer support agent needs to remember (e.g., order numbers, technical
issues) is fundamentally different from what a personal wellness coach needs to remember
(e.g., long-term goals, emotional states). Customizing what information is preserved is
therefore the key to creating a truly effective agent.
The memory manager's LLM decides what to extract by following a carefully constructed
set of programmatic guardrails and instructions, usually embedded in a complex system
prompt. This prompt defines what "meaningful" means by providing the LLM with a set
of topic definitions. With schema and template-based extraction, the LLM is given a
predefined JSON schema or a template using LLM features like structured output 18 ; the LLM
is instructed to construct the JSON using corresponding information in the conversation.
Alternatively, with natural language topic definitions, the LLM is guided by a simple natural
language description of the topic.
With few-shot prompting, the LLM is "shown" what information to extract using examples.
The prompt includes several examples of input text and the ideal, high-fidelity memory that
should be extracted. The LLM learns the desired extraction pattern from the examples,
making it highly effective for custom or nuanced topics that are difficult to describe with a
schema or a simple definition.
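The sketch below shows how topic definitions and few-shot examples might be assembled into a single extraction prompt. It is a hedged illustration only: the prompt wording, the `Topic` and `FewShot` classes, and `build_extraction_prompt` are assumptions, and production memory managers use considerably more elaborate system prompts.

```python
from dataclasses import dataclass


@dataclass
class Topic:
    label: str
    description: str


@dataclass
class FewShot:
    conversation: str
    memories: list[str]


def build_extraction_prompt(
    topics: list[Topic], examples: list[FewShot], conversation: str
) -> str:
    lines = [
        "Extract memories from the conversation.",
        "Only extract information that matches one of these topics:",
    ]
    # Natural-language topic definitions tell the LLM what counts as meaningful.
    lines += [f"- {t.label}: {t.description}" for t in topics]
    lines.append("If nothing matches, return an empty list.")
    # Few-shot examples show the desired input-to-memory mapping.
    for example in examples:
        lines.append("Example conversation:")
        lines.append(example.conversation)
        lines.append("Example memories:")
        lines += [f"- {m}" for m in example.memories]
    lines.append("Conversation:")
    lines.append(conversation)
    lines.append("Memories:")
    return "\n".join(lines)


prompt = build_extraction_prompt(
    topics=[Topic("business_feedback", "User feedback about their visit.")],
    examples=[
        FewShot(
            conversation="user: The drip coffee was a bit lukewarm today.",
            memories=["The user reported that the drip coffee was lukewarm."],
        )
    ],
    conversation="user: The music was way too loud.",
)
print(prompt)
```

The resulting string would be sent to the LLM as the extraction instruction. Swapping the topic list or the examples changes what the model treats as meaningful without touching any other code.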
Most memory managers work out-of-the-box by looking for common topics, such as user
preferences, key facts, or goals. Many platforms also allow developers to define their own
custom topics, tailoring the extraction process to their specific domain. For example, you can
customize what information Agent Engine Memory Bank considers to be meaningful to be
persisted by providing your own topic definitions and few-shot examples 19 :
Python
from google.genai.types import Content, Part

# See https://cloud.google.com/agent-builder/agent-engine/memory-bank/set-up
# for more information.
memory_bank_config = {
    "customization_configs": [{
        "memory_topics": [
            # A built-in topic managed by Memory Bank.
            {"managed_memory_topic": {"managed_topic_enum": "USER_PERSONAL_INFO"}},
            # A custom topic described in natural language.
            {
                "custom_memory_topic": {
                    "label": "business_feedback",
                    "description": """Specific user feedback about their experience at the coffee
shop. This includes opinions on drinks, food, pastries, ambiance, staff friendliness,
service speed, cleanliness, and any suggestions for improvement."""
                }
            }
        ],
        # A few-shot example: a sample conversation and the memories to extract from it.
        "generate_memories_examples": {
            "conversationSource": {
                "events": [
                    {
                        "content": Content(
                            role="model",
                            parts=[Part(text=(
                                "Welcome back to The Daily Grind! We'd love to hear "
                                "your feedback on your visit."
                            ))])
                    }, {
                        "content": Content(
                            role="user",
                            parts=[Part(text=(
                                "Hey. The drip coffee was a bit lukewarm today, which "
                                "was a bummer. Also, the music was way too loud, I could "
                                "barely hear my friend."
                            ))])
                    }]
            },
            "generatedMemories": [
                {"fact": "The user reported that the drip coffee was lukewarm."},
                {"fact": "The user felt the music in the shop was too loud."}
            ]
        }
    }]
}

# `client` is a vertexai.Client, created as in the earlier snippets.
agent_engine = client.agent_engines.create(
    config={
        "context_spec": {"memory_bank_config": memory_bank_config}
    }
)
Snippet 7: Customizing what information Agent Engine Memory Bank considers meaningful to persist
Although memory extraction itself is not “summarization,” the algorithm may incorporate
summarization to distill information. To enhance efficiency, many memory managers
incorporate a rolling summary of the conversation directly into the memory extraction
prompt 20 . This condensed history provides the necessary context to extract key information
from the most recent interactions. It eliminates the need to repeatedly process the full,
verbose dialogue with each turn to maintain context.
Once information has been extracted from the data source, the existing corpus of memories
must be updated to reflect the new information via consolidation.
Deep-dive: Memory Consolidation
After memories are extracted from the verbose conversation, consolidation should integrate
the new information into a coherent, accurate, and evolving knowledge base. It is arguably
the most sophisticated stage in the memory lifecycle, transforming a simple collection of
facts into a curated understanding of the user. Without consolidation, an agent's memory
would quickly become a noisy, contradictory, and unreliable log of every piece of information
ever captured. This "self-curation" is typically managed by an LLM and is what elevates a
memory manager beyond a simple database.
Consolidation addresses fundamental problems arising from conversational data, including:
• Information Duplication: A user might mention the same fact in multiple ways across
different conversations (e.g., "I need a flight to NYC" and later "I'm planning a trip to New
York"). A simple extraction process would create two redundant memories.
• Conflicting Information: A user's state changes over time. Without consolidation, the
agent's memory would contain contradictory facts.
• Information Evolution: A simple fact can become more nuanced. An initial memory that
"the user is interested in marketing" might evolve into "the user is leading a marketing
project focused on Q4 customer acquisition."
• Memory Relevance Decay: Not all memories remain useful forever. An agent must
engage in forgetting—proactively pruning old, stale, or low-confidence memories to
keep the knowledge base relevant and efficient. Forgetting can happen by instructing the
LLM to defer to newer information during consolidation or through automatic deletion via
a time-to-live (TTL).
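The TTL variant of forgetting is the simplest of these mechanisms to sketch. In the example below, `StoredMemory`, `prune_expired`, and the 365-day TTL are illustrative assumptions; the only idea taken from the text is that memories older than a time-to-live are deleted automatically.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class StoredMemory:
    fact: str
    updated_at: datetime


def prune_expired(
    memories: list[StoredMemory], ttl: timedelta, now: datetime
) -> list[StoredMemory]:
    # Keep only memories that were updated within the TTL window.
    return [m for m in memories if now - m.updated_at <= ttl]


now = datetime(2025, 11, 1, tzinfo=timezone.utc)
memories = [
    StoredMemory("The user is planning a trip to New York.", now - timedelta(days=400)),
    StoredMemory("The user prefers a window seat.", now - timedelta(days=10)),
]
kept = prune_expired(memories, ttl=timedelta(days=365), now=now)
print([m.fact for m in kept])
```

Passing `now` explicitly keeps the function deterministic and easy to test. A production system would more likely rely on the expiry features of its storage layer than on a scan like this one.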
The consolidation process is an LLM-driven workflow that compares newly extracted insights
against the user's existing memories. First, the workflow tries to retrieve existing memories
that are similar to the newly extracted memories. These existing memories are candidates
for consolidation. If the existing memory is contradicted by the new information, it may be
deleted. If it is augmented, it may be updated.
Second, an LLM is presented with both the existing memories and the new information. Its
core task is to analyze them together and identify what operations should be performed. The
primary operations include:
• UPDATE: Modify an existing memory with new or corrected information.
• CREATE: If the new insight is entirely novel and unrelated to existing memories, create a
new one.
• DELETE / INVALIDATE: If the new information makes an old memory completely irrelevant
or incorrect, delete or invalidate it.
Finally, the memory manager translates the LLM's decision into a transaction that updates
the memory store.
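The final step, turning the LLM's decisions into a transaction, can be sketched as follows. The LLM call itself is omitted: the list of operations is written by hand as if a model had returned it. The `Operation` class, the `apply_operations` function, and the example facts are assumptions made for illustration, not any product's schema.

```python
from dataclasses import dataclass


@dataclass
class Operation:
    action: str      # "CREATE", "UPDATE", or "DELETE"
    memory_id: str
    fact: str = ""   # unused for DELETE


def apply_operations(
    store: dict[str, str], operations: list[Operation]
) -> dict[str, str]:
    # Work on a copy so that a failing operation leaves the original untouched.
    updated = dict(store)
    for op in operations:
        if op.action == "CREATE":
            if op.memory_id in updated:
                raise KeyError(f"memory {op.memory_id!r} already exists")
            updated[op.memory_id] = op.fact
        elif op.action == "UPDATE":
            if op.memory_id not in updated:
                raise KeyError(f"memory {op.memory_id!r} does not exist")
            updated[op.memory_id] = op.fact
        elif op.action == "DELETE":
            del updated[op.memory_id]  # raises KeyError if the memory is missing
        else:
            raise ValueError(f"unknown action {op.action!r}")
    return updated


store = {
    "m1": "The user is interested in marketing.",
    "m2": "The user lives in Boston.",
}
# Hand-written stand-in for the operations an LLM might propose.
operations = [
    Operation("UPDATE", "m1", "The user is leading a marketing project focused on Q4 customer acquisition."),
    Operation("DELETE", "m2"),
    Operation("CREATE", "m3", "The user lives in New York."),
]
new_store = apply_operations(store, operations)
print(new_store)
```

Returning a new dictionary instead of mutating the input gives all-or-nothing behavior: if any operation is invalid, an exception is raised and the caller's store is unchanged.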
Memory Provenance
The classic machine learning axiom of "garbage in, garbage out" is even more critical for
LLMs, where the outcome is often "garbage in, confident garbage out." For an agent to make
reliable decisions and for a memory manager to effectively consolidate memories, they must
be able to critically evaluate the quality of their memories. This trustworthiness is derived
directly from a memory’s provenance—a detailed record of its origin and history.
Figure 7: The flow of information between data sources and memories. A single memory can be derived from
multiple data sources, and a single data source may contribute to multiple memories.
The process of memory consolidation—merging information from multiple sources into
a single, evolving memory—creates the need to track its lineage. As shown in the diagram
above, a single memory might be a blend of multiple data sources, and a single source might
be segmented into multiple memories.
To assess trustworthiness, the agent must track key details for each source, such as its origin
(source type) and age (“freshness”). These details are critical for two reasons: they dictate
the weight each source has during memory consolidation, and they inform how much the
agent should rely on that memory during inference.
The source type is one of the most important factors in determining trust. Data sources fall
into three main categories:
• Bootstrapped Data: Information pre-loaded from internal systems, such as a CRM.
This high-trust data can be used to initialize a user's memories to address the cold-start
problem, which is the challenge of providing a personalized experience to a user the agent
has never interacted with before.
• User Input: This includes data provided explicitly (e.g., via a form, which is high-trust) or
information extracted implicitly from a conversation (which is generally less trustworthy).
• Tool Output: Data returned from an external tool call. Generating memories from Tool
Output is generally discouraged because these memories tend to be brittle and stale,
making this source type better suited for short-term caching.
Accounting for memory lineage during memory management
This dynamic, multi-source approach to memory creates two primary operational challenges
when managing memories: conflict resolution and deleting derived data.
Memory consolidation inevitably leads to conflicts where one data source conflicts with
another. A memory’s provenance allows the memory manager to establish a hierarchy of
trust for its information sources. When memories from different sources contradict each
other, the agent must use this hierarchy in a conflict resolution strategy. Common strategies
include prioritizing the most trusted source, favoring the most recent information, or looking
for corroboration across multiple data points.
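One way to picture such a strategy is the sketch below, which prefers the most trusted source and breaks ties by recency. The `SOURCE_TRUST` ranking, the `Claim` type, and the source-type labels are assumptions made for illustration; the numeric ranks are not prescribed by any particular memory manager.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical trust ranking: higher numbers win a conflict.
SOURCE_TRUST = {
    "bootstrapped": 3,   # e.g., pre-loaded from a CRM
    "user_explicit": 3,  # e.g., submitted via a form
    "user_implicit": 2,  # inferred from conversation
    "tool_output": 1,    # brittle, quickly stale
}


@dataclass
class Claim:
    value: str
    source_type: str
    observed_at: datetime


def resolve_conflict(claims: list) -> Claim:
    """Pick the claim from the most trusted source, breaking ties by recency."""
    if not claims:
        raise ValueError("no claims to resolve")
    return max(
        claims,
        key=lambda c: (SOURCE_TRUST.get(c.source_type, 0), c.observed_at),
    )
```

Corroboration could be layered on top, for example by counting how many independent sources agree on a value before comparing trust.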
Another challenge to managing memories occurs when deleting memories. A memory can
be derived from multiple data sources. When a user revokes access to one data source,
data derived from that source should also be removed. Deleting every memory "touched"
by that source can be overly aggressive. A more precise, though computationally expensive,
approach is to regenerate the affected memories from scratch using only the remaining,
valid sources.
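The regeneration approach can be sketched as follows. The `Memory` type and `handle_revocation` function are hypothetical, and the `regenerate` callback stands in for an LLM-driven re-extraction over the remaining sources.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    fact: str
    source_ids: set = field(default_factory=set)


def handle_revocation(memories: list, revoked_source_id: str, regenerate) -> list:
    """Rebuild memories touched by a revoked source from their remaining sources.

    `regenerate(remaining_source_ids)` returns a new fact or None.
    """
    kept = []
    for memory in memories:
        if revoked_source_id not in memory.source_ids:
            kept.append(memory)  # untouched by the revoked source
            continue
        remaining = memory.source_ids - {revoked_source_id}
        if not remaining:
            continue  # nothing valid is left to derive this memory from
        new_fact = regenerate(remaining)
        if new_fact:
            kept.append(Memory(new_fact, remaining))
    return kept
```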
Beyond static provenance details, confidence in a memory must evolve. Confidence
increases through corroboration, such as when multiple trusted sources provide consistent
information. However, an efficient memory system must also actively curate its existing
knowledge through memory pruning—a process that identifies and "forgets" memories that
are no longer useful. This pruning can be triggered by several factors.
• Time-based Decay: The importance of a memory can decrease over time. A memory
about a meeting from two years ago is likely less relevant than one from last week.
• Low Confidence: A memory that was created from a weak inference and was never
corroborated by other sources may be pruned.
• Irrelevance: As an agent gains a more sophisticated understanding of a user, it
might determine that some older, trivial memories are no longer relevant to the user's
current goals.
By combining a reactive consolidation pipeline with proactive pruning, the memory manager
ensures that the agent's knowledge base is not just a growing log of everything ever said.
Instead, it’s a curated, accurate, and relevant understanding of the user.
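A pruning pass over the first two triggers might look like the sketch below. The `StoredMemory` fields and the thresholds (one year, 0.3 confidence) are illustrative assumptions; judging irrelevance usually needs an LLM and is omitted here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StoredMemory:
    fact: str
    confidence: float     # 0.0 to 1.0
    corroborations: int   # how many other sources agreed
    last_updated: datetime


def prune(memories: list, now: datetime,
          max_age: timedelta = timedelta(days=365),
          min_confidence: float = 0.3) -> list:
    """Keep memories that are neither too old nor weak and uncorroborated."""
    kept = []
    for m in memories:
        too_old = now - m.last_updated > max_age
        weak = m.confidence < min_confidence and m.corroborations == 0
        if not (too_old or weak):
            kept.append(m)
    return kept
```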
Accounting for memory lineage during inference
In addition to accounting for a memory’s lineage while curating the corpus’s contents,
a memory’s trustworthiness should also be considered at inference time. An agent's
confidence in a memory should not be static; it must evolve based on new information and
the passage of time. Confidence increases through corroboration, such as when multiple
trusted sources provide consistent information. Conversely, confidence decreases (or
decays) over time as older memories become stale, and it also drops when contradictory
information is introduced. Eventually, the system can "forget" by archiving or deleting low-
confidence memories. This dynamic confidence score is critical during inference time.
This entire trust framework serves the agent's internal reasoning process. Memories and,
if available, their confidence scores are not typically shown to the user directly. Instead,
they are injected into the system prompt, allowing the LLM to weigh the evidence, consider
the reliability of its information, and ultimately make more nuanced and trustworthy decisions.
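To illustrate, the sketch below decays a confidence score with an exponential half-life and renders memories with their scores for the prompt. The half-life model, the 180-day default, and the `<MEMORIES>` wrapper format are assumptions chosen for the example, not a prescribed scheme.

```python
def decayed_confidence(base: float, age_days: float,
                       half_life_days: float = 180.0) -> float:
    """Halve the confidence every `half_life_days` (an assumed decay model)."""
    return base * 0.5 ** (age_days / half_life_days)


def format_memories_for_prompt(memories: list) -> str:
    """Render (fact, confidence) pairs as a block for the system prompt."""
    lines = [f"* {fact} (confidence: {conf:.2f})" for fact, conf in memories]
    return "<MEMORIES>\n" + "\n".join(lines) + "\n</MEMORIES>"
```

For example, a memory created with confidence 0.8 and left uncorroborated for 180 days would be presented to the model with a confidence of 0.40.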
Triggering memory generation
Although memory managers automate memory extraction and consolidation once generation
is triggered, the agent must still decide when memory generation should be attempted. This
is a critical architectural choice, balancing data freshness against computational cost and
latency. This decision is typically managed by the agent's logic, which can employ several
triggering strategies. Memory generation can be initiated based on various events:
• Session Completion: Triggering generation at the end of a multi-turn session.
• Turn Cadence: Running the process after a specific number of turns (e.g., every 5 turns).
• Real-Time: Generating memories after every single turn.
• Explicit Command: Activating the process upon a direct user command (e.g.,
"Remember this").
The choice of trigger involves a direct tradeoff between cost and fidelity. Frequent
generation (e.g., real-time) ensures memories are highly detailed and fresh, capturing
every nuance of the conversation. However, this incurs the highest LLM and database
costs and can introduce latency if not handled properly. Infrequent generation (e.g., at
session completion) is far more cost-effective but risks creating lower-fidelity memories, as
the LLM must summarize a much larger block of conversation at once. You also want to be
careful that the memory manager is not processing the same events multiple times, as that
introduces unnecessary cost.
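The sketch below combines three of these triggers (turn cadence, session completion, explicit command) and tracks a cursor so the same events are never sent to the memory manager twice. `MemoryTrigger` is a hypothetical helper, and it simplifies by treating each event as one turn.

```python
class MemoryTrigger:
    """Decides when to generate memories and which events are still unprocessed."""

    def __init__(self, every_n_turns: int = 5):
        self.every_n_turns = every_n_turns
        self._processed_up_to = 0  # index of the first event not yet sent

    def events_to_process(self, events: list, *, session_ended: bool = False,
                          user_text: str = "") -> list:
        pending = events[self._processed_up_to:]
        if not pending:
            return []
        explicit = "remember this" in user_text.lower()
        cadence = len(pending) >= self.every_n_turns
        if not (explicit or cadence or session_ended):
            return []
        # Advance the cursor so these events are not processed again.
        self._processed_up_to = len(events)
        return pending
```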
Memory-as-a-Tool
A more sophisticated approach is to allow the agent to decide for itself when to create a
memory. In this pattern, memory generation is exposed as a tool (e.g., `create_memory`); the
tool definition should describe what types of information are considered meaningful. The
agent can then analyze the conversation and autonomously decide to call this tool when it
identifies information that is worth persisting. This shifts the responsibility for identifying
"meaningful information" from the external memory manager to the agent itself (and thus to
you as the developer).
For example, you can do this using ADK by packaging your memory generation code into a
Tool 21 that the agent decides to invoke when it deems the conversation meaningful to persist.
You can send the Session to Memory Bank, and Memory Bank will extract and consolidate
memories from the conversation history:
Python
from google.adk.agents import LlmAgent
from google.adk.memory import VertexAiMemoryBankService
from google.adk.runners import Runner
from google.adk.tools import ToolContext

# `client` is assumed to be an already-initialized Vertex AI SDK client.
# APP_NAME, AGENT_ENGINE_ID, PROJECT, LOCATION, and session_service are
# assumed to be defined elsewhere.

def generate_memories(tool_context: ToolContext):
    """Triggers memory generation to remember the session."""
    # Option 1: Extract memories from the complete conversation history using the
    # ADK memory service. If add_session_to_memory is a coroutine in your ADK
    # version, declare this tool with `async def` and `await` the call.
    tool_context._invocation_context.memory_service.add_session_to_memory(
        tool_context._invocation_context.session)

    # Option 2: Extract memories from the last conversation turn.
    client.agent_engines.memories.generate(
        name="projects/.../locations/...reasoningEngines/...",
        direct_contents_source={
            "events": [
                {"content": tool_context._invocation_context.user_content}
            ]
        },
        scope={
            "user_id": tool_context._invocation_context.user_id,
            "app_name": tool_context._invocation_context.app_name
        },
        # Generate memories in the background
        config={"wait_for_completion": False}
    )
    return {"status": "success"}

agent = LlmAgent(
    ...,
    tools=[generate_memories]
)

runner = Runner(
    agent=agent,
    app_name=APP_NAME,
    session_service=session_service,
    memory_service=VertexAiMemoryBankService(
        agent_engine_id=AGENT_ENGINE_ID,
        project=PROJECT,
        location=LOCATION
    )
)
Snippet 8: ADK agent using a custom tool to trigger memory generation. Memory Bank will extract and
consolidate the memories.
Another approach is to leverage internal memory, where the agent actively decides what to
remember from a conversation. In this workflow, the agent is responsible for extracting key
information. Optionally, these extracted memories are then sent to Agent Engine Memory
Bank to be consolidated with the user's existing memories 22 :
Python
def extract_memories(query: str, tool_context: ToolContext):
    """Triggers memory generation to remember information.

    Args:
        query: Meaningful information that should be persisted about the user.
    """
    client.agent_engines.memories.generate(
        name="projects/.../locations/...reasoningEngines/...",
        # The meaningful information is already extracted from the conversation, so we
        # just want to consolidate it with existing memories for the same user.
        direct_memories_source={
            "direct_memories": [{"fact": query}]
        },
        scope={
            "user_id": tool_context._invocation_context.user_id,
            "app_name": tool_context._invocation_context.app_name
        },
        config={"wait_for_completion": False}
    )
    return {"status": "success"}

agent = LlmAgent(
    ...,
    tools=[extract_memories]
)
Snippet 9: ADK agent using a custom tool to extract memories from the conversation and trigger
consolidation with Agent Engine Memory Bank. Unlike Snippet 8, the agent is responsible for extracting
memories, not Memory Bank.
Background vs. Blocking Operations
Memory generation is an expensive operation requiring LLM calls and database writes. For
agents in production, memory generation should almost always be handled asynchronously
as a background process 23 .
After an agent sends its response to the user, the memory generation pipeline can run
in parallel without blocking the user experience. This decoupling is essential for keeping
the agent feeling fast and responsive. A blocking (or synchronous) approach, where the
user has to wait for the memory to be written before receiving a response, would create
an unacceptably slow and frustrating user experience. This necessitates that memory
generation occurs in a service that is architecturally separate from the agent's core runtime.
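The non-blocking behavior can be illustrated with `asyncio`, as in the sketch below. This is a simplification: it runs the background work in the same process, whereas the text above recommends a separate service. `generate_memories`, `handle_turn`, and the `sink` list are hypothetical stand-ins for the memory pipeline, the agent runtime, and the memory store.

```python
import asyncio


async def generate_memories(transcript: str, sink: list) -> None:
    """Stand-in for slow LLM calls and database writes."""
    await asyncio.sleep(0.05)
    sink.append(f"memory derived from: {transcript}")


async def handle_turn(user_message: str, sink: list, background_tasks: set) -> str:
    response = f"echo: {user_message}"  # stand-in for the agent's reply
    # Schedule memory generation without awaiting it, so the reply is not delayed.
    task = asyncio.create_task(generate_memories(user_message, sink))
    background_tasks.add(task)  # keep a reference so the task is not garbage collected
    task.add_done_callback(background_tasks.discard)
    return response


async def main():
    sink, tasks = [], set()
    response = await handle_turn("I prefer window seats", sink, tasks)
    written_before = len(sink)     # the reply is ready before any memory is written
    await asyncio.gather(*tasks)   # in a real service this would run independently
    return response, written_before, len(sink)
```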
Memory Retrieval
With a mechanism for memory generation in place, your focus can shift to the critical
task of retrieval. An intelligent retrieval strategy is essential for an agent's performance,
encompassing decisions about which memories should be retrieved and when to
retrieve them.
The strategy for retrieving a memory depends heavily on how memories are organized. For a
structured user profile, retrieval is typically a straightforward lookup for the full profile or
a specific attribute. For a collection of memories, however, retrieval is a far more complex
search problem. The goal is to discover the most pertinent, conceptually related information
from a large pool of unstructured or semi-structured data. The strategies discussed in this
section are designed to solve this complex retrieval challenge for memory collections.
Memory retrieval searches for the most pertinent memories for the current conversation. An
effective retrieval strategy is crucial; providing irrelevant memories can confuse the model
and degrade its response, while finding the perfect piece of context can lead to a remarkably
intelligent interaction. The core challenge is balancing memory 'usefulness' within a strict
latency budget.
Advanced memory systems go beyond a simple search and score potential memories across
multiple dimensions to find the best fit.
• Relevance (Semantic Similarity): How conceptually related is this memory to the
current conversation?
• Recency (Time-based): How recently was this memory created?
• Importance (Significance): How critical is this memory overall? Unlike relevance, the
“importance” of a memory may be defined at generation-time.
Relying solely on vector-based relevance is a common pitfall. Similarity scores can surface
memories that are conceptually similar but old or trivial. The most effective strategy is a
blended approach that combines the scores from all three dimensions.
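A blended score could be computed as in the sketch below. The weights, the 30-day recency half-life, and the `Candidate` type are illustrative assumptions that would need tuning for a real application.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    fact: str
    relevance: float   # semantic similarity to the query, in [0, 1]
    age_days: float    # time since the memory was created
    importance: float  # in [0, 1], assigned at generation time


def blended_score(c: Candidate, w_rel: float = 0.6, w_rec: float = 0.2,
                  w_imp: float = 0.2, half_life_days: float = 30.0) -> float:
    """Weighted sum of relevance, recency, and importance (weights are assumed)."""
    recency = 0.5 ** (c.age_days / half_life_days)
    return w_rel * c.relevance + w_rec * recency + w_imp * c.importance


def top_k(candidates: list, k: int = 3) -> list:
    return sorted(candidates, key=blended_score, reverse=True)[:k]
```

With these weights, a fresh and important memory with relevance 0.8 outranks a two-year-old trivial memory with relevance 0.9, which is the behavior pure similarity search fails to provide.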
For applications where accuracy is paramount, retrieval can be refined using approaches
like query rewriting, reranking, or specialized retrievers. However, these techniques are
computationally expensive and add significant latency, making them unsuitable for most
real-time applications. For scenarios where these complex algorithms are necessary and
the memories do not quickly become stale, a caching layer can be an effective mitigation.
Caching allows the expensive results of a retrieval query to be temporarily stored, bypassing
the high latency cost for subsequent identical requests.
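A caching layer of this kind can be as small as the sketch below. `TTLCache` is a hypothetical in-process helper; production systems would more likely use a shared cache, and the TTL should reflect how quickly memories go stale.

```python
import time


class TTLCache:
    """Caches retrieval results for a fixed time-to-live."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock    # injectable so the cache can be tested without sleeping
        self._entries = {}     # key -> (stored_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]      # fresh entry: skip the expensive retrieval
        value = compute()
        self._entries[key] = (now, value)
        return value
```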
With query rewriting, an LLM can be used to improve the search query itself. This can
involve rewriting a user's ambiguous input into a more precise query, or expanding a single
query into multiple related ones to capture different facets of a topic. While this significantly
improves the quality of the initial search results, it adds the latency of an extra LLM call at the
start of the process.
With reranking, an initial retrieval fetches a broad set of candidate memories (e.g., the top 50
results) using similarity search. Then, an LLM can re-evaluate and re-rank this smaller set to
produce a more accurate final list 24 .
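The two-stage shape of reranking is sketched below with pluggable scoring functions. `word_overlap` is a toy stand-in for similarity search, and `rerank_score` stands in for an LLM or cross-encoder; both are assumptions for illustration.

```python
def word_overlap(query: str, memory: str) -> int:
    """Cheap first-stage score: number of shared words (toy stand-in for similarity)."""
    return len(set(query.lower().split()) & set(memory.lower().split()))


def retrieve_and_rerank(query: str, corpus: list, cheap_score, rerank_score,
                        candidates: int = 50, final_k: int = 5) -> list:
    # Stage 1: fetch a broad shortlist with the cheap scorer.
    shortlist = sorted(corpus, key=lambda m: cheap_score(query, m),
                       reverse=True)[:candidates]
    # Stage 2: apply the expensive scorer only to the shortlist.
    return sorted(shortlist, key=lambda m: rerank_score(query, m),
                  reverse=True)[:final_k]
```

The expensive scorer runs on at most `candidates` items regardless of corpus size, which is what keeps the added latency bounded.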
Finally, you can train a specialized retriever using fine-tuning. However, this requires access
to labeled data and can significantly increase costs.
Ultimately, the best approach to retrieval starts with better memory generation. Ensuring the
memory corpus is high-quality and free of irrelevant information is the most effective way to
guarantee that any set of retrieved memories will be helpful.
Timing for retrieval
The final architectural decision for retrieval is when to retrieve memories. One approach is
proactive retrieval, where memories are automatically loaded at the start of every turn. This
ensures context is always available but introduces unnecessary latency for turns that don't
require memory access. Since memories remain static throughout a single turn, they can be
efficiently cached to mitigate this performance cost.
For example, you can implement proactive retrieval in ADK using the built-in
PreloadMemoryTool or a custom callback 25 :
Python
from google.adk.agents import LlmAgent
from google.adk.tools.preload_memory_tool import PreloadMemoryTool

# Option 1: Use the built-in PreloadMemoryTool which retrieves memories with
# similarity search every turn.
agent = LlmAgent(
    ...,
    tools=[PreloadMemoryTool()]
)

# Option 2: Use a custom callback to have more control over how memories
# are retrieved. `client` is assumed to be an initialized Vertex AI SDK client.
def retrieve_memories_callback(callback_context, llm_request):
    user_id = callback_context._invocation_context.user_id
    app_name = callback_context._invocation_context.app_name
    response = client.agent_engines.memories.retrieve(
        name="projects/.../locations/...reasoningEngines/...",
        scope={
            "user_id": user_id,
            "app_name": app_name
        }
    )
    memories = [f"* {memory.memory.fact}" for memory in list(response)]
    if not memories:
        # No memories to add to System Instructions.
        return
    # Append formatted memories to the System Instructions
    llm_request.config.system_instruction += (
        "\nHere is information that you have about the user:\n"
    )
    llm_request.config.system_instruction += "\n".join(memories)

agent = LlmAgent(
    ...,
    before_model_callback=retrieve_memories_callback,
)
Snippet 10: Retrieve memories at the start of every turn with ADK using a built-in tool or custom callback
Alternatively, you can use reactive retrieval (“Memory-as-a-Tool”) where the agent is
given a tool to query its memory, deciding for itself when to retrieve context. This is more
efficient and robust but requires an additional LLM call, increasing latency and cost; however,
memory is retrieved only when necessary, so the latency cost is incurred less frequently.
Additionally, the agent may not know if relevant information exists to be retrieved. However,
this can be mitigated by making the agent aware of the types of memories available (e.g., in
the tool's description if you’re using a custom tool), allowing for a more informed decision on
when to query.
Python
from google.adk.agents import LlmAgent
from google.adk.tools import ToolContext
from google.adk.tools.load_memory_tool import LoadMemoryTool

# Option 1: Use the built-in LoadMemoryTool.
agent = LlmAgent(
    ...,
    tools=[LoadMemoryTool()],
)

# Option 2: Use a custom tool where you can describe what type of information
# might be available.
def load_memory(query: str, tool_context: ToolContext):
    """Retrieves memories for the user.

    The following types of information may be stored for the user:
    * User preferences, like the user's favorite foods.
    ...
    """
    # Retrieve memories using similarity search.
    response = tool_context.search_memory(query)
    return response.memories

agent = LlmAgent(
    ...,
    tools=[load_memory],
)
Snippet 11: Configure your ADK agent to decide when memories should be retrieved using a built-in or
custom tool
Inference with Memories
Once relevant memories have been retrieved, the final step is to strategically place them
into the model's context window. This is a critical process; the placement of memories can
significantly influence the LLM's reasoning, affect operational costs, and ultimately determine
the quality of the final answer.
Memories are primarily presented by appending them to system instructions or injecting
them into conversation history. In practice, a hybrid strategy is often the most effective. Use
the system prompt for stable, global memories (like a user profile) that should always be
present. Otherwise, use dialogue injection or memory-as-a-tool for transient, episodic
memories that are only relevant to the immediate context of the conversation. This balances
the need for persistent context with the flexibility of in-the-moment information retrieval.
Memories in the System Instructions
A simple option to use memories for inference is to append memories to the system
instructions. This method keeps the conversation history clean by appending retrieved
memories directly to the system prompt alongside a preamble, framing them as foundational
context for the entire interaction. For example, you can use Jinja to dynamically add
memories to your system instructions:
Python
from jinja2 import Template

# system_instructions and retrieved_memories are assumed to be defined elsewhere.
template = Template("""
{{ system_instructions }}
<MEMORIES>
Here is some information about the user:
{% for retrieved_memory in data %}* {{ retrieved_memory.memory.fact }}
{% endfor %}</MEMORIES>
""")

prompt = template.render(
    system_instructions=system_instructions,
    data=retrieved_memories
)
Snippet 12: Build your system instruction using retrieved memories
Including memories in the system instructions gives memories high authority, cleanly
separates context from dialogue, and is ideal for stable, "global" information like a user
profile. However, there is a risk of over-influence, where the agent might try to relate every
topic back to the memories in its core instructions, even when inappropriate.
This architectural pattern introduces several constraints. First, it requires the agent
framework to support dynamic construction of the system prompt before each LLM call;
this functionality isn't always readily supported. Additionally, the pattern is incompatible
with "Memory-as-a-Tool" given that the system prompt must be finalized before the LLM
can decide to call a memory retrieval tool. Finally, it poorly handles non-textual memories.
Most LLMs only accept text for the system instructions, making it challenging to embed
multimodal content like images or audio directly into the prompt.
Memories in the Conversation History
In this approach, retrieved memories are injected directly into the turn-by-turn dialogue.
Memories can either be placed before the full conversation history or right before the latest
user query.
However, this method can be noisy, increasing token costs and potentially confusing the
model if the retrieved memories are irrelevant. Its primary risk is dialogue injection, where
the model might mistakenly treat a memory as something that was actually said in the
conversation. You also need to be more careful about the perspective of the memories that
you’re injecting into the conversation; for example, if you’re using the “user” role and user-
level memories, memories should be written in first-person point of view.
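As a minimal sketch of the second placement option, the function below inserts a memory note right before the latest user query. The message format (dicts with `role` and `content`) and the preamble wording are assumptions; adapt both to your framework. The memories are phrased in the first person because the note uses the "user" role.

```python
def inject_memories(history: list, memories: list) -> list:
    """Return a copy of `history` with a memory note before the latest user message."""
    user_indexes = [i for i, msg in enumerate(history) if msg["role"] == "user"]
    if not memories or not user_indexes:
        return list(history)
    note = {
        "role": "user",
        "content": "Things I have told you before:\n"
                   + "\n".join(f"* {m}" for m in memories),
    }
    insert_at = user_indexes[-1]
    return history[:insert_at] + [note] + history[insert_at:]
```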
A special case of injecting memories into the conversation history is retrieving memories
via tool calls. The memories will be included directly in the conversation as part of the
tool output.
Python
def load_memory(query: str, tool_context: ToolContext):
    """Loads memories into the conversation history..."""
    response = tool_context.search_memory(query)
    return response.memories

agent = LlmAgent(
    ...,
    tools=[load_memory],
)
Snippet 13: Retrieve memories as a tool, which directly inserts memories into the conversation
Procedural memories
This whitepaper has focused primarily on declarative memories, a concentration that mirrors
the current commercial memory landscape. Most memory management platforms are also
architected for this declarative approach, excelling at extracting, storing, and retrieving the
"what"—facts, history, and user data.
However, these systems are not designed to manage procedural memories, the mechanism
for improving an agent’s workflows and reasoning. Storing the "how" is not an information
retrieval problem; it is a reasoning augmentation problem. Managing this "knowing how"
requires a completely separate and specialized algorithmic lifecycle, albeit with a similar
high-level structure 26 :
1. Extraction: Procedural extraction requires specialized prompts designed to distill a
reusable strategy or "playbook" from a successful interaction, rather than just capturing a
fact or meaningful information.
2. Consolidation: While declarative consolidation merges related facts (the "what"),
procedural consolidation curates the workflow itself (the "how"). This is an active logic
management process focused on integrating new successful methods with existing
"best practices," patching flawed steps in a known plan, and pruning outdated or
ineffective procedures.
3. Retrieval: The goal is not to retrieve data to answer a question, but to retrieve a plan that
guides the agent on how to execute a complex task. Therefore, procedural memories may
have a different data schema than declarative memories.
This capacity for an agent to 'self-evolve' its logic naturally invites a comparison to a
common adaptation method: fine-tuning—often via Reinforcement Learning from Human
Feedback (RLHF) 27 . While both processes aim to improve agent behavior, their mechanisms
and applications are fundamentally different. Fine-tuning is a relatively slow, offline training
process that alters model weights. Procedural memory provides a fast, online adaptation by
dynamically injecting the correct "playbook" into the prompt, guiding the agent via in-context
learning without requiring any fine-tuning.
Testing and Evaluation
Now that you have a memory-enabled agent, you should validate its behavior via
comprehensive quality and evaluation tests. Evaluating an
agent's memory is a multi-layered process. Evaluation requires verifying that the agent
is remembering the right things (quality), that it can find those memories when needed
(retrieval), and that using those memories actually helps it accomplish its goals (task
success). While academia focuses on reproducible benchmarks, industry evaluation
is centered on how memory directly impacts the performance and usability of a
production agent.
Memory generation quality metrics evaluate the content of the memories themselves,
answering the question: "Is the agent remembering the right things?" This is typically
measured by comparing the agent's generated memories against a manually created "golden
set" of ideal memories.
• Precision: Of all the memories the agent created, what percentage are accurate and
relevant? High precision guards against an "over-eager" memory system that pollutes the
knowledge base with irrelevant noise.
• Recall: Of all the relevant facts it should have remembered from the source, what
percentage did it capture? High recall ensures the agent doesn't miss critical information.
• F1-Score: The harmonic mean of precision and recall, providing a single, balanced
measure of quality.
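These three metrics can be computed as in the sketch below. It assumes memories can be compared by exact string match against the golden set, which is a simplification; in practice a semantic match or an LLM judge would usually decide whether a generated memory corresponds to a golden one.

```python
def memory_quality(generated: set, golden: set) -> tuple:
    """Return (precision, recall, f1) of generated memories against a golden set."""
    true_positives = len(generated & golden)
    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(golden) if golden else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, if the agent generates four memories of which two appear in a golden set of three, precision is 0.5, recall is 2/3, and F1 is 4/7.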
Memory retrieval performance metrics evaluate the agent's ability to find the right memory
at the right time.
• Recall@K: When a memory is needed, is the correct one found within the top 'K' retrieved
results? This is the primary measure of a retrieval system's accuracy.
• Latency: Retrieval is on the "hot path" of an agent's response. The entire retrieval process
must execute within a strict latency budget (e.g., under 200ms) to avoid degrading the
user experience.
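Recall@K over a set of test queries can be computed as in the sketch below, which assumes exactly one expected memory per query. `results` and `expected` are hypothetical evaluation inputs.

```python
def recall_at_k(results: list, expected: list, k: int) -> float:
    """Fraction of queries whose expected memory appears in the top-k results.

    results[i] is the ranked list retrieved for query i;
    expected[i] is the single memory that query should surface.
    """
    if not expected:
        return 0.0
    hits = sum(
        1 for retrieved, wanted in zip(results, expected) if wanted in retrieved[:k]
    )
    return hits / len(expected)
```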
End-to-End task success metrics are the ultimate test, answering the question: "Does
memory actually help the agent perform its job better?" This is measured by evaluating the
agent's performance on downstream tasks using its memory, often with an LLM "judge"
comparing the agent's final output to a golden answer. The judge determines if the agent's
answer was accurate, effectively measuring how well the memory system contributed to the
final outcome.
Evaluation is not a one-time event; it’s an engine for continuous improvement. The metrics
above provide the data needed to identify weaknesses and systematically enhance the
memory system over time. This iterative process involves establishing a baseline, analyzing
failures, tuning the system (e.g., refining prompts, adjusting retrieval algorithms), and re-
evaluating to measure the impact of the changes.
While the metrics above focus on quality, production-readiness also depends on
performance. For each evaluation area, it is critical to measure the latency of underlying
algorithms and their ability to scale under load. Retrieving memories “on the hot-path”
may have a strict, sub-second latency budget. Generation and consolidation, while often
asynchronous, must have enough throughput to keep up with user demand. Ultimately, a
successful memory system must be intelligent, efficient, and robust for real-world use.
Production considerations for Memory
In addition to performance, transitioning a memory-enabled agent from prototype to
production demands a focus on enterprise-grade architectural concerns. This move
introduces critical requirements for scalability, resilience, and security. A production-grade
system must be designed not only for intelligence but also for enterprise-level robustness.
To ensure the user experience is never blocked by the computationally expensive process of
memory generation, a robust architecture must decouple memory processing from the main
application logic. While this is an event-driven pattern, it is typically implemented via direct,
non-blocking API calls to a dedicated memory service rather than a self-managed message
queue. The flow looks like this:
1. Agent pushes data: After a relevant event (e.g., a session ends), the agent application
makes a non-blocking API call to the memory manager, "pushing" the raw source data (like
the conversation transcript) to be processed.
2. Memory manager processes in the background: The memory manager service
immediately acknowledges the request and places the generation task into its own
internal, managed queue. It is then solely responsible for the asynchronous heavy lifting:
making the necessary LLM calls to extract, consolidate, and format memories. The
manager may delay processing the events until a certain period of inactivity elapses.
3. Memories are persisted: The service writes the final memories—which may be new
entries or updates to existing ones—to a dedicated, durable database. For managed
memory managers, the storage is built-in.
4. Agent retrieves memories: The main agent application can then query this memory store
directly when it needs to retrieve context for a new user interaction.
This service-based, non-blocking approach ensures that failures or latency in the memory
pipeline do not directly impact the user-facing application, making the system far more
resilient. It also informs the choice between online (real-time) generation, which is ideal for
conversational freshness, and offline (batch) processing, which is useful for populating the
system from historical data.
As an application grows, the memory system must handle high-frequency events without
failure. Given concurrent requests, the system must prevent deadlocks or race conditions
when multiple events try to modify the same memory. You can mitigate race conditions
using transactional database operations or optimistic locking; however, this can introduce
queuing or throttling when multiple requests are trying to modify the same memories. A
robust message queue is essential to buffer high volumes of events and prevent the memory
generation service from being overwhelmed.
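Optimistic locking can be sketched with a version number per memory, as below. `MemoryStore` and `VersionConflict` are hypothetical; this in-memory version is single-threaded, whereas a real database would perform the version check and the write as one atomic compare-and-set.

```python
class VersionConflict(Exception):
    """Raised when a memory changed since the writer last read it."""


class MemoryStore:
    def __init__(self):
        self._rows = {}  # memory_id -> (version, fact)

    def read(self, memory_id: str) -> tuple:
        return self._rows.get(memory_id, (0, None))

    def write(self, memory_id: str, fact: str, expected_version: int) -> int:
        current_version, _ = self.read(memory_id)
        if current_version != expected_version:
            # Another writer got there first; the caller should re-read and retry.
            raise VersionConflict(
                f"{memory_id}: expected v{expected_version}, found v{current_version}"
            )
        self._rows[memory_id] = (current_version + 1, fact)
        return current_version + 1
```

On a conflict, the consolidation step would typically re-read the memory and rerun against the newer content rather than overwrite it.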
The memory service must also be resilient to transient errors (failure handling). If an LLM
call fails, the system should use a retry mechanism with exponential backoff and route
persistent failures to a dead-letter queue for analysis.
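A retry loop with exponential backoff and a dead-letter queue might look like the sketch below. The function name, the four-attempt limit, and the list used as a dead-letter queue are illustrative assumptions; production code would usually add jitter and retry only on transient error types.

```python
import time


def process_with_retry(task, handler, dead_letters: list, max_attempts: int = 4,
                       base_delay: float = 0.5, sleep=time.sleep):
    """Run handler(task), retrying with exponential backoff on failure."""
    for attempt in range(max_attempts):
        try:
            return handler(task)
        except Exception as exc:
            if attempt == max_attempts - 1:
                # Out of retries: park the task for later analysis.
                dead_letters.append((task, repr(exc)))
                return None
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
```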
For global applications, the memory manager must use a database with built-in multi-
region replication to ensure low latency and high availability. Client-side replication is not
feasible because consolidation requires a single, transactionally consistent view of the
data to prevent conflicts. Therefore, the memory system must handle replication internally,
presenting a single, logical datastore to the developer while ensuring the underlying
knowledge base is globally consistent.
Managed memory systems, like Agent Engine Memory Bank, should help you address these
production considerations, so that you can focus on the core agent logic.
Privacy and security risks
Memories are derived from and include user data, so they require stringent privacy and
security controls. A useful analogy is to think of the system's memory as a secure corporate
archive managed by a professional archivist, whose job is to preserve valuable knowledge
while protecting the company.
The cardinal rule for this archive is data isolation. Just as an archivist would never mix
confidential files from different departments, memory must be strictly isolated at the user or
tenant level. An agent serving one user must never have access to the memories of another;
enforce this with restrictive Access Control Lists (ACLs). Furthermore, users must have
programmatic control over their data, with clear options to opt out of memory generation or
request the deletion of all their files from the archive.
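One way to picture these controls is a store whose every operation is keyed by a user ID, so there is no code path that reads across users. The sketch below is a simplification under stated assumptions: `UserScopedMemoryStore` and its methods are hypothetical, the substring search stands in for real retrieval, and a production system would enforce isolation with ACLs in the storage layer rather than in application code alone.

```python
class UserScopedMemoryStore:
    """Hypothetical store that partitions memories strictly by user ID."""

    def __init__(self):
        self._by_user = {}       # user_id -> list of memory strings
        self._opted_out = set()  # users who disabled memory generation

    def add(self, user_id, memory):
        """Store a memory unless the user has opted out."""
        if user_id in self._opted_out:
            return False
        self._by_user.setdefault(user_id, []).append(memory)
        return True

    def search(self, user_id, query):
        """Search only within the calling user's own partition."""
        own = self._by_user.get(user_id, [])
        return [m for m in own if query.lower() in m.lower()]

    def opt_out(self, user_id):
        """Stop generating new memories for this user."""
        self._opted_out.add(user_id)

    def delete_all(self, user_id):
        """Delete every memory for the user; return how many were removed."""
        return len(self._by_user.pop(user_id, []))
```

Because `search` never accepts a query without a `user_id`, a request made on behalf of one user cannot return another user's memories through this interface.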
Before filing any document, the archivist performs critical security steps. First, they
go through each page to redact personally identifiable information (PII), ensuring
knowledge is saved without creating a liability. Second, the archivist is trained to spot and
discard forgeries or intentionally misleading documents, a safeguard against memory
poisoning [28]. In the same way, the system must validate and sanitize information before
committing it to long-term memory, using safeguards like Model Armor [29], to prevent a
malicious user from corrupting the agent's persistent knowledge through prompt injection.
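As a rough illustration of the redaction step only, the sketch below masks two kinds of PII before a candidate memory is stored. The regular expressions and the `redact_pii` helper are assumptions for demonstration: they cover just simple email addresses and one phone-number format, they do not detect prompt injection, and they are not how Model Armor works. A real deployment would rely on a dedicated inspection service.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text):
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


candidate = "Email me at jane@example.com or call 555-123-4567"
print(redact_pii(candidate))
# prints: Email me at [EMAIL] or call [PHONE]
```

Running redaction before the write, rather than at read time, means the sensitive values never reach long-term storage.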
Additionally, there is an exfiltration risk if multiple users share the same set of memories, as
with procedural memories (which teach an agent how to do something). For example, if a
procedural memory from one user is used as an example for another (like sharing a memo
company-wide), the archivist must first perform rigorous anonymization to prevent sensitive
information from leaking across user boundaries.
Conclusion
This whitepaper has explored the discipline of Context Engineering, focusing on its two
central components: Sessions and Memory. The journey from a simple conversational turn
to a piece of persistent, actionable intelligence is governed by this practice, which involves
dynamically assembling all necessary information—including conversation history, memories,
and external knowledge—into the LLM’s context window. This entire process relies on the
interplay between two distinct but interconnected systems: the immediate Session and the
long-term Memory.
The Session governs the "now," acting as a low-latency, chronological container for a single
conversation. Its primary challenges are performance and security, requiring low-latency
access and strict isolation. To prevent context window overflow and high latency, you must
use compaction techniques like token-based truncation or recursive summarization to compact
content within the Session's history or a single request payload. Furthermore, security is
paramount, mandating PII redaction before session data is persisted.
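To make token-based truncation concrete, here is a minimal sketch that keeps the most recent messages fitting within a token budget. The `truncate_history` function and the message dictionary shape are hypothetical, and the default counter simply splits on whitespace as a stand-in for the model's real tokenizer.

```python
def truncate_history(messages, max_tokens,
                     count_tokens=lambda text: len(text.split())):
    """Keep the newest messages whose combined token count fits the budget.

    Walks the history from newest to oldest and stops at the first message
    that would exceed max_tokens, then restores chronological order.
    """
    kept = []
    used = 0
    for message in reversed(messages):
        cost = count_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    return kept
```

Truncation is cheap but drops older turns entirely; recursive summarization trades extra LLM calls for retaining a condensed version of them.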
Memory is the engine of long-term personalization and the core mechanism for
persistence across multiple sessions. It moves beyond RAG (which makes an agent an expert
on facts) to make the agent an expert on the user. Memory is an active, LLM-driven ETL
pipeline, responsible for extraction, consolidation, and retrieval, that distills the most
important information from conversation history. Extraction condenses a conversation into
key memory points. Consolidation then curates and integrates this new information with the
existing corpus, resolving conflicts and deleting redundant data to ensure a coherent
knowledge base. To maintain a snappy user experience,
memory generation must run as an asynchronous background process after the agent has
responded. By tracking provenance and employing safeguards against risks like memory
poisoning, developers can build trusted, adaptive assistants that truly learn and grow with
the user.
Endnotes
1. https://cloud.google.com/use-cases/retrieval-augmented-generation?hl=en
2. https://arxiv.org/abs/2301.00234
3. https://cloud.google.com/vertex-ai/generative-ai/docs/agent-engine/sessions/overview
4. https://langchain-ai.github.io/langgraph/concepts/multi_agent/#message-passing-between-agents
5. https://google.github.io/adk-docs/agents/multi-agents/
6. https://google.github.io/adk-docs/agents/multi-agents/#c-explicit-invocation-agenttool
7. https://agent2agent.info/docs/concepts/message/
8. https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/
9. https://cloud.google.com/security-command-center/docs/model-armor-overview
10. https://ai.google.dev/gemini-api/docs/long-context#long-context-limitations
11. https://huggingface.co/blog/Kseniase/memory
12. https://langchain-ai.github.io/langgraph/concepts/memory/#semantic-memory
13. https://langchain-ai.github.io/langgraph/concepts/memory/#semantic-memory
14. https://arxiv.org/pdf/2412.15266
15. https://arxiv.org/pdf/2412.15266
16. https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/inference#sample-requests-text-gen-multimodal-prompt
17. https://cloud.google.com/vertex-ai/generative-ai/docs/agent-engine/memory-bank/generate-memories
18. https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output
19. https://cloud.google.com/agent-builder/agent-engine/memory-bank/set-up#memory-bank-config
20. https://arxiv.org/html/2504.19413v1
21. https://google.github.io/adk-docs/tools/#how-agents-use-tools
22. https://cloud.google.com/vertex-ai/generative-ai/docs/agent-engine/memory-bank/generate-memories#consolidate-pre-extracted-memories
23. https://cloud.google.com/vertex-ai/generative-ai/docs/agent-engine/memory-bank/generate-memories#background-memory-generation
24. https://arxiv.org/pdf/2503.08026
25. https://google.github.io/adk-docs/callbacks/
26. https://arxiv.org/html/2508.06433v2
27. https://cloud.google.com/blog/products/ai-machine-learning/rlhf-on-google-cloud
28. https://arxiv.org/pdf/2503.03704
29. https://cloud.google.com/security-command-center/docs/model-armor-overview
30. https://cloud.google.com/architecture/choose-design-pattern-agentic-ai-system