Agentic Memory: A Detailed Breakdown

Imagine one day you hire a brilliant freelancer. On her first day, she’s incredible, catches every bug, writes clean docs, & even suggests improvements you hadn’t thought of. You’re impressed.

On the second day, you walk in & say, “Hey, remember that issue we discussed yesterday?”

She pauses. Looks at you. Slight smile.

“Sorry… what issue?”

No memory. No context. Completely gone. You will be shocked as I am while writing this here.

That’s exactly how most LLMs behave. Every new conversation is a fresh start. The model doesn’t know who you are, what you’ve built together, or what you discussed even a few minutes ago in another chat window.

For a simple chatbot, that’s fine. But for an agent, something that runs tasks, makes decisions, and improves over time, this kind of amnesia is a dealbreaker.

Because real intelligence isn’t just about responding well. It’s about remembering, learning, and building on what came before.

Memory is what turns a stateless system into something that can actually evolve.

What is agentic memory, really?

Agentic memory isn’t just one thing. It’s more like a system working behind the scenes, different types of storage, ways to retrieve information, and smart strategies to manage it all so an agent can actually carry context over time.

The key idea is simple: memory isn’t doing one job; it’s doing three very different ones at the same time.

Continuity is about identity. It’s how the agent knows who you are, what you prefer, and what you’ve already built together. Without it, every interaction feels like starting from scratch.

Context is about the task at hand. What just happened, which tool was used, what came back as a result, and what needs to happen next. It’s what keeps multi-step workflows from falling apart.

Learning is about getting better. Understanding what worked, what didn’t, and slowly improving decisions over time instead of repeating the same mistakes.

Put together, it makes agents feel consistent, reliable, and a little more intelligent with every interaction.

Image

Designed by author @techwith_ram

A well-designed agent memory system handles all three, using different storage backends for each.

The 4-types of memory

The field has converged on four distinct memory types. Think of them as four different parts of the brain, each evolved for a specific job.

Image

Designed by author @techwith_ram

1. In-context memory

The context window is your agent's working desk. Everything on it is instantly accessible. The model can reason over it in a single forward pass. No retrieval step required.

But the desk has a size limit. Every token costs money and time. And when the session ends, the desk gets wiped clean.

What lives in context?

  • System prompt: agent persona, rules, capabilities, current date/user info
  • Conversation history: the back-and-forth so far this session
  • Tool call results: outputs from tools the agent just invoked
  • Retrieved memories: chunks pulled in from external storage
  • Scratchpad: intermediate reasoning (think-step-by-step outputs)

Image

Designed by author @techwith_ram

The sliding window problem

In long conversations, history accumulates and eventually overflows the context limit. The naive solution of truncating the oldest messages loses important early context. Better strategies:

  • Summarization: periodically compress old turns into a brief summary and replace them with the summary
  • Selective retention: keep turns that contain key facts, decisions, or tool results; discard small talk.
  • Offload to external memory: extract important facts into a vector store, then retrieve them as needed

2. External memory

External memory is anything that persists outside the model—databases, vector stores, key-value stores, and files. It survives session boundaries. Your agent can remember something from six months ago if you store it right.

There are two flavors of external storage:

Structured Store (Exact Lookup): PostgreSQL, Redis, SQLite. You query by key, ID, or SQL. Fast, predictable, great for user profiles, preferences, and structured data.

Vector Stores (Semantic Search): Pinecone, Chroma, pgvector. You query by meaning, "find memories similar to this concept." "Essential for unstructured notes and episodic recall.

Image

Designed by author @techwith_ram

The retrieval step is a bottleneck. If you don't retrieve the right memories, the agent behaves as if they don't exist. Good memory architecture is 20% storage and 80% retrieval design.

3. Episodic memory

Episodic memory is the most underappreciated type. While external memory stores facts, episodic memory stores events, specifically the outcomes of past actions.

The simplest form is a structured log: every time the agent completes a task, it records what happened. Over time, this log becomes a rich source of self-knowledge the agent can consult before making decisions.

What an episode looks like:

{
  "episode_id": "ep_20240315_003",
  "timestamp": "2024-03-15T14:23:11Z",
  "task": "Summarize 50-page PDF into 3 bullet points",
  "approach": "Sequential chunking, 2000 tokens per chunk",
  "outcome": "success",
  "duration_ms": 4820,
  "token_cost": 12400,
  "quality_score": 0.91,
  "notes": "Worked well. Hierarchical chunking would be faster.",
  "embedding": [0.023, -0.441, 0.182, /* ... 1536 dims */]
}

When a new task comes in, the agent retrieves the most semantically similar past episodes and uses them to pick a strategy. This is essentially few-shot learning from personal history rather than from a handcrafted dataset.

The reflection loop👇

Image

Designed by author @techwith_ram

Semantic/parametric memory

This is the memory the model was born with. Everything is encoded in weights during training, facts about the world, language patterns, reasoning strategies, coding conventions, and cultural knowledge.

It's always there. The agent never has to retrieve it. But it comes with hard limitations:

  • Frozen at training time: the model doesn't know what happened after its cutoff date
  • Can't be updated at runtime: you can't inject new permanent facts without retraining or fine-tuning
  • Opaque: you can't inspect exactly what the model "knows" or doesn't
  • Hallucination-prone: the model fills gaps with plausible-but-wrong completions.

For anything time-sensitive, domain-specific, or private, don't rely on parametric memory. Use external retrieval. Parametric memory is your fallback for general world knowledge when no better source exists.

The right mental model: parametric memory is the agent's general education. External, episodic, and in-context memory are the agent's on-the-job experience. The best agents combine both.

How does memory flow through an agent loop?

Let's put it all together. Here's what happens every single time an agent processes a request—showing every memory system in action.

Image

Designed by author @techwith_ram

Notice that memory operations bookend the LLM call: retrieval before, writing after. The model itself is stateless; the memory system is what gives the illusion of a stateful, aware agent.

Building a memory layer

Let's build this. We'll use Python with OpenAI for embeddings and ChromaDB as our local vector store. The same concepts apply to any other stack—swap the libraries.

Let's build this. We'll use Python with OpenAI for embeddings and ChromaDB as our local vector store. The same concepts apply to any other stack—swap the libraries.

pip install chromadb openai anthropic python-dotenv

The MemoryStore class

This handles writing memories (with embeddings) and semantic retrieval. It's the foundation everything else sits on.

import chromadb
from openai import OpenAI
from datetime import datetime
import json, uuid

class MemoryStore:
    """Persistent vector memory for an AI agent."""

    def __init__(self, agent_id: str, persist_dir: str = "./memory_db"):
        self.agent_id = agent_id
        self.openai = OpenAI()

        # ChromaDB stores vectors on disk, persists across restarts
        self.client = chromadb.PersistentClient(path=persist_dir)
        self.collection = self.client.get_or_create_collection(
            name=f"agent_{agent_id}_memories",
            metadata={"hnsw:space": "cosine"}  # cosine similarity
        )

    def _embed(self, text: str) -> list[float]:
        """Convert text to embedding vector using OpenAI."""
        response = self.openai.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def remember(
        self,
        content: str,
        memory_type: str = "general",
        metadata: dict = None
    ) -> str:
        """Store a memory. Returns the memory ID."""
        memory_id = str(uuid.uuid4())
        embedding = self._embed(content)

        meta = {
            "type": memory_type,
            "timestamp": datetime.utcnow().isoformat(),
            "agent_id": self.agent_id,
            **(metadata or {})
        }

        self.collection.add(
            ids=[memory_id],
            embeddings=[embedding],
            documents=[content],
            metadatas=[meta]
        )
        return memory_id

    def recall(
        self,
        query: str,
        k: int = 5,
        memory_type: str = None,
        min_relevance: float = 0.6
    ) -> list[dict]:
        """Retrieve the k most relevant memories for a query."""
        query_embedding = self._embed(query)

        where = {"type": memory_type} if memory_type else None

        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=k,
            where=where,
            include=["documents", "metadatas", "distances"]
        )

        memories = []
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0]
        ):
            relevance = 1 - dist  # cosine distance → similarity
            if relevance >= min_relevance:
                memories.append({
                    "content": doc,
                    "metadata": meta,
                    "relevance": round(relevance, 3)
                })

        return sorted(memories, key=lambda x: x["relevance"], reverse=True)

    def forget(self, memory_id: str):
        """Delete a specific memory (GDPR compliance, stale data, etc.)"""
        self.collection.delete(ids=[memory_id])

The EpisodicLogger class

Now let's add the episode logging layer on top.

from .store import MemoryStore
from dataclasses import dataclass, asdict
from typing import Optional
import time

@dataclass
class Episode:
    task: str
    approach: str
    outcome: str           # "success" | "partial" | "failure"
    duration_ms: int
    token_cost: int
    quality_score: float   # 0.0 – 1.0, set by evaluator or user
    notes: str = ""
    error: Optional[str] = None

class EpisodicLogger:
    def __init__(self, memory_store: MemoryStore):
        self.store = memory_store

    def log(self, episode: Episode):
        """Save an episode to memory as a searchable document."""
        # Build a rich text representation for semantic search
        doc = (
            f"Task: {episode.task}\n"
            f"Approach: {episode.approach}\n"
            f"Outcome: {episode.outcome}\n"
            f"Notes: {episode.notes}"
        )
        self.store.remember(
            content=doc,
            memory_type="episode",
            metadata={
                "outcome": episode.outcome,
                "quality_score": episode.quality_score,
                "duration_ms": episode.duration_ms,
                "token_cost": episode.token_cost,
            }
        )

    def recall_similar(self, task: str, k: int = 3) -> list[dict]:
        """Find past episodes similar to the current task."""
        return self.store.recall(
            query=task,
            k=k,
            memory_type="episode",
            min_relevance=0.65
        )

Putting it together: a memory-augmented agent

import anthropic
from memory.store import MemoryStore
from memory.episodic import EpisodicLogger, Episode
import time

class MemoryAugmentedAgent:
    def __init__(self, agent_id: str):
        self.client = anthropic.Anthropic()
        self.memory = MemoryStore(agent_id)
        self.episodes = EpisodicLogger(self.memory)

    def _build_memory_context(self, user_message: str) -> str:
        """Retrieve relevant memories and format them for injection."""
        # Semantic search for related facts
        memories = self.memory.recall(user_message, k=4)
        # Similar past task approaches
        episodes = self.episodes.recall_similar(user_message, k=2)

        context_parts = []

        if memories:
            context_parts.append("## Relevant memories\n" +
                "\n".join([
                    f"- [{m['metadata']['type']}] {m['content']}"
                    f" (relevance: {m['relevance']})"
                    for m in memories
                ])
            )

        if episodes:
            context_parts.append("## Past similar tasks\n" +
                "\n".join([
                    f"- {e['content'][:200]}..."
                    for e in episodes
                ])
            )

        return "\n\n".join(context_parts) if context_parts else ""

    def run(self, user_message: str) -> str:
        start = time.time()

        # 1. Retrieve relevant memory
        memory_context = self._build_memory_context(user_message)

        # 2. Build system prompt with injected memory
        system = """You are a helpful agent with memory.
You have access to relevant context from past interactions.
Use this context to give better, more personalized responses.
"""
        if memory_context:
            system += f"\n\n{memory_context}"

        # 3. Call the model
        response = self.client.messages.create(
            model="claude-opus-4-6",
            max_tokens=1024,
            system=system,
            messages=[{"role": "user", "content": user_message}]
        )
        answer = response.content[0].text
        duration = int((time.time() - start) * 1000)

        # 4. Save useful info to memory for next time
        self.memory.remember(
            content=f"User asked: {user_message[:200]}",
            memory_type="interaction"
        )

        # 5. Log the episode
        self.episodes.log(Episode(
            task=user_message[:200],
            approach="single-turn with memory retrieval",
            outcome="success",
            duration_ms=duration,
            token_cost=response.usage.input_tokens + response.usage.output_tokens,
            quality_score=1.0,  # would come from evaluation in prod
        ))

        return answer

Vector databases

It's the heart of any serious memory system. Instead of querying by exact match (like SQL), it finds the nearest neighbors of a vector in high-dimensional space. This is what enables semantic search — finding memories that are conceptually related even if they share no words.

How similarity search works

Every memory is converted to a vector (an array of 1,536 floats with OpenAI's embedding model). Conceptually similar texts produce similar vectors. When you query, you embed the query and find the closest vectors using cosine similarity.

import numpy as np

def cosine_similarity(a: list, b: list) -> float:
    """
    1.0  = identical meaning
    0.0  = unrelated
    -1.0 = opposite meaning
    """
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example: these two sentences will have high similarity
embedding_a = embed("The user prefers dark mode")
embedding_b = embed("They like their interface theme to be dark")
score = cosine_similarity(embedding_a, embedding_b)
# → ~0.91 (very similar)

Start with ChromaDB for local development. When you're ready to deploy, evaluate pgvector if you're already on Postgres with zero extra infrastructure. Use Pinecone or Qdrant when you need serious scale.

Memory management

Real memory systems don't just accumulate. They curate. An ever-growing, unfocused store degrades over time—retrieval gets noisier, latency climbs, and contradictory memories confuse the agent.

You need a forgetting strategy. Here are the three main approaches:

1. Time-based decay

Older memories are less relevant. Score memories by a combination of recency and semantic relevance. The formula used in research:

import math
from datetime import datetime

def memory_score(
    relevance: float,      # cosine similarity 0–1
    importance: float,     # stored at write time 0–1
    created_at: datetime,  # when memory was formed
    recency_weight: float = 0.3,
    decay_factor: float = 0.995
) -> float:
    """
    Inspired by the Generative Agents paper (Park et al., 2023).
    Balances: how relevant, how important, how recent.
    """
    hours_old = (datetime.utcnow() - created_at).total_seconds() / 3600
    recency = math.pow(decay_factor, hours_old)

    return (
        relevance * 0.4 +
        importance * 0.3 +
        recency * recency_weight
    )

2. Importance scoring at write time

When storing a memory, ask the model to score its own output for importance. Only store high-scoring items. This filters noise at the source.

import re

async def score_importance(client, content: str) -> float:
    """Ask the LLM if information is worth saving (0.0 to 1.0)."""
    
    prompt = f"""Rate the importance of saving this for future interactions. 
    0.0 = trivial (greeting)
    0.5 = moderately useful
    1.0 = critical (preferences, errors, decisions)
    
    Information: {content}
    Reply with ONLY the number."""

    try:
        response = await client.messages.create(
            model="claude-3-haiku-20240307", # Use the current available Haiku model
            max_tokens=10,
            messages=[{"role": "user", "content": prompt}]
        )
        
        # Extract the first string that looks like a float/int
        text = response.content[0].text.strip()
        match = re.search(r"[-+]?\d*\.\d+|\d+", text)
        
        if match:
            score = float(match.group())
            return max(0.0, min(1.0, score))
            
    except Exception:
        pass 
        
    return 0.5  # Default fallback

3. Periodic consolidation

Run a nightly job that merges duplicate or highly similar memories into a single canonical summary. This is analogous to how human sleep consolidates memories.

async def consolidate_memories(store: MemoryStore, similarity_threshold: float = 0.92):
    """Efficiently merge near-duplicate memories using vector search."""
    
    all_mems = store.collection.get(include=["documents", "embeddings", "ids"])
    if not all_mems["ids"]:
        return

    visited = set()
    consolidated_docs = []

    for i, (mem_id, doc, emb) in enumerate(zip(
        all_mems["ids"], all_mems["documents"], all_mems["embeddings"]
    )):
        if mem_id in visited:
            continue
            
        # Use the vector store's built-in search to find neighbors
        # this is much faster than a manual nested loop
        results = store.collection.query(
            query_embeddings=[emb],
            n_results=10, # Adjust based on expected density
            include=["documents", "distances"]
        )

        # Identify group members (1.0 - distance = cosine similarity)
        group = [doc]
        visited.add(mem_id)

        for res_id, res_doc, dist in zip(results["ids"][0], results["documents"][0], results["distances"][0]):
            sim = 1.0 - dist
            if res_id != mem_id and res_id not in visited and sim >= similarity_threshold:
                group.append(res_doc)
                visited.add(res_id)

        # Process the group
        if len(group) > 1:
            summary = await summarize_group(group) # Likely needs to be async
            consolidated_docs.append(summary)
        else:
            consolidated_docs.append(doc)

    # Atomic-ish replacement: Clear and Re-populate
    store.collection.delete(where={})
    for doc in consolidated_docs:
        await store.remember(doc)

FInal Thoughts

At the end of the day, memory is what makes an AI feel less like a tool & more like a partner. Without it, every interaction starts from zero. With it, an agent can understand, adapt, and improve over time.

The real power isn’t just in the model; it’s in how you design what the model remembers, what it forgets, and how it uses that information.

Build the memory layer right, and everything else becomes smarter.

Codes are AI-generated.

Follow @techwith_ram for more such posts.

Главная - Вики-сайт
Copyright © 2011-2026 iteam. Current version is 2.155.0. UTC+08:00, 2026-03-29 00:21
浙ICP备14020137号-1 $Гость$