Agent Memory Systems: How AI Agents Remember
A practical guide to giving AI agents memory—from simple conversation history to sophisticated retrieval systems that actually work.
I’ve been diving deep into agent memory systems over the past month. It started when I noticed something frustrating: every time I started a new conversation with an AI assistant, it had completely forgotten everything we’d discussed before. Previous context, preferences I’d mentioned, decisions we’d made together—all gone.
That’s when I realized that memory isn’t just a nice-to-have feature. It’s what separates a useful AI agent from a glorified chatbot. An assistant that forgets you every conversation isn’t really an assistant at all.
Here’s what I’ve learned about how agent memory actually works, and how to implement it yourself.
The Fundamental Problem
Large language models have no persistent memory by default. Every API call starts fresh. The model doesn’t know who you are, what you discussed yesterday, or what your preferences are. It only knows what’s in the current prompt.
This creates two immediate challenges. First, context windows are finite. Even with models supporting 128K or 200K tokens, that’s still a limit, and a few weeks of daily conversations can exceed it. Second, the cost of each call scales linearly with prompt length. Stuffing your entire conversation history into every prompt gets expensive fast.
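To put rough numbers on both challenges, here is a back-of-the-envelope sketch. It assumes every turn adds a fixed number of tokens and that each call resends the full history; the function names and the 200-tokens-per-turn figure are illustrative, not measurements.

```python
def turns_until_overflow(window_tokens: int, tokens_per_turn: int) -> int:
    """Number of turns that fit before the history alone fills the window."""
    return window_tokens // tokens_per_turn


def cumulative_prompt_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total prompt tokens sent when every call resends the whole history."""
    # Call k carries k turns of history, so the total is 1 + 2 + ... + turns.
    return tokens_per_turn * turns * (turns + 1) // 2


print(turns_until_overflow(128_000, 200))    # 640
print(cumulative_prompt_tokens(100, 200))    # 1010000
```

Under these assumptions each individual call is linear in history length, but the running total over a conversation grows quadratically, which is why trimming or compressing history pays off quickly.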
The solution is some form of external memory system—a way to store information outside the model and retrieve it when relevant.
Memory Types: A Human Analogy
To understand agent memory, it helps to think about how human memory works. Cognitive scientists typically break it into several types, and these map surprisingly well to AI systems.
Sensory memory is the raw input we perceive—sights, sounds, touches. In AI terms, this is the embedding representation of raw inputs before any processing. It’s fleeting and mostly unconscious.
Short-term memory (or working memory) holds information we’re actively using right now. Humans can hold about 7 items in working memory for 20-30 seconds. For AI agents, this maps to the context window: the current conversation, plus whatever else is in the prompt, that the model can see on a given call. It’s limited and temporary.
Long-term memory stores information for extended periods, potentially forever. This is where it gets interesting for AI systems. Long-term memory breaks down further into episodic memory (specific events: “the user mentioned they’re traveling to Tokyo next month”), semantic memory (facts and concepts: “the user is vegetarian”), and procedural memory (skills and habits: “the user prefers bullet points in summaries”).
For practical agent development, the key insight is that you need different strategies for short-term versus long-term memory. Trying to handle everything the same way leads to either context overflow or lost information.
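One way to carry the long-term memory types into code is to tag each stored record with its kind. This is a minimal sketch; `MemoryKind`, `MemoryRecord`, and `by_kind` are hypothetical names chosen for illustration, not part of any library.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class MemoryKind(Enum):
    EPISODIC = "episodic"      # specific events
    SEMANTIC = "semantic"      # facts and concepts
    PROCEDURAL = "procedural"  # skills and habits


@dataclass
class MemoryRecord:
    text: str
    kind: MemoryKind
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def by_kind(records, kind):
    """Return only the records of one memory kind."""
    return [r for r in records if r.kind is kind]


examples = [
    MemoryRecord("User is traveling to Tokyo next month", MemoryKind.EPISODIC),
    MemoryRecord("User is vegetarian", MemoryKind.SEMANTIC),
    MemoryRecord("User prefers bullet points in summaries", MemoryKind.PROCEDURAL),
]
```

Tagging records this way lets you treat each kind differently later, for example letting episodic memories age out while keeping semantic facts until they are contradicted.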
Implementation Approaches
There are four main approaches to implementing agent memory, ranging from simple to sophisticated.
Sliding Window
The simplest approach is keeping the last N messages in context. When you exceed N, you drop the oldest messages. This is what most chatbot interfaces do by default.
def sliding_window(messages, max_messages=20):
    """Keep the system prompt plus the most recent non-system messages."""
    # Separate the system prompt so it is never dropped or duplicated
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Slicing is a no-op when there are fewer than max_messages
    return system + rest[-max_messages:]
The advantage is simplicity—there’s nothing to configure, no external dependencies. The disadvantage is that you lose information permanently. If the user mentioned something important 25 messages ago, it’s gone.
Sliding windows work fine for simple Q&A bots where each conversation is self-contained. They fall apart for anything requiring continuity across sessions.
Summarization
A smarter approach is to periodically summarize older conversations, compressing them into a shorter form that preserves key information.
async def summarize_old_context(messages, llm_client, summary_threshold=15):
    """Summarize older messages to compress context."""
    if len(messages) < summary_threshold:
        return messages

    # Split into old and recent
    old_messages = messages[:-10]
    recent_messages = messages[-10:]

    # Generate summary of old messages
    summary_prompt = f"""Summarize the key points from this conversation history.
Focus on: user preferences, decisions made, important facts mentioned.
Conversation:
{format_messages(old_messages)}
"""
    summary = await llm_client.complete(summary_prompt)

    # Return summary + recent messages
    return [
        {"role": "system", "content": f"Previous conversation summary: {summary}"},
        *recent_messages
    ]
This approach compresses context while preserving information. A 50-message conversation might compress to a 200-token summary. The tradeoff is lossy compression—you can’t perfectly reconstruct the original conversation from a summary, and the summarization itself costs tokens.
Summarization works well for ongoing conversations where you need general context but not exact details. It’s less suitable when precise recall matters.
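The summarization snippet calls `format_messages` without defining it. One plausible version, offered as an assumption about what that helper does rather than the original implementation, renders each message as a `role: content` line:

```python
def format_messages(messages):
    """Render chat messages as 'role: content' lines for a summary prompt."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


history = [
    {"role": "user", "content": "I prefer morning meetings."},
    {"role": "assistant", "content": "Noted, I will suggest mornings."},
]
print(format_messages(history))
```

Keeping the role labels in the text helps the summarizer tell user preferences apart from assistant suggestions.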
RAG-Based Retrieval
The most flexible approach treats memories as documents in a retrieval system. Every message or extracted fact gets embedded and stored in a vector database. At query time, you retrieve the most relevant memories based on semantic similarity. (For a deeper dive into RAG architectures, see my earlier article on RAG Done Right.)
# openai v1.x, chromadb v0.4.x (Feb 2026)
import uuid
from openai import OpenAI
import chromadb

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("agent_memory")

def store_memory(text, metadata=None):
    """Store a memory with its embedding."""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    embedding = response.data[0].embedding
    collection.add(
        documents=[text],
        embeddings=[embedding],
        metadatas=[metadata or {}],
        ids=[str(uuid.uuid4())]
    )

def retrieve_memories(query, top_k=5):
    """Retrieve relevant memories for a query."""
    response = client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    )
    query_embedding = response.data[0].embedding
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    return results["documents"][0]
RAG-based memory scales well because approximate nearest-neighbor indexes such as HNSW keep query time roughly logarithmic in the number of stored memories. You can have thousands of memories and still retrieve relevant ones quickly.
The challenge is that semantic similarity isn’t always what you want. Sometimes the most relevant memory isn’t the most semantically similar to the current query. A user asking “what should I eat tonight?” might benefit from remembering they mentioned a peanut allergy three weeks ago, but “peanut allergy” isn’t semantically similar to dinner recommendations.
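One common workaround is to keep a small set of "pinned" memories that are always included, regardless of similarity, and fill the remaining slots with search hits. The sketch below assumes memories are plain strings; `merge_with_pinned` is a hypothetical helper, not part of Chroma or any other library.

```python
def merge_with_pinned(retrieved, pinned, limit=5):
    """Put always-on memories first, then fill remaining slots with search hits."""
    merged = list(dict.fromkeys(pinned))  # dedupe while keeping order
    for text in retrieved:
        if len(merged) >= limit:
            break
        if text not in merged:
            merged.append(text)
    return merged


pinned = ["User has a peanut allergy"]
retrieved = ["User likes Thai food", "User has a peanut allergy", "User cooks on weekends"]
context_memories = merge_with_pinned(retrieved, pinned, limit=2)
```

The pinned list should stay short, since every entry spends context on every call.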
Hybrid Approaches
In practice, the best systems combine multiple approaches. A typical hybrid might use a sliding window for the immediate conversation, summarization for older context in the current session, and RAG retrieval for long-term memories across sessions.
graph TD
A[User Message] --> B{Context Builder}
B --> C[Long-term Memory<br/>Vector DB Retrieval]
B --> D[Session Summary<br/>Compressed History]
B --> E[Recent Messages<br/>Sliding Window]
C --> F[Combined Context]
D --> F
E --> F
F --> G[LLM]
G --> H[Response]
H --> I{Memory Writer}
I --> J[Extract Facts]
J --> K[Score Importance]
K -->|High| C
async def build_context(current_messages, user_id, llm_client, memory_store):
    """Build context using hybrid memory approach."""
    context_parts = []

    # 1. Retrieve relevant long-term memories
    recent_query = current_messages[-1]["content"]
    memories = memory_store.retrieve(user_id, recent_query, top_k=5)
    if memories:
        context_parts.append(f"Relevant memories:\n{format_memories(memories)}")

    # 2. Add session summary if conversation is long
    if len(current_messages) > 15:
        # summarize_old_context splits off the last 10 messages itself and
        # returns a message list whose first entry holds the summary text
        compressed = await summarize_old_context(current_messages, llm_client)
        context_parts.append(f"Earlier in this conversation:\n{compressed[0]['content']}")

    # 3. Include recent messages directly (sliding window)
    recent = current_messages[-10:]

    # Combine into final context; skip the system message if there is nothing to add
    if not context_parts:
        return recent
    system_context = "\n\n".join(context_parts)
    return [{"role": "system", "content": system_context}] + recent
This gives you the benefits of each approach: immediate context from the sliding window, session continuity from summarization, and long-term recall from RAG.
What to Store
Deciding what to store is as important as how to store it. There are three main strategies.
Raw messages are the simplest—just store every user and assistant message verbatim. This preserves everything but creates a lot of noise. Most messages aren’t worth remembering long-term.
Extracted facts use an LLM to pull out structured information from conversations: “User prefers morning meetings,” “User is allergic to shellfish,” “User’s project deadline is March 15.” This is more compact and searchable but requires an extraction step and can miss nuance.
Storing both gives you maximum flexibility: raw messages for detailed recall, plus extracted facts for efficient retrieval. This is what systems like Mem0 do: they maintain a compressed “memory” representation alongside the raw history.
import json

async def extract_facts(message, llm_client):
    """Extract memorable facts from a message."""
    prompt = f"""Extract any facts worth remembering from this message.
Focus on: preferences, personal details, important dates, decisions made.
Return as a JSON array of strings, or empty array if nothing notable.
Message: {message}
"""
    response = await llm_client.complete(prompt, response_format="json")
    return json.loads(response)

# Usage (run inside an async function, with the same llm_client wrapper)
user_message = "I'm vegetarian and trying to lose weight. Also, my anniversary is next Tuesday."
facts = await extract_facts(user_message, llm_client)
# Example result:
# ["User is vegetarian", "User is trying to lose weight", "User's anniversary is next Tuesday"]
When to Write Memories
You have several options for when to commit information to long-term memory.
Every turn captures everything but generates a lot of writes and potentially stores irrelevant chatter.
End of session batches writes but requires defining what a “session” is. For async messaging, this gets fuzzy.
Explicit triggers only store when the user says something like “remember this” or when the agent decides something is important. This is selective but might miss implicit preferences.
Importance scoring uses an LLM to rate how memorable each piece of information is, only storing above a threshold.
import re

async def score_importance(message, llm_client):
    """Score how important a message is to remember long-term (0-10)."""
    prompt = f"""Rate how important this information is to remember for future conversations.
10 = Critical (allergies, major life events, strong preferences)
5 = Useful (minor preferences, context)
0 = Ephemeral (small talk, transient questions)
Message: {message}
Return just the number.
"""
    reply = await llm_client.complete(prompt)
    # The model may wrap the number in extra text, so pull out the first integer
    match = re.search(r"\d+", reply)
    score = int(match.group()) if match else 0
    # Clamp to the 0-10 scale in case the model goes out of range
    return max(0, min(10, score))
Most production systems use a combination: always store extracted facts above a certain importance threshold, plus allow explicit “remember this” triggers.
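That combined policy can be sketched as a single decision function. The trigger phrases and the threshold of 7 are illustrative assumptions, and `should_store` is a hypothetical helper that takes the score produced by something like `score_importance`.

```python
import re

# Phrases that count as an explicit request to remember (illustrative list)
REMEMBER_PATTERN = re.compile(
    r"\b(?:remember (?:this|that)|don['’]t forget|keep in mind)\b",
    re.IGNORECASE,
)


def should_store(message: str, importance: int, threshold: int = 7) -> bool:
    """Store on an explicit trigger, or when the importance score is high enough."""
    if REMEMBER_PATTERN.search(message):
        return True
    return importance >= threshold
```

The explicit trigger is checked first so that a user request always wins, even when the scorer rates the message as unimportant.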
Memory Retrieval Strategies
Storing memories is only half the problem. Retrieving the right ones at the right time is equally important.
Pure semantic search retrieves memories most similar to the current query. This works well when the user is asking about something directly related to past conversations.
Recency weighting boosts recently stored memories. Something mentioned yesterday is probably more relevant than something from six months ago, all else being equal.
# scikit-learn v1.x (Feb 2026)
import math
from datetime import datetime
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_with_recency(query_embedding, memories, recency_weight=0.2):
    """Retrieve memories with recency boost."""
    now = datetime.now()
    scored = []
    for memory in memories:
        # Cosine similarity in [-1, 1]; scikit-learn expects 2D inputs
        similarity = cosine_similarity([query_embedding], [memory.embedding])[0][0]
        # Recency score (exponential decay)
        age_days = (now - memory.created_at).days
        recency = math.exp(-age_days / 30)  # 30-day time constant (half-life of about 21 days)
        # Combined score
        score = (1 - recency_weight) * similarity + recency_weight * recency
        scored.append((memory, score))
    return sorted(scored, key=lambda x: x[1], reverse=True)
Importance weighting prioritizes memories marked as more important during storage. Critical information (allergies, major deadlines) should surface even if not semantically similar to the current query.
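A simple way to fold importance into ranking is to add it as a third weighted term next to similarity and recency. This sketch assumes an importance score on the same 0-10 scale used at write time; the weights are arbitrary starting points, not tuned values, and `combined_score` is a hypothetical helper.

```python
import math


def combined_score(similarity, age_days, importance,
                   w_sim=0.6, w_rec=0.2, w_imp=0.2):
    """Blend semantic similarity, recency, and stored importance into one score."""
    recency = math.exp(-age_days / 30)       # same decay as the recency example
    return w_sim * similarity + w_rec * recency + w_imp * (importance / 10)


# A critical but weakly similar memory vs. a more similar piece of small talk
allergy = combined_score(similarity=0.2, age_days=21, importance=10)
small_talk = combined_score(similarity=0.4, age_days=21, importance=1)
print(allergy > small_talk)
```

With these weights the allergy memory outranks the small talk despite its lower similarity. If critical facts must never be missed, a hard rule that always includes them is safer than relying on weights alone.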
Contextual retrieval considers not just the current message but the recent conversation flow. If the user has been discussing travel for the last five messages, travel-related memories should be boosted.
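A lightweight version of contextual retrieval is to build the search query from the last few user turns instead of only the latest one. The helper name and the default window of five are assumptions for illustration.

```python
def build_retrieval_query(messages, window=5):
    """Join the last few user turns into one retrieval query."""
    user_turns = [m["content"] for m in messages if m["role"] == "user"]
    return "\n".join(user_turns[-window:])


conversation = [
    {"role": "user", "content": "I need a vacation."},
    {"role": "assistant", "content": "Where would you like to go?"},
    {"role": "user", "content": "Flights to Tokyo?"},
    {"role": "assistant", "content": "Here are some options."},
    {"role": "user", "content": "What about hotels?"},
]
query = build_retrieval_query(conversation, window=2)
```

Here the query embeds both "Flights to Tokyo?" and "What about hotels?", so travel-related memories can match even though the last message never mentions a destination.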
Tools and Frameworks
You don’t have to build memory systems from scratch. Several tools handle the complexity for you.
Mem0 (GitHub, Docs) is a dedicated memory layer for AI applications. It automatically extracts and compresses memories from conversations, claims up to 80% reduction in token usage, and integrates with major frameworks. The pitch is “add memory with one line of code.”
# mem0 v0.1.x (Feb 2026)
from mem0 import Memory
memory = Memory()
# Store from conversation
memory.add("I prefer window seats on flights", user_id="user_123")
# Retrieve relevant memories
memories = memory.search("booking a flight", user_id="user_123")
Letta (GitHub, Docs) — formerly MemGPT — takes a different approach where agents manage their own memory. The agent can explicitly decide to store, update, or delete memories using tool calls. This gives more control but requires the agent to be “memory-aware.”
LangGraph (GitHub, Docs) provides checkpointing and state persistence for agent workflows. It’s lower-level than Mem0 but gives you full control over what gets stored and when.
Vector databases provide the retrieval infrastructure for any RAG-based memory system:
- Pinecone (Docs) — managed, serverless
- Weaviate (GitHub) — open source, hybrid search
- Chroma (GitHub) — lightweight, great for local dev
- Qdrant (GitHub) — open source, production-ready
For most use cases, I’d recommend starting with Mem0 or a similar managed solution. Build custom only if you have specific requirements that off-the-shelf tools can’t handle.
Failure Modes
Memory systems can fail in several ways, and understanding these helps you build more robust solutions.
Memory pollution happens when incorrect or outdated information gets stored and keeps resurfacing. If the agent misunderstands something and stores it as a fact, that error persists. You need mechanisms to correct or delete bad memories.
Retrieval failures occur when the right memory exists but doesn’t get retrieved. This happens when the semantic similarity between the query and memory is low despite being relevant. Expanding retrieval with multiple query reformulations can help.
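Query reformulation can be sketched as running several phrasings through the same search and merging the hits. In the snippet below, `search` stands for any function that maps a query string to a list of memory texts; the dictionary is a stand-in for a real vector search, and `multi_query_retrieve` is a hypothetical helper.

```python
from typing import Callable, Iterable, List


def multi_query_retrieve(queries: Iterable[str],
                         search: Callable[[str], List[str]],
                         top_k: int = 5) -> List[str]:
    """Run every reformulation and merge hits in first-seen order, no duplicates."""
    merged: dict = {}
    for query in queries:
        for text in search(query):
            merged.setdefault(text, None)
    return list(merged)[:top_k]


# Stand-in for a real vector search, keyed by exact query string
fake_index = {
    "what should I eat tonight?": ["User likes Thai food"],
    "dietary restrictions and allergies": ["User has a peanut allergy", "User likes Thai food"],
}
hits = multi_query_retrieve(fake_index.keys(), lambda q: fake_index.get(q, []))
```

The reformulations themselves would typically come from an LLM prompt such as "list related topics worth checking," which costs an extra call per turn.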
Contradiction handling is tricky when users change their minds. If someone says they’re vegetarian, then a year later says they eat fish now, the system needs to handle both memories intelligently—ideally updating or deprecating the old one.
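One way to deprecate rather than delete is to key each fact by the slot it fills and mark older values inactive when a new one arrives. This is a sketch under the assumption that facts can be assigned a key such as "diet"; `Fact`, `upsert_fact`, and `current_value` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fact:
    key: str            # hypothetical slot name, e.g. "diet"
    value: str
    active: bool = True


def upsert_fact(facts: List[Fact], key: str, value: str) -> None:
    """Deprecate any active fact for the same key, then append the new one."""
    for fact in facts:
        if fact.key == key and fact.active:
            fact.active = False
    facts.append(Fact(key, value))


def current_value(facts: List[Fact], key: str) -> Optional[str]:
    """Return the active value for a key, or None if there is none."""
    for fact in reversed(facts):
        if fact.key == key and fact.active:
            return fact.value
    return None


facts: List[Fact] = []
upsert_fact(facts, "diet", "vegetarian")
upsert_fact(facts, "diet", "eats fish")
```

Keeping the old record around, flagged inactive, preserves the history in case the change itself is worth remembering.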
Privacy leakage is a risk when memories from one context inappropriately surface in another. If a user shares sensitive information in a private conversation, that shouldn’t appear when they’re using the agent in a shared setting. For a deeper look at agent security risks, see Prompt Injection & Agent Security.
Context overflow still happens even with memory systems if you retrieve too many memories or generate summaries that are too long. You need to budget your context window across immediate conversation, retrieved memories, and system instructions.
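Budgeting can be as simple as giving each part of the prompt a token allowance and cutting off once it is spent. The sketch below uses a rough four-characters-per-token estimate as a default, which is only a heuristic; in practice you would pass in your model's tokenizer. `fit_to_budget` is a hypothetical helper.

```python
def fit_to_budget(items, budget_tokens, count_tokens=lambda text: len(text) // 4 + 1):
    """Keep items in priority order until the token budget is spent."""
    kept, used = [], 0
    for text in items:
        cost = count_tokens(text)
        if used + cost > budget_tokens:
            break
        kept.append(text)
        used += cost
    return kept


# Memories are assumed to be sorted most-relevant first
ranked_memories = ["User has a peanut allergy", "User likes Thai food", "User cooks on weekends"]
selected = fit_to_budget(ranked_memories, budget_tokens=13)
```

Calling this once per section (retrieved memories, summary, recent messages) with separate budgets keeps any one source from crowding out the others.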
When You Don’t Need Memory
Memory adds complexity, latency, and cost. It’s not always worth it.
Single-turn tasks like code generation, translation, or one-off questions don’t benefit from remembering past interactions.
Stateless APIs where each request should be independent by design don’t need memory. A customer service bot that routes tickets probably shouldn’t remember previous conversations.
Privacy-sensitive contexts might require that the agent explicitly not remember anything. Healthcare or legal applications might have compliance requirements around data retention.
High-volume, low-value interactions rarely justify memory, because storing memories would cost more than the value they provide. A bot handling millions of simple FAQ queries probably doesn’t need to remember each one.
The Bottom Line
Agent memory transforms a forgetful chatbot into a genuine assistant. The key insights are:
Short-term memory (current conversation) and long-term memory (persistent storage) require different strategies. Don’t try to solve both with the same approach.
Hybrid systems work best in practice. Combine sliding windows for immediate context, summarization for session history, and RAG retrieval for long-term recall.
What you store matters as much as how you store it. Extract facts, score importance, and be selective. Not every message deserves to be remembered.
Retrieval is half the battle. Semantic similarity alone isn’t enough—consider recency, importance, and conversational context.
Start with existing tools like Mem0 before building custom. Memory systems have subtle edge cases that battle-tested libraries handle better than naive implementations.
The goal isn’t perfect recall. It’s making the agent feel like it actually knows you—remembering the things that matter while gracefully forgetting the noise.