How AI Agents Remember: Different Approaches for Different Needs

Why some AI memory systems feel like they forget everything, and what it takes to build memory that actually works for exploratory work.

Disclosure: I’ve built a memory system using the layered approach described here.

Scope note: This approach works well for personal AI assistants — single-user context, exploratory work, no compliance requirements. Enterprise settings face different constraints: data retention policies, multi-user isolation, audit trails, GDPR compliance. These may restrict where this approach is viable. Controlled deletion, role-based access, and formal data governance aren’t impossible, but they add significant engineering complexity and may require architectural trade-offs that change the design.


You’re working with an AI assistant. You mention you’re working on a project. Next session, you mention it again. And again. Each time, the AI treats it as new information.

This isn’t a bug. It’s a design choice — and understanding the options reveals something important about how we think, not just how AI should work.


Why Understanding Matters

When exploring ideas with an AI, I’d expect it to build on previous conversations the way a colleague would — remembering what we discussed, connecting threads. It didn’t. That gap between expectation and reality drove me to understand how memory systems actually work.

I could have adopted an existing solution. But adoption alone wouldn’t teach me where these systems break or what trade-offs they make. In a field where new solutions appear weekly, understanding principles matters more than knowing any single implementation.

This article isn’t “build your own memory system.” It’s “understand how memory systems differ, so you can choose what fits your needs — or know what you’re building if you go custom.”


What Existing Systems Do

Most AI memory systems fall into a few categories:

The OS Model (MemGPT/Letta)

Treats context like RAM. The agent manages what stays “in memory” versus what goes to “disk storage.” When it needs something, it fetches it back.

Works well for: Extending context windows, managing limited tokens, letting agents self-organize their memory.

Less suited for: When you want control over what’s stored and how it’s organized.

The Knowledge Graph Model (Zep, LangMem)

Stores facts as entities and relationships. “User prefers dark mode.” “User is working on Project X.”

Works well for: Capturing explicit facts, building structured profiles, answering specific questions about users.

Less suited for: The messier process of developing understanding over time.

The Extraction Model

Watches conversations and extracts facts, entities, and relationships. “The user prefers dark mode.” “The user’s deadline is Friday.”

Works well for: Projects with clear milestones, task-oriented assistants, capturing explicit commitments, customer support bots.

Less suited for: Exploratory work where understanding emerges over multiple conversations.


Important caveat: These categories are simplifications. MemGPT also supports user-editable core memory blocks. Zep tracks temporal evolution of context. Real systems have more nuance than any taxonomy can capture.


What These Systems Assume

Each approach makes an assumption about when memory happens:

  • Automatic extraction during conversation. The system identifies “important” information and stores it without human oversight.
  • Decisions captured mid-process. Often before you’ve had time to reflect.
  • Knowledge structured as facts. Entities and relationships, not patterns and explorations.

These assumptions work well for many use cases. If you’re building a customer support bot, automatic extraction is exactly right — you want to capture that the customer’s issue was X and it was resolved with Y.

But they don’t match how exploration works.


A Different Approach: Layered Memory

One possible framework — there are others, and your needs may differ.

When I looked at how I actually work on something complex, I found a different pattern:

Different categories, different purposes.

Not “important vs unimportant” — but different kinds of memory serving different needs. This is one taxonomy that works for me; it’s not meant to be complete:

  • Conversations: session flow, context. Example: “We discussed authentication approaches”
  • Projects: active work, decisions. Example: “EPM Audit CLI is in Phase 7”
  • Notes: structured topics. Example: “Model comparison: GLM-5 vs Kimi K2.5”
  • Decisions: crystallized insights. Example: “Use PydanticAI for type safety”
  • Documents: formal knowledge. Example: “OCI IAM vs IDCS architecture”

Each category has different retention, different access patterns, different purposes. You might need different categories — Preferences, Skills, Mistakes to avoid — depending on your work.

Hybrid approaches exist.

This isn’t “delayed is better than immediate.” Some systems combine both: initial capture for critical items (deadlines, commitments), with refinement cycles for deeper understanding. The question isn’t which is superior — it’s what your use case needs.

Aging matters.

Things from years ago may no longer be accurate. “Eggs are bad for health” — replaced by new research. “Use technology X” — replaced by technology Y.

Memory isn’t just about storing. It’s about maintaining relevance over time. But detecting staleness is hard: time-based decay works for some information, contradiction detection for others, and explicit invalidation for clear-cut cases. This is an unsolved problem.
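Time-based decay, the simplest of those staleness signals, can be sketched as an exponential half-life. Everything here (the function name, the 90-day half-life) is illustrative, not part of any particular system:

```python
from datetime import datetime, timedelta

def recency_weight(written: datetime, now: datetime, half_life_days: float = 90.0) -> float:
    """Exponential decay: an entry loses half its weight every `half_life_days`."""
    age_days = (now - written).total_seconds() / 86400
    return 0.5 ** (age_days / half_life_days)

now = datetime(2025, 6, 1)
fresh = recency_weight(now - timedelta(days=7), now)    # a week-old note, still near 1.0
stale = recency_weight(now - timedelta(days=365), now)  # a year-old note, heavily decayed
```

Decay alone can’t tell “old but still true” from “old and superseded,” which is why contradiction detection and explicit invalidation still matter.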

Distillation happens after reflection.

Decisions aren’t extracted mid-conversation. They emerge after revisiting from multiple angles — code review, document review, exploration — until patterns become clear.


How This Actually Works: A Real Example

Let me show you the pieces that make this work in practice:

fmem — Local semantic search over all indexed content. Past conversations, notes, decisions — when a topic comes up, fmem surfaces relevant chunks regardless of when they were written. A passing comment from two months ago might suddenly matter.

How it ranks results: Semantic similarity (50%), recency (30%), and location importance (20%). Documentation and decisions rank higher than casual chats. This is for retrieval only — distillation is a separate process.
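As a sketch of that blended ranking (the tier values and function are hypothetical; fmem’s actual internals may differ):

```python
# Hypothetical scoring in the spirit of the stated weights:
# 50% semantic similarity, 30% recency, 20% location importance.

LOCATION_WEIGHT = {        # assumed tiers, not fmem's actual values
    "decisions": 1.0,
    "documents": 0.9,
    "notes": 0.7,
    "memory": 0.4,         # raw daily chats rank lowest
}

def rank_score(similarity: float, recency: float, location: str) -> float:
    """All inputs normalized to [0, 1]; returns a blended retrieval score."""
    loc = LOCATION_WEIGHT.get(location, 0.5)
    return 0.5 * similarity + 0.3 * recency + 0.2 * loc

# A highly similar but old decision can outrank a vaguely similar recent chat:
old_decision = rank_score(similarity=0.9, recency=0.2, location="decisions")  # 0.71
recent_chat = rank_score(similarity=0.5, recency=1.0, location="memory")      # 0.63
```

The point of the location term is editorial: curated layers earn trust, so they surface first even when they’re older.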

Daily files — Every session writes to memory/YYYY-MM-DD.md. Raw, uncurated, everything. Loaded in context for the current day(s), then archived and retrieved via fmem when relevant later.

MEMORY.md — The curated profile. ~500 tokens of preferences, active projects, key patterns. Always loaded, never archived.

Cron job — Every night at 2am, an agent reads the past few days of daily files and updates MEMORY.md with distilled patterns. One-liner entries only — “Use PydanticAI for audit automation” — with pointers to notes/ for details.

Notes folder — Structured by topic, not time. Notes come from two sources: distilled from reflection (crystallizing insights from daily files) or captured from recognition (when structure emerges in conversation). These aren’t decisions extracted mid-stream — they’re intermediate patterns worth preserving. I save them immediately, not to finalize, but to hold the shape of an exploration before it dissolves.

Decisions folder — Final decisions with context. When a note becomes stable knowledge, it becomes a decision.

Monthly maintenance — Another cron job checks if notes/ has grown too large. If so, it offers to consolidate: group related notes, archive old ones, update the index.

This isn’t one tool. It’s a pipeline:

 1. Daily files (raw)
 2. fmem indexes everything
 3. Cron distills patterns → MEMORY.md
 4. Topics emerge → notes/
 5. Decisions crystallize → decisions/
 6. Monthly review cleans up

The key insight: the system doesn’t auto-extract during conversation. It captures everything raw, then distills later with supervision. But you can intentionally capture — when structure emerges, saving it as a note preserves the exploration. The difference is intention: automatic extraction vs. deliberate recognition.


A Concrete Example — And Where It Fails

Here’s a timeline:

Day 1: You discuss three different approaches to authentication for a project. The conversation is saved in daily memory. No decision is extracted — you were exploring.

Day 5: You’re reviewing the project notes and add to a structured note: “Authentication approaches compared: OAuth2, API keys, mTLS. Leaning toward OAuth2 for external, API keys for internal.”

Day 12: After implementation attempts and discussions, you write a decision: “Use OAuth2 for external APIs, API keys for internal services. mTLS adds complexity without benefit for our use case.”

The decision wasn’t available on Day 1. It emerged from exploration, comparison, and implementation attempts. Extraction on Day 1 would have captured the wrong thing.

But this approach can also fail:

  • Wrong crystallization: If Day 5 notes captured a misunderstanding, Day 12 builds on it incorrectly. Delayed distillation doesn’t guarantee correctness — it can entrench errors.
  • Lost context: Important details from Day 1 might not make it to notes. The gap between raw capture and crystallized decision is where context leaks.
  • Unnecessary delay: Sometimes the decision IS available on Day 1. Waiting 12 days to crystallize what you already knew isn’t wisdom — it’s overhead.
  • Confirmation bias: Multiple passes can amplify patterns that aren’t really there, especially if you’re revisiting the same assumptions.

No approach is magic. Each has failure modes.


What Good Memory Looks Like

Captures everything, categorizes by purpose. Not “important vs unimportant” — “working context vs archive vs decision.”

Searches semantically. “How did we approach authentication?” finds relevant notes even if the exact words differ.

Distills after reflection. Decisions emerge from notes. Notes capture exploration. Daily logs capture everything.

Ages gracefully. Old information gets replaced. What was true may no longer be. (Detecting what’s stale is the hard part.)

Works at the chunk level. A section within a document is the right granularity for many use cases. Full files are too noisy. Keywords are too sparse. (Though different retrieval tasks may need different granularities — this is still a research question.)
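One common way to get section-level chunks from markdown is to split at headings. A minimal sketch (the regex and function name are mine, not any particular indexer’s):

```python
import re

def chunk_by_section(markdown: str) -> list[str]:
    """Split a markdown document at headings (levels 1-3), so each chunk
    is one section: finer than whole files, coarser than sentences."""
    parts = re.split(r"(?m)^(?=#{1,3} )", markdown)
    return [p.strip() for p in parts if p.strip()]

doc = """# Auth notes
Compared OAuth2, API keys, mTLS.

## Decision
OAuth2 externally, API keys internally.
"""
chunks = chunk_by_section(doc)  # two chunks: the intro and the Decision section
```

Each chunk then gets its own embedding, so a query can land on the one section that matters instead of the whole file.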

Has privacy boundaries. Not everything should be remembered. Some contexts should stay isolated. GDPR and data retention policies matter.


Engineering Reality

The core idea is straightforward: write everything to daily files, index them with semantic search, distill patterns into a profile, organize topics into notes. A cron job, a search index, a structured folder system.

What makes this work isn’t complexity — it’s the separation of concerns. Daily files capture raw. Notes organize by topic. Decisions crystallize insights. Each layer has one job.

The daily files should bloat. That’s the point — they’re the raw material for distillation. You write everything because you don’t know what matters yet. A passing observation today might become relevant in three months when a new context emerges. fmem indexes it all, so the past stays searchable and can surface when patterns connect.
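The raw-capture step itself is a few lines. A sketch, assuming the memory/ layout above (append_to_daily is a hypothetical helper):

```python
from datetime import date
from pathlib import Path

def append_to_daily(text: str, memory_dir: Path = Path("memory")) -> Path:
    """Append a raw session note to today's daily file (memory/YYYY-MM-DD.md).
    No filtering, no judgment about importance: that happens at distillation."""
    memory_dir.mkdir(exist_ok=True)
    daily = memory_dir / f"{date.today():%Y-%m-%d}.md"
    with daily.open("a", encoding="utf-8") as f:
        f.write(text.rstrip() + "\n\n")
    return daily
```

Because capture is append-only and cheap, there is no pressure to decide in the moment what matters.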

For modern agents, implementing this isn’t hard. The challenge isn’t technical — it’s knowing what to distill, when, and how to keep it useful over time.


Other Approaches Worth Knowing

This isn’t exhaustive. Other memory systems include:

  • Mem0 — User-facing memory with profiles and preferences
  • Cognee — Knowledge graph memory for AI applications
  • Supermemory — Cloud-based memory layer
  • OpenAI Responses API — Built-in memory for conversations

Each has different trade-offs. The landscape is evolving quickly.


The Key Insight

Different use cases need different memory approaches.

If you’re building a support bot, extraction is exactly right. Capture the issue, track the resolution, build a knowledge base.

If you’re working alongside an AI on complex problems, you might want something different: a system that preserves the messy process of exploration, lets you revisit from different angles, and crystallizes decisions when patterns become clear.

Both are correct. They just serve different needs. Some systems combine both — initial capture for critical items, refinement cycles for deeper understanding.


Postscript

The conversations that matter most aren’t always the ones where you made a decision. They’re often the ones where you explored an idea from multiple angles, and something clicked later.

Good memory systems should preserve that process — for those who need it.