Memory & Persistence

retrievalagent is built on LangGraph and supports two levels of memory:

                    checkpointer=                 memory_store=
Scope               Per thread (conversation)     Per user (cross-thread)
What's stored       Full graph state              Key Q&A facts
Survives restarts   With SQLite/Postgres          With a Postgres store
Use case            Resume a conversation         Remember user preferences across sessions

For simple multi-turn chat within a single session, the history= parameter on chat() is enough — no config needed.
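A minimal sketch of that pattern, assuming chat() returns the answer text and history= accepts the prior turns as a list (the exact shape of history= is an assumption; check the chat() reference):

from retrievalagent import init_agent

rag = init_agent("docs", model="openai:gpt-5.4", backend="qdrant")

# Hypothetical shape: feed the earlier exchange back in on the next turn.
first = rag.chat("What is hybrid search?")
followup = rag.chat("Give me an example.", history=[("What is hybrid search?", first)])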

In-process memory (MemorySaver)

Lost on restart. Good for single-session apps or testing.

from retrievalagent import init_agent
from langgraph.checkpoint.memory import MemorySaver

rag = init_agent(
    "docs",
    model="openai:gpt-5.4",
    backend="qdrant",
    backend_url="http://localhost:6333",
    checkpointer=MemorySaver(),
)

config = {"configurable": {"thread_id": "user-alice"}}

state = rag.invoke("What is hybrid search?", config=config)
state = rag.invoke("Give me an example.", config=config)   # graph remembers the first turn

Persistent memory (SQLite)

Survives restarts. Good for chatbots and long-running apps.

from retrievalagent import init_agent
from langgraph.checkpoint.sqlite import SqliteSaver

with SqliteSaver.from_conn_string("./memory.db") as checkpointer:
    rag = init_agent(
        "docs",
        model="openai:gpt-5.4",
        backend="qdrant",
        backend_url="http://localhost:6333",
        checkpointer=checkpointer,
    )

    config = {"configurable": {"thread_id": "user-alice"}}
    state = rag.invoke("What is hybrid search?", config=config)

Persistent memory (PostgreSQL)

Production-grade. Requires pip install langgraph-checkpoint-postgres.

from retrievalagent import init_agent
from langgraph.checkpoint.postgres import PostgresSaver

with PostgresSaver.from_conn_string(
    "postgresql://user:pass@localhost/mydb"
) as checkpointer:
    checkpointer.setup()   # creates the checkpoint tables on first run

    rag = init_agent(
        "docs",
        model="openai:gpt-5.4",
        backend="qdrant",
        checkpointer=checkpointer,
    )

    config = {"configurable": {"thread_id": "user-alice"}}
    state = rag.invoke("What is hybrid search?", config=config)

Thread IDs

Each thread_id is an independent memory scope. Use one per user, session, or conversation:

# Different users — separate memory
rag.invoke("Who are you?", config={"configurable": {"thread_id": "user-alice"}})
rag.invoke("Who are you?", config={"configurable": {"thread_id": "user-bob"}})

Search-knowledge memory (mem0)

mem0 is the long-term memory backend. In retrievalagent it stores corpus search facts — reusable term mappings (synonyms, aliases, brand spellings, common typos, informal-to-formal pairs) that improve future retrieval on the same corpus. It is not a user-preferences store.

What gets memorised: short, corpus-general mappings between terms a user typed and the surface form the matching documents used.

What does not get stored: trivial substring matches, single-document facts, anything specific to one user's identity or workflow.

Storage is gated by an LLM decision: the final-grade node, after reading the question / retrieved snippets / generated answer, returns a structured memory_worth_storing boolean and a memory_confidence score. Storage only fires when both memory_worth_storing=True AND memory_confidence >= memory_storage_threshold (default 0.85). This keeps the long-term store small and high-signal.
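The gate itself is just a conjunction. A sketch of the check described above (field names follow the prose, not the library's source):

def should_store(memory_worth_storing: bool, memory_confidence: float,
                 memory_storage_threshold: float = 0.85) -> bool:
    # Both the boolean and the confidence score must clear the bar.
    return memory_worth_storing and memory_confidence >= memory_storage_threshold

should_store(True, 0.92)   # True  -> the fact is written
should_store(True, 0.80)   # False -> confidence below the 0.85 default
should_store(False, 0.95)  # False -> grader said the fact is not worth storing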

How retrievalagent wires it in

The LangGraph state machine has two memory nodes — read_memory runs before retrieval, write_memory runs after generation. When you pass mem0_memory= to the agent, retrievalagent routes both nodes through mem0:

read_memory (before retrieval)
    mem0 call: search(question, filters={"user_id": ...})
    retrievalagent filters hits by relevance score; survivors land in state.memory_context.

write_memory (after generation)
    mem0 call: add(messages, user_id=...) once the answer is ready
    mem0's LLM extracts facts; conflicts get resolved.

Both nodes read user_id from the request config (config["configurable"]["user_id"]). If a request arrives without a user_id, retrievalagent falls back to "default".

Async-vs-sync detection happens automatically — pass AsyncMemory() and retrievalagent uses asearch/aadd directly; pass Memory() and retrievalagent runs the sync calls in a thread pool so the event loop stays free.
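A sketch of that dispatch plus the user_id fallback (illustrative only; attribute names follow the description above and may not match retrievalagent's internals or mem0's exact API):

import asyncio

async def memory_search(mem, question: str, config: dict):
    # user_id comes from the request config, falling back to "default"
    user_id = config.get("configurable", {}).get("user_id", "default")
    if hasattr(mem, "asearch"):
        # AsyncMemory: call the async API directly
        return await mem.asearch(question, user_id=user_id)
    # Sync Memory: push the blocking call onto a worker thread
    return await asyncio.to_thread(mem.search, question, user_id=user_id)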

Two thresholds

memory_relevance_threshold (env RAG_MEMORY_RELEVANCE_THRESHOLD, default 0.7)
    mem0 cosine score the recall must exceed before a stored fact reaches retrieval/generation.

memory_storage_threshold (env RAG_MEMORY_STORAGE_THRESHOLD, default 0.85)
    LLM memory_confidence the grader must report before a new fact is written.

rag = init_agent("docs", model="openai:gpt-5.4", memory=True)
# Override:
# export RAG_MEMORY_RELEVANCE_THRESHOLD=0.6
# export RAG_MEMORY_STORAGE_THRESHOLD=0.9

When a recalled memory clears the relevance gate, it is injected in two places:

  1. Retrieval — the flattened memory text becomes one extra BM25 term so the search picks up the synonymous surface form.
  2. Generation — the system prompt is prefixed with "Known search hints for this corpus (synonyms / aliases ...):\n<memories>" so the LLM can use the mapping to write the answer.

Vector search is not touched — embedding question + memories together would dilute the question vector. The split is intentional: BM25 carries lexical signal, the LLM carries the synonym hint.
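A minimal sketch of the two injection points (variable names are illustrative, not retrievalagent's internals):

def inject_memories(question: str, memory_context: str, system_prompt: str):
    # 1. Retrieval: memories become extra lexical terms for the BM25 query only;
    #    the vector query still embeds the original question on its own.
    bm25_query = f"{question} {memory_context}"

    # 2. Generation: memories are prefixed to the system prompt as search hints.
    hinted_prompt = (
        "Known search hints for this corpus (synonyms / aliases ...):\n"
        f"{memory_context}\n\n{system_prompt}"
    )
    return bm25_query, hinted_prompt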

Writes are fire-and-forget: the graph schedules the mem0.add(...) call as a background asyncio task and returns immediately, so the user-facing response is never blocked on memory I/O. Call await rag.adrain_background() before shutdown if you need the writes to land before the process exits.
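For example, an async entrypoint can drain right before exiting (a sketch using only the calls described above):

import asyncio

async def main():
    state = await rag.ainvoke("What is hybrid search?",
                              config={"configurable": {"user_id": "alice"}})
    # Flush any queued background mem0.add(...) writes before the loop closes.
    await rag.adrain_background()

asyncio.run(main())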

Inspecting what happened

state.memory_context is the exact text injected for this turn (empty when nothing cleared the relevance bar). state.memory_fact is the fact the grader chose to store on this turn (empty when the grader said no). state.trace carries shaped events:

# Recall fired:
{"node": "read_memory", "memories": "- <stored mapping>",
 "n_kept": 1, "n_scanned": 3, "threshold": 0.7}

# Recall scanned but nothing cleared the bar:
{"node": "read_memory", "skipped": "below_threshold",
 "threshold": 0.7, "best_score": 0.62, "n_scanned": 3}

# Final grade decided to store:
{"node": "final_grade", "sufficient": True, "confidence": 0.9,
 "memory_should_store": True, "memory_confidence": 0.92}

# Final grade decided NOT to store:
{"node": "final_grade", "sufficient": True, "confidence": 0.85,
 "memory_should_store": False, "memory_confidence": 0.0}

Tune by reading these — if best_score is consistently 0.65 for valid recalls, lower the relevance threshold; if too many trivial facts are getting stored, raise the storage threshold.
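A small helper for that kind of audit, assuming you keep the states from a batch of test questions (test_states is a hypothetical list of such states):

# Collect near-miss recall scores across several runs to decide whether the
# relevance threshold is set too high.
near_misses = [
    step["best_score"]
    for state in test_states
    for step in state.trace
    if step["node"] == "read_memory" and step.get("skipped") == "below_threshold"
]
if near_misses:
    print("best scores that missed the bar:", sorted(near_misses, reverse=True))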

Install

pip install mem0ai

Minimal example

from retrievalagent import init_agent
from mem0 import Memory  # or AsyncMemory

rag = init_agent(
    "docs",
    model="openai:gpt-5.4",
    backend="qdrant",
    backend_url="http://localhost:6333",
    mem0_memory=Memory(),
)

config = {"configurable": {"user_id": "alice"}}

rag.invoke("I prefer answers in German.", config=config)
rag.invoke("What is hybrid search?", config=config)
# mem0 extracted the language preference and recalled it on the second call.

Configuring mem0's own backends

Memory() defaults to OpenAI embeddings + an embedded vector DB. To point it at the same vector store you already use for retrieval, pass mem0 a config dict:

from mem0 import Memory

mem = Memory.from_config({
    "vector_store": {
        "provider": "qdrant",
        "config": {"host": "localhost", "port": 6333, "collection_name": "user_memories"},
    },
    "llm": {
        "provider": "openai",
        "config": {"model": "gpt-5.4-mini"},
    },
    "embedder": {
        "provider": "openai",
        "config": {"model": "text-embedding-3-small"},
    },
})

rag = init_agent("docs", model="openai:gpt-5.4", backend="qdrant", mem0_memory=mem)

See the mem0 docs for the full provider list (Anthropic, Azure OpenAI, Postgres + pgvector, Chroma, etc.).

Inspecting recalled memories

state.trace contains a read_memory entry whenever mem0 returned hits:

state = rag.invoke("What is hybrid search?", config={"configurable": {"user_id": "alice"}})
for step in state.trace:
    if step["node"] == "read_memory":
        print("Recalled facts:\n", step["memories"])

Manually scoping users

user_id partitions all memory ops. Two ways to set it:

# 1. Via the per-call config (most common)
rag.invoke(question, config={"configurable": {"user_id": "alice"}})

# 2. Or bind a default user_id to the call and reuse the bound callable
import functools
ainvoke = functools.partial(rag.ainvoke, config={"configurable": {"user_id": "alice"}})

Failure modes & gotchas

  • No mem0 LLM → mem0 still stores raw turns but skips fact extraction; quality matches memory_store (raw Q&A). Configure mem0's own LLM to get the deduplication benefit.
  • Slow first call → mem0's first call seeds embeddings; expect ~1–2 s extra latency on cold start. Subsequent calls are fast.
  • Async event loop already running (Jupyter, FastAPI handlers) → use AsyncMemory(). retrievalagent detects aadd/asearch and avoids the thread-pool round-trip.
  • mem0 errors are swallowed — retrievalagent's memory nodes catch exceptions silently so memory hiccups never break the QA path. Check state.trace for missing read_memory events when debugging.
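A quick check for that last point, using only fields the trace already exposes:

# If mem0 errored silently, no read_memory event appears in the trace.
read_events = [s for s in state.trace if s["node"] == "read_memory"]
if not read_events:
    print("read_memory never fired - check mem0 connectivity and configuration")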

Combining with checkpointer

mem0_memory= is orthogonal to checkpointer=. The checkpointer persists graph state per thread_id (resume a conversation); mem0 persists extracted facts per user_id (carry preferences across sessions). Pass both for full coverage.
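A combined setup might look like this (same parameters as the earlier examples; MemorySaver stands in for whichever checkpointer you use):

from retrievalagent import init_agent
from langgraph.checkpoint.memory import MemorySaver
from mem0 import Memory

rag = init_agent(
    "docs",
    model="openai:gpt-5.4",
    backend="qdrant",
    backend_url="http://localhost:6333",
    checkpointer=MemorySaver(),   # per-thread conversation state
    mem0_memory=Memory(),         # per-user extracted search facts
)

config = {"configurable": {"thread_id": "session-1", "user_id": "alice"}}
state = rag.invoke("What is hybrid search?", config=config)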


Long-term memory (memory_store)

Cross-thread memory that persists facts across conversations for the same user. After each answer the agent writes a Q&A summary; before each retrieval it reads relevant past exchanges and uses them as context.

from retrievalagent import init_agent
from langgraph.store.memory import InMemoryStore

rag = init_agent(
    "docs",
    model="openai:gpt-5.4",
    backend="qdrant",
    backend_url="http://localhost:6333",
    memory_store=InMemoryStore(),
)

# Scope memories to a user with user_id
config = {"configurable": {"user_id": "alice"}}

rag.invoke("I prefer answers in German.", config=config)
rag.invoke("What is hybrid search?", config=config)
# Second call remembers the language preference from the first

Combine both for full memory:

from retrievalagent import init_agent
from langgraph.checkpoint.sqlite import SqliteSaver
from langgraph.store.memory import InMemoryStore

with SqliteSaver.from_conn_string("./memory.db") as checkpointer:
    rag = init_agent(
        "docs",
        model="openai:gpt-5.4",
        backend="qdrant",
        checkpointer=checkpointer,
        memory_store=InMemoryStore(),
    )

    config = {"configurable": {"thread_id": "session-1", "user_id": "alice"}}
    state = rag.invoke("What is hybrid search?", config=config)

    # state.trace includes a 'read_memory' entry showing what was recalled
    for step in state.trace:
        if step["node"] == "read_memory":
            print("Recalled:", step.get("memories"))

For production, replace InMemoryStore with AsyncPostgresStore:

from langgraph.store.postgres import AsyncPostgresStore

async with AsyncPostgresStore.from_conn_string("postgresql://user:pass@localhost/mydb") as store:
    await store.setup()  # creates tables on first run

    rag = init_agent("docs", model="openai:gpt-5.4", memory_store=store)

When to use memory vs history

                    history= on chat()   checkpointer=           memory_store=              mem0_memory=
Scope               Single session       Per thread              Per user (cross-thread)    Per user (cross-thread)
What's stored       Answer text          Full graph state        Raw Q&A strings            Extracted facts
Deduplication       n/a                  n/a                     No                         Yes (LLM-based)
Survives restarts   No                   With SQLite/Postgres    With Postgres store        With mem0 store
Use case            Simple multi-turn    Resumable chatbots      Basic long-term context    Corpus search hints
Config key          (none)               thread_id               user_id                    user_id

Tip

Combine them for full coverage: history= for the current turn, checkpointer= to resume the thread, and memory_store= or mem0_memory= to carry facts across sessions.