Memory & Persistence
retrievalagent is built on LangGraph and supports two levels of memory:
| | checkpointer= | memory_store= |
|---|---|---|
| Scope | Per thread (conversation) | Per user (cross-thread) |
| What's stored | Full graph state | Key Q&A facts |
| Survives restarts | With SQLite/Postgres | With SQLite/Postgres |
| Use case | Resume a conversation | Remember user preferences across sessions |
For simple multi-turn chat within a single session, the history= parameter on chat() is enough — no config needed.
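A minimal sketch of that pattern, assuming chat() returns the answer text and history= accepts the earlier turns as a list of (question, answer) pairs (check the chat() reference for the exact shape it expects):

```python
from retrievalagent import init_agent

rag = init_agent(
    "docs",
    model="openai:gpt-5.4",
    backend="qdrant",
    backend_url="http://localhost:6333",
)

# Carry the earlier turns forward yourself; nothing is persisted anywhere.
history = []
answer = rag.chat("What is hybrid search?", history=history)
history.append(("What is hybrid search?", answer))
answer = rag.chat("Give me an example.", history=history)
```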
In-process memory (MemorySaver)
Lost on restart. Good for single-session apps or testing.
from retrievalagent import init_agent
from langgraph.checkpoint.memory import MemorySaver
rag = init_agent(
"docs",
model="openai:gpt-5.4",
backend="qdrant",
backend_url="http://localhost:6333",
checkpointer=MemorySaver(),
)
config = {"configurable": {"thread_id": "user-alice"}}
state = rag.invoke("What is hybrid search?", config=config)
state = rag.invoke("Give me an example.", config=config) # graph remembers the first turn
Persistent memory (SQLite)
Survives restarts. Good for chatbots and long-running apps.
from retrievalagent import init_agent
from langgraph.checkpoint.sqlite import SqliteSaver
with SqliteSaver.from_conn_string("./memory.db") as checkpointer:
rag = init_agent(
"docs",
model="openai:gpt-5.4",
backend="qdrant",
backend_url="http://localhost:6333",
checkpointer=checkpointer,
)
config = {"configurable": {"thread_id": "user-alice"}}
state = rag.invoke("What is hybrid search?", config=config)
Persistent memory (PostgreSQL)
Production-grade. Requires pip install langgraph-checkpoint-postgres.
from retrievalagent import init_agent
from langgraph.checkpoint.postgres import PostgresSaver
with PostgresSaver.from_conn_string("postgresql://user:pass@localhost/mydb") as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first run
    rag = init_agent(
        "docs",
        model="openai:gpt-5.4",
        backend="qdrant",
        checkpointer=checkpointer,
    )
    config = {"configurable": {"thread_id": "user-alice"}}
    state = rag.invoke("What is hybrid search?", config=config)
Thread IDs
Each thread_id is an independent memory scope. Use one per user, session, or conversation:
# Different users — separate memory
rag.invoke("Who are you?", config={"configurable": {"thread_id": "user-alice"}})
rag.invoke("Who are you?", config={"configurable": {"thread_id": "user-bob"}})
Search-knowledge memory (mem0)
mem0 is the long-term memory backend. In retrievalagent it stores corpus search facts — reusable term mappings (synonyms, aliases, brand spellings, common typos, informal-to-formal pairs) that improve future retrieval on the same corpus. It is not a user-preferences store.
- What gets memorised: short, corpus-general mappings between terms a user typed and the surface form the matching documents used.
- What does not get stored: trivial substring matches, single-document facts, anything specific to one user's identity or workflow.
Storage is gated by an LLM decision: the final-grade node, after
reading the question / retrieved snippets / generated answer, returns
a structured memory_worth_storing boolean and a memory_confidence
score. Storage only fires when both memory_worth_storing=True AND
memory_confidence >= memory_storage_threshold (default 0.85). This
keeps the long-term store small and high-signal.
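As an illustrative sketch (not the library's internal code), the gate amounts to:

```python
def should_store(memory_worth_storing: bool,
                 memory_confidence: float,
                 memory_storage_threshold: float = 0.85) -> bool:
    """Both conditions must hold before a fact is written to long-term memory."""
    return memory_worth_storing and memory_confidence >= memory_storage_threshold

should_store(True, 0.92)   # True  -> fact is written
should_store(True, 0.70)   # False -> below the threshold
should_store(False, 0.99)  # False -> grader said it is not worth storing
```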
How retrievalagent wires it in
The LangGraph state machine has two memory nodes — read_memory runs
before retrieval, write_memory runs after generation. When you pass
mem0_memory= to the agent, retrievalagent routes both nodes through mem0:
| Node | mem0 call | What retrievalagent does |
|---|---|---|
| read_memory | search(question, filters={"user_id": ...}) | Filters hits by relevance score; survivors land in state.memory_context. |
| write_memory | add(messages, user_id=...) after the answer | mem0's LLM extracts facts; conflicts get resolved. |
Both nodes read user_id from the request config
(config["configurable"]["user_id"]). If a request arrives without a
user_id, retrievalagent falls back to "default".
Async-vs-sync detection happens automatically — pass AsyncMemory()
and retrievalagent uses asearch/aadd directly; pass Memory() and retrievalagent
runs the sync calls in a thread pool so the event loop stays free.
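Conceptually the dispatch looks like this sketch (illustrative only, following the description above and assuming mem0's search(query, user_id=...) signature; not retrievalagent's actual internals):

```python
import asyncio

async def _recall(memory, question: str, user_id: str):
    # AsyncMemory exposes native async methods; call them directly
    if hasattr(memory, "asearch"):
        return await memory.asearch(question, user_id=user_id)
    # Sync Memory: run the blocking call on a worker thread so the event loop stays free
    return await asyncio.to_thread(memory.search, question, user_id=user_id)
```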
Two thresholds
| Threshold | Env | Default | Controls |
|---|---|---|---|
| memory_relevance_threshold | RAG_MEMORY_RELEVANCE_THRESHOLD | 0.7 | mem0 cosine score the recall must exceed before a stored fact reaches retrieval/generation. |
| memory_storage_threshold | RAG_MEMORY_STORAGE_THRESHOLD | 0.85 | LLM memory_confidence the grader must report before a new fact is written. |
rag = init_agent("docs", model="openai:gpt-5.4", memory=True)
# Override:
# export RAG_MEMORY_RELEVANCE_THRESHOLD=0.6
# export RAG_MEMORY_STORAGE_THRESHOLD=0.9
Recall, when something clears the relevance gate, is injected in two places:
- Retrieval — the flattened memory text becomes one extra BM25 term so the search picks up the synonymous surface form.
- Generation — the system prompt is prefixed with
"Known search hints for this corpus (synonyms / aliases ...):\n<memories>"so the LLM can use the mapping to write the answer.
Vector search is not touched — embedding question + memories together would dilute the question vector. The split is intentional: BM25 carries lexical signal, the LLM carries the synonym hint.
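A conceptual sketch of the generation-side half (the helper name is illustrative, not part of the API, and the hint wording is paraphrased from above):

```python
def with_memory_hints(system_prompt: str, memory_context: str) -> str:
    # Only prepend the hint block when a memory actually cleared the relevance gate
    if not memory_context:
        return system_prompt
    hint = ("Known search hints for this corpus (synonyms / aliases ...):\n"
            + memory_context)
    return hint + "\n\n" + system_prompt
```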
Writes are fire-and-forget: the graph schedules the mem0.add(...)
call as a background asyncio task and returns immediately, so the
user-facing response is never blocked on memory I/O. Call
await rag.adrain_background() before shutdown if you need the
writes to land before the process exits.
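For example, in an application shutdown hook (a sketch; adapt to your framework):

```python
import asyncio

async def shutdown(rag):
    # Flush any queued background mem0.add(...) tasks before the process exits
    await rag.adrain_background()

# e.g. asyncio.run(shutdown(rag)) from a CLI entry point,
# or await shutdown(rag) from an async lifespan/shutdown handler
```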
Inspecting what happened
state.memory_context is the exact text injected for this turn
(empty when nothing cleared the relevance bar). state.memory_fact
is the fact the grader chose to store on this turn (empty when the
grader said no). state.trace carries shaped events:
# Recall fired:
{"node": "read_memory", "memories": "- <stored mapping>",
"n_kept": 1, "n_scanned": 3, "threshold": 0.7}
# Recall scanned but nothing cleared the bar:
{"node": "read_memory", "skipped": "below_threshold",
"threshold": 0.7, "best_score": 0.62, "n_scanned": 3}
# Final grade decided to store:
{"node": "final_grade", "sufficient": True, "confidence": 0.9,
"memory_should_store": True, "memory_confidence": 0.92}
# Final grade decided NOT to store:
{"node": "final_grade", "sufficient": True, "confidence": 0.85,
"memory_should_store": False, "memory_confidence": 0.0}
Tune by reading these — if best_score is consistently 0.65 for
valid recalls, lower the relevance threshold; if too many trivial
facts are getting stored, raise the storage threshold.
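A small helper for that kind of audit (hypothetical, built only on the trace fields shown above):

```python
def recall_best_scores(states) -> list[float]:
    """Collect best_score from read_memory trace events across several runs."""
    scores = []
    for state in states:
        for step in state.trace:
            if step.get("node") == "read_memory" and "best_score" in step:
                scores.append(step["best_score"])
    return scores

# If most scores sit just below 0.7, consider lowering RAG_MEMORY_RELEVANCE_THRESHOLD.
```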
Install
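mem0 ships separately from retrievalagent. Assuming the standard PyPI distribution, pip install mem0ai pulls in the client (it imports as mem0).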
Minimal example
from retrievalagent import init_agent
from mem0 import Memory # or AsyncMemory
rag = init_agent(
"docs",
model="openai:gpt-5.4",
backend="qdrant",
backend_url="http://localhost:6333",
mem0_memory=Memory(),
)
config = {"configurable": {"user_id": "alice"}}
rag.invoke("I prefer answers in German.", config=config)
rag.invoke("What is hybrid search?", config=config)
# mem0 extracted the language preference and recalled it on the second call.
Configuring mem0's own backends
Memory() defaults to OpenAI embeddings + an embedded vector DB. To
point it at the same vector store you already use for retrieval, pass
mem0 a config dict:
from mem0 import Memory
mem = Memory.from_config({
"vector_store": {
"provider": "qdrant",
"config": {"host": "localhost", "port": 6333, "collection_name": "user_memories"},
},
"llm": {
"provider": "openai",
"config": {"model": "gpt-5.4-mini"},
},
"embedder": {
"provider": "openai",
"config": {"model": "text-embedding-3-small"},
},
})
rag = init_agent("docs", model="openai:gpt-5.4", backend="qdrant", mem0_memory=mem)
See the mem0 docs for the full provider list (Anthropic, Azure OpenAI, Postgres + pgvector, Chroma, etc.).
Inspecting recalled memories
state.trace contains a read_memory entry whenever mem0 returned hits:
state = rag.invoke("What is hybrid search?", config={"configurable": {"user_id": "alice"}})
for step in state.trace:
if step["node"] == "read_memory":
print("Recalled facts:\n", step["memories"])
Manually scoping users
user_id partitions all memory ops. Two ways to set it:
# 1. Via the per-call config (most common)
rag.invoke(question, config={"configurable": {"user_id": "alice"}})
# 2. Or bind a default into a reusable callable so mem0 always sees it
import functools
ainvoke = functools.partial(rag.ainvoke, config={"configurable": {"user_id": "alice"}})
Failure modes & gotchas
- No mem0 LLM → mem0 still stores raw turns but skips fact extraction; quality matches memory_store= (raw Q&A). Configure mem0's own LLM to get the deduplication benefit.
- Slow first call → mem0's first call seeds embeddings; expect ~1–2 s extra latency on cold start. Subsequent calls are fast.
- Async event loop already running (Jupyter, FastAPI handlers) → use AsyncMemory(). retrievalagent detects aadd/asearch and avoids the thread-pool round-trip.
- mem0 errors are swallowed — retrievalagent's memory nodes catch exceptions silently, so memory hiccups never break the QA path. Check state.trace for missing read_memory events when debugging (see the snippet after this list).
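A quick way to spot the last case, using only the state.trace fields documented above:

```python
state = rag.invoke(
    "What is hybrid search?",
    config={"configurable": {"user_id": "alice"}},
)
if not any(step["node"] == "read_memory" for step in state.trace):
    print("No read_memory event: mem0 recall errored or was never attempted.")
```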
Combining with checkpointer
mem0_memory= is orthogonal to checkpointer=. The checkpointer
persists graph state per thread_id (resume a conversation);
mem0 persists extracted facts per user_id (carry preferences
across sessions). Pass both for full coverage.
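For instance, combining the in-process checkpointer with mem0 (all pieces shown earlier on this page):

```python
from retrievalagent import init_agent
from langgraph.checkpoint.memory import MemorySaver
from mem0 import Memory

rag = init_agent(
    "docs",
    model="openai:gpt-5.4",
    backend="qdrant",
    backend_url="http://localhost:6333",
    checkpointer=MemorySaver(),
    mem0_memory=Memory(),
)

# thread_id scopes the checkpointer, user_id scopes mem0
config = {"configurable": {"thread_id": "session-1", "user_id": "alice"}}
rag.invoke("What is hybrid search?", config=config)
```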
Long-term memory (memory_store)
Cross-thread memory that persists facts across different conversations and users. After each answer the agent writes a Q&A summary; before each retrieval it reads relevant past exchanges and uses them as context.
from retrievalagent import init_agent
from langgraph.store.memory import InMemoryStore
rag = init_agent(
"docs",
model="openai:gpt-5.4",
backend="qdrant",
backend_url="http://localhost:6333",
memory_store=InMemoryStore(),
)
# Scope memories to a user with user_id
config = {"configurable": {"user_id": "alice"}}
rag.invoke("I prefer answers in German.", config=config)
rag.invoke("What is hybrid search?", config=config)
# Second call remembers the language preference from the first
Combine both for full memory:
from retrievalagent import init_agent
from langgraph.checkpoint.sqlite import SqliteSaver
from langgraph.store.memory import InMemoryStore
with SqliteSaver.from_conn_string("./memory.db") as checkpointer:
    rag = init_agent(
        "docs",
        model="openai:gpt-5.4",
        backend="qdrant",
        checkpointer=checkpointer,
        memory_store=InMemoryStore(),
    )
    config = {"configurable": {"thread_id": "session-1", "user_id": "alice"}}
    state = rag.invoke("What is hybrid search?", config=config)
    # state.trace includes a 'read_memory' entry showing what was recalled
    for step in state.trace:
        if step["node"] == "read_memory":
            print("Recalled:", step.get("memories"))
For production, replace InMemoryStore with AsyncPostgresStore:
from langgraph.store.postgres import AsyncPostgresStore
async with AsyncPostgresStore.from_conn_string("postgresql://user:pass@localhost/mydb") as store:
    await store.setup()  # creates tables on first run
    rag = init_agent("docs", model="openai:gpt-5.4", memory_store=store)
When to use memory vs history
| | history= on chat() | checkpointer= | memory_store= | mem0_memory= |
|---|---|---|---|---|
| Scope | Single session | Per thread | Per user (cross-thread) | Per user (cross-thread) |
| What's stored | Answer text | Full graph state | Raw Q&A strings | Extracted facts |
| Deduplication | — | — | No | Yes (LLM-based) |
| Survives restarts | No | With SQLite/Postgres | With Postgres store | With mem0 store |
| Use case | Simple multi-turn | Resumable chatbots | Basic long-term context | Smart user preferences |
| Config key | (none) | thread_id | user_id | user_id |
Tip
Combine all for full coverage: history= for the current turn, checkpointer= to resume the thread, mem0_memory= to recall extracted facts from previous sessions.