Memory & Persistence
retrievalagent is built on LangGraph and supports two levels of memory:
| | checkpointer= | memory_store= |
|---|---|---|
| Scope | Per thread (conversation) | Per user (cross-thread) |
| What's stored | Full graph state | Key Q&A facts |
| Survives restarts | With SQLite/Postgres | With SQLite/Postgres |
| Use case | Resume a conversation | Remember user preferences across sessions |
For simple multi-turn chat within a single session, the history= parameter on chat() is enough — no config needed.
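A minimal sketch of that pattern, assuming chat() returns the answer text and history= accepts the earlier turns as a list of (question, answer) pairs (check the chat() reference for the exact shape it expects):

```python
from retrievalagent import init_agent

rag = init_agent(
    "docs",
    model="openai:gpt-5.4",
    backend="qdrant",
    backend_url="http://localhost:6333",
)

# Carry the earlier turns forward yourself; nothing is persisted anywhere.
history = []
answer = rag.chat("What is hybrid search?", history=history)
history.append(("What is hybrid search?", answer))
answer = rag.chat("Give me an example.", history=history)
```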
In-process memory (MemorySaver)
Lost on restart. Good for single-session apps or testing.
from retrievalagent import init_agent
from langgraph.checkpoint.memory import MemorySaver
rag = init_agent(
"docs",
model="openai:gpt-5.4",
backend="qdrant",
backend_url="http://localhost:6333",
checkpointer=MemorySaver(),
)
config = {"configurable": {"thread_id": "user-alice"}}
state = rag.invoke("What is hybrid search?", config=config)
state = rag.invoke("Give me an example.", config=config) # graph remembers the first turn
Persistent memory (SQLite)
Survives restarts. Good for chatbots and long-running apps.
from retrievalagent import init_agent
from langgraph.checkpoint.sqlite import SqliteSaver
with SqliteSaver.from_conn_string("./memory.db") as checkpointer:
rag = init_agent(
"docs",
model="openai:gpt-5.4",
backend="qdrant",
backend_url="http://localhost:6333",
checkpointer=checkpointer,
)
config = {"configurable": {"thread_id": "user-alice"}}
state = rag.invoke("What is hybrid search?", config=config)
Persistent memory (PostgreSQL)
Production-grade. Requires pip install langgraph-checkpoint-postgres.
from retrievalagent import init_agent
from langgraph.checkpoint.postgres import PostgresSaver
with PostgresSaver.from_conn_string("postgresql://user:pass@localhost/mydb") as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first run
    rag = init_agent(
        "docs",
        model="openai:gpt-5.4",
        backend="qdrant",
        checkpointer=checkpointer,
    )
    config = {"configurable": {"thread_id": "user-alice"}}
    state = rag.invoke("What is hybrid search?", config=config)
Thread IDs
Each thread_id is an independent memory scope. Use one per user, session, or conversation:
# Different users — separate memory
rag.invoke("Who are you?", config={"configurable": {"thread_id": "user-alice"}})
rag.invoke("Who are you?", config={"configurable": {"thread_id": "user-bob"}})
Search-knowledge memory (mem0)
mem0 is the long-term memory backend. In retrievalagent it stores corpus search facts — reusable term mappings (synonyms, aliases, brand spellings, common typos, informal-to-formal pairs) that improve future retrieval on the same corpus. It is not a user-preferences store.
- What gets memorised: short, corpus-general mappings between terms a user typed and the surface form the matching documents used.
- What does not get stored: trivial substring matches, single-document facts, anything specific to one user's identity or workflow.
Storage is gated by an LLM decision: the final-grade node, after
reading the question / retrieved snippets / generated answer, returns
a structured memory_worth_storing boolean and a memory_confidence
score. Storage only fires when both memory_worth_storing=True AND
memory_confidence >= memory_storage_threshold (default 0.85). This
keeps the long-term store small and high-signal.
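As an illustrative sketch (not the library's internal code), the gate amounts to:

```python
def should_store(memory_worth_storing: bool,
                 memory_confidence: float,
                 memory_storage_threshold: float = 0.85) -> bool:
    """Both conditions must hold before a fact is written to long-term memory."""
    return memory_worth_storing and memory_confidence >= memory_storage_threshold

should_store(True, 0.92)   # True  -> fact is written
should_store(True, 0.70)   # False -> below the threshold
should_store(False, 0.99)  # False -> grader said it is not worth storing
```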
How retrievalagent wires it in
The LangGraph state machine has two memory nodes — read_memory runs
before retrieval, write_memory runs after generation. When you pass
mem0_memory= to the agent, retrievalagent routes both nodes through mem0:
| Node | mem0 call | What retrievalagent does |
|---|---|---|
| read_memory | search(question, filters={"user_id": ...}) | Filters hits by relevance score; survivors land in state.memory_context. |
| write_memory | add(messages, user_id=...) after the answer | mem0's LLM extracts facts; conflicts get resolved. |
Both nodes read user_id from the request config
(config["configurable"]["user_id"]). If a request arrives without a
user_id, retrievalagent falls back to "default".
Async-vs-sync detection happens automatically — pass AsyncMemory()
and retrievalagent uses asearch/aadd directly; pass Memory() and retrievalagent
runs the sync calls in a thread pool so the event loop stays free.
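Conceptually the dispatch looks like this sketch (illustrative only, following the description above and assuming mem0's search(query, user_id=...) signature; not retrievalagent's actual internals):

```python
import asyncio

async def _recall(memory, question: str, user_id: str):
    # AsyncMemory exposes native async methods; call them directly
    if hasattr(memory, "asearch"):
        return await memory.asearch(question, user_id=user_id)
    # Sync Memory: run the blocking call on a worker thread so the event loop stays free
    return await asyncio.to_thread(memory.search, question, user_id=user_id)
```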
Two thresholds
| Threshold | Env | Default | Controls |
|---|---|---|---|
| memory_relevance_threshold | RAG_MEMORY_RELEVANCE_THRESHOLD | 0.7 | mem0 cosine score the recall must exceed before a stored fact reaches retrieval/generation. |
| memory_storage_threshold | RAG_MEMORY_STORAGE_THRESHOLD | 0.85 | LLM memory_confidence the grader must report before a new fact is written. |
rag = init_agent("docs", model="openai:gpt-5.4", memory=True)
# Override:
# export RAG_MEMORY_RELEVANCE_THRESHOLD=0.6
# export RAG_MEMORY_STORAGE_THRESHOLD=0.9
Recall, when something clears the relevance gate, is injected in two places:
- Retrieval — the flattened memory text becomes one extra BM25 term so the search picks up the synonymous surface form.
- Generation — the system prompt is prefixed with
"Known search hints for this corpus (synonyms / aliases ...):\n<memories>"so the LLM can use the mapping to write the answer.
Vector search is not touched — embedding question + memories together would dilute the question vector. The split is intentional: BM25 carries lexical signal, the LLM carries the synonym hint.
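A conceptual sketch of the generation-side half (the helper name is illustrative, not part of the API, and the hint wording is paraphrased from above):

```python
def with_memory_hints(system_prompt: str, memory_context: str) -> str:
    # Only prepend the hint block when a memory actually cleared the relevance gate
    if not memory_context:
        return system_prompt
    hint = ("Known search hints for this corpus (synonyms / aliases ...):\n"
            + memory_context)
    return hint + "\n\n" + system_prompt
```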
Writes are fire-and-forget: the graph schedules the mem0.add(...)
call as a background asyncio task and returns immediately, so the
user-facing response is never blocked on memory I/O. Call
await rag.adrain_background() before shutdown if you need the
writes to land before the process exits.
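For example, in an application shutdown hook (a sketch; adapt to your framework):

```python
import asyncio

async def shutdown(rag):
    # Flush any queued background mem0.add(...) tasks before the process exits
    await rag.adrain_background()

# e.g. asyncio.run(shutdown(rag)) from a CLI entry point,
# or await shutdown(rag) from an async lifespan/shutdown handler
```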
Inspecting what happened
state.memory_context is the exact text injected for this turn
(empty when nothing cleared the relevance bar). state.memory_fact
is the fact the grader chose to store on this turn (empty when the
grader said no). state.trace carries shaped events:
# Recall fired:
{"node": "read_memory", "memories": "- <stored mapping>",
"n_kept": 1, "n_scanned": 3, "threshold": 0.7}
# Recall scanned but nothing cleared the bar:
{"node": "read_memory", "skipped": "below_threshold",
"threshold": 0.7, "best_score": 0.62, "n_scanned": 3}
# Final grade decided to store:
{"node": "final_grade", "sufficient": True, "confidence": 0.9,
"memory_should_store": True, "memory_confidence": 0.92}
# Final grade decided NOT to store:
{"node": "final_grade", "sufficient": True, "confidence": 0.85,
"memory_should_store": False, "memory_confidence": 0.0}
Tune by reading these — if best_score is consistently 0.65 for
valid recalls, lower the relevance threshold; if too many trivial
facts are getting stored, raise the storage threshold.
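A small helper for that kind of audit (hypothetical, built only on the trace fields shown above):

```python
def recall_best_scores(states) -> list[float]:
    """Collect best_score from read_memory trace events across several runs."""
    scores = []
    for state in states:
        for step in state.trace:
            if step.get("node") == "read_memory" and "best_score" in step:
                scores.append(step["best_score"])
    return scores

# If most scores sit just below 0.7, consider lowering RAG_MEMORY_RELEVANCE_THRESHOLD.
```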
Install
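mem0 ships separately from retrievalagent. Assuming the standard PyPI distribution, pip install mem0ai pulls in the client (it imports as mem0).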
Minimal example
from retrievalagent import init_agent
from mem0 import Memory # or AsyncMemory
rag = init_agent(
"docs",
model="openai:gpt-5.4",
backend="qdrant",
backend_url="http://localhost:6333",
mem0_memory=Memory(),
)
config = {"configurable": {"user_id": "alice"}}
rag.invoke("I prefer answers in German.", config=config)
rag.invoke("What is hybrid search?", config=config)
# mem0 extracted the language preference and recalled it on the second call.
Configuring mem0's own backends
Memory() defaults to OpenAI embeddings + an embedded vector DB. To
point it at the same vector store you already use for retrieval, pass
mem0 a config dict:
from mem0 import Memory
mem = Memory.from_config({
"vector_store": {
"provider": "qdrant",
"config": {"host": "localhost", "port": 6333, "collection_name": "user_memories"},
},
"llm": {
"provider": "openai",
"config": {"model": "gpt-5.4-mini"},
},
"embedder": {
"provider": "openai",
"config": {"model": "text-embedding-3-small"},
},
})
rag = init_agent("docs", model="openai:gpt-5.4", backend="qdrant", mem0_memory=mem)
See the mem0 docs for the full provider list (Anthropic, Azure OpenAI, Postgres + pgvector, Chroma, etc.).
Inspecting recalled memories
state.trace contains a read_memory entry whenever mem0 returned hits:
state = rag.invoke("What is hybrid search?", config={"configurable": {"user_id": "alice"}})
for step in state.trace:
if step["node"] == "read_memory":
print("Recalled facts:\n", step["memories"])
Manually scoping users
user_id partitions all memory ops. Two ways to set it:
# 1. Via the per-call config (most common)
rag.invoke(question, config={"configurable": {"user_id": "alice"}})
# 2. Or bind a default into a reusable callable so mem0 always sees it
import functools
ainvoke = functools.partial(rag.ainvoke, config={"configurable": {"user_id": "alice"}})
Failure modes & gotchas
- No mem0 LLM → mem0 still stores raw turns but skips fact extraction; quality matches memory_store= (raw Q&A). Configure mem0's own LLM to get the deduplication benefit.
- Slow first call → mem0's first call seeds embeddings; expect ~1–2 s extra latency on cold start. Subsequent calls are fast.
- Async event loop already running (Jupyter, FastAPI handlers) → use AsyncMemory(). retrievalagent detects aadd/asearch and avoids the thread-pool round-trip.
- mem0 errors are swallowed — retrievalagent's memory nodes catch exceptions silently, so memory hiccups never break the QA path. Check state.trace for missing read_memory events when debugging (see the snippet after this list).
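A quick way to spot the last case, using only the state.trace fields documented above:

```python
state = rag.invoke(
    "What is hybrid search?",
    config={"configurable": {"user_id": "alice"}},
)
if not any(step["node"] == "read_memory" for step in state.trace):
    print("No read_memory event: mem0 recall errored or was never attempted.")
```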
Combining with checkpointer
mem0_memory= is orthogonal to checkpointer=. The checkpointer
persists graph state per thread_id (resume a conversation);
mem0 persists extracted facts per user_id (carry preferences
across sessions). Pass both for full coverage.
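For instance, combining the in-process checkpointer with mem0 (all pieces shown earlier on this page):

```python
from retrievalagent import init_agent
from langgraph.checkpoint.memory import MemorySaver
from mem0 import Memory

rag = init_agent(
    "docs",
    model="openai:gpt-5.4",
    backend="qdrant",
    backend_url="http://localhost:6333",
    checkpointer=MemorySaver(),
    mem0_memory=Memory(),
)

# thread_id scopes the checkpointer, user_id scopes mem0
config = {"configurable": {"thread_id": "session-1", "user_id": "alice"}}
rag.invoke("What is hybrid search?", config=config)
```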
Long-term memory (memory_store)
Cross-thread memory that persists facts across different conversations and users. After each answer the agent writes a Q&A summary; before each retrieval it reads relevant past exchanges and uses them as context.
from retrievalagent import init_agent
from langgraph.store.memory import InMemoryStore
rag = init_agent(
"docs",
model="openai:gpt-5.4",
backend="qdrant",
backend_url="http://localhost:6333",
memory_store=InMemoryStore(),
)
# Scope memories to a user with user_id
config = {"configurable": {"user_id": "alice"}}
rag.invoke("I prefer answers in German.", config=config)
rag.invoke("What is hybrid search?", config=config)
# Second call remembers the language preference from the first
Combine both for full memory:
from retrievalagent import init_agent
from langgraph.checkpoint.sqlite import SqliteSaver
from langgraph.store.memory import InMemoryStore
with SqliteSaver.from_conn_string("./memory.db") as checkpointer:
    rag = init_agent(
        "docs",
        model="openai:gpt-5.4",
        backend="qdrant",
        checkpointer=checkpointer,
        memory_store=InMemoryStore(),
    )
    config = {"configurable": {"thread_id": "session-1", "user_id": "alice"}}
    state = rag.invoke("What is hybrid search?", config=config)
    # state.trace includes a 'read_memory' entry showing what was recalled
    for step in state.trace:
        if step["node"] == "read_memory":
            print("Recalled:", step.get("memories"))
For production, replace InMemoryStore with AsyncPostgresStore:
from langgraph.store.postgres import AsyncPostgresStore
async with AsyncPostgresStore.from_conn_string("postgresql://user:pass@localhost/mydb") as store:
    await store.setup()  # creates tables on first run
    rag = init_agent("docs", model="openai:gpt-5.4", memory_store=store)
When to use memory vs history
| | history= on chat() | checkpointer= | memory_store= | mem0_memory= |
|---|---|---|---|---|
| Scope | Single session | Per thread | Per user (cross-thread) | Per user (cross-thread) |
| What's stored | Answer text | Full graph state | Raw Q&A strings | Extracted facts |
| Deduplication | — | — | No | Yes (LLM-based) |
| Survives restarts | No | With SQLite/Postgres | With Postgres store | With mem0 store |
| Use case | Simple multi-turn | Resumable chatbots | Basic long-term context | Smart user preferences |
| Config key | (none) | thread_id | user_id | user_id |
Tip
Combine all for full coverage: history= for the current turn, checkpointer= to resume the thread, mem0_memory= to recall extracted facts from previous sessions.