Question
When using memory.search() results with LLMs that support prompt caching (Anthropic, OpenAI), what's the recommended way to inject memories into the conversation?
The common pattern I see in most examples is injecting directly into the system prompt:
```python
relevant_memory = await memory.search(user_id=user_id, query=query)
system_prompt = f"""You are an assistant.
What you know about the user:
{relevant_memory}"""
```

Since search results differ per query, the system prompt diverges early: only the static prefix before the memory block gets a cache hit, which is often a small fraction of the total prompt tokens.
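One way to avoid that divergence is to keep the system prompt byte-identical across requests and move the per-query memories into a later message. A minimal sketch of the idea; `build_messages` and the message shapes here are illustrative, not a mem0 or provider API:

```python
from typing import Any

def build_messages(
    static_system_prompt: str, memories: list[str], user_query: str
) -> dict[str, Any]:
    """Keep the system prompt unchanged across requests so the provider's
    prefix cache can hit; inject per-query memories as a separate message."""
    memory_block = "\n".join(f"- {m}" for m in memories)
    return {
        "system": static_system_prompt,  # never changes -> cacheable prefix
        "messages": [
            {"role": "user", "content": f"Relevant memories:\n{memory_block}"},
            {"role": "user", "content": user_query},
        ],
    }
```

With this layout the cached prefix covers the entire system prompt, and only the short memory message varies per request.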
Approaches I've considered
- **Tool-based retrieval**: wrap `memory.search()` as a tool and let the LLM call it when needed. The system prompt stays static, so the cache is preserved.
- **Separate message block**: keep the system prompt static and put memories in a separate user/system message.
- **Two-tier**: static user profile in the system prompt plus dynamic recall via a tool.
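For the tool-based option, the wrapper is small. A hedged sketch, assuming an OpenAI-style function-calling schema; `FakeMemory`, `SEARCH_MEMORY_TOOL`, and `handle_tool_call` are hypothetical names standing in for a real mem0 `Memory` instance and your own glue code:

```python
import json

class FakeMemory:
    """Stub standing in for a mem0 Memory instance (illustrative only)."""
    def search(self, user_id: str, query: str) -> list[dict]:
        return [{"memory": f"(result for {query!r})"}]

# OpenAI-style tool schema wrapping memory.search(); the model decides
# when to call it, so the system prompt never has to change.
SEARCH_MEMORY_TOOL = {
    "type": "function",
    "function": {
        "name": "search_memory",
        "description": "Search stored memories about the current user.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

def handle_tool_call(memory: FakeMemory, user_id: str, arguments_json: str) -> str:
    """Run the search the model requested and return the results as a JSON
    string to feed back in a tool-result message."""
    args = json.loads(arguments_json)
    results = memory.search(user_id=user_id, query=args["query"])
    return json.dumps([r["memory"] for r in results])
```

The trade-off is an extra round trip (and extra output tokens for the tool call) whenever the model actually needs a memory lookup, which is part of what a cost/latency benchmark would need to capture.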
What I'd like to know
- Is there a recommended pattern from the mem0 team?
- Has anyone benchmarked the cost/latency difference between these approaches?
- Are there plans to support a tool-based retrieval mode natively?
Thanks!