How should retrieved memories be injected to preserve prompt caching? #4341

@Harlockius

Question

When using memory.search() results with LLMs that support prompt caching (Anthropic, OpenAI), what's the recommended way to inject memories into the conversation?

The common pattern I see in most examples is to inject the results directly into the system prompt:

from mem0 import AsyncMemory

memory = AsyncMemory()  # or AsyncMemoryClient() when using the hosted platform

relevant_memory = await memory.search(user_id=user_id, query=query)
system_prompt = f"""You are an assistant.
What you know about the user:
{relevant_memory}"""

Since search results differ per query, the system prompt diverges almost immediately. Only the static prefix before the memory block can get a cache hit, and that prefix is often a small fraction of the total prompt tokens.
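
For reference, Anthropic's explicit caching reuses the exact prefix up to a cache_control breakpoint (and only above a minimum prompt size, roughly 1024 tokens for most models), while OpenAI caches matching prefixes of 1024+ tokens automatically. Below is a minimal sketch, assuming the anthropic Python SDK, of what this split looks like when the breakpoint sits on the static instructions and the per-query memory block comes after it; the model name and the memory_text formatting are placeholders:

import anthropic

client = anthropic.Anthropic()

def build_request(static_instructions: str, memory_text: str, user_message: str) -> dict:
    # Only the prefix up to (and including) the block marked with cache_control
    # is cacheable; the memory block after it changes per query, so it never
    # contributes to the cached prefix.
    return {
        "model": "claude-3-5-sonnet-latest",  # illustrative model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": static_instructions,
                "cache_control": {"type": "ephemeral"},
            },
            {"type": "text", "text": f"What you know about the user:\n{memory_text}"},
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

response = client.messages.create(
    **build_request("You are an assistant.", str(relevant_memory), query)
)

If the static instructions are shorter than the model's minimum cacheable length, nothing gets cached at all, which is part of why the pattern above often sees few or no cache hits.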

Approaches I've considered

  1. Tool-based retrieval: wrap memory.search() as a tool and let the LLM call it when needed. The system prompt stays static, so the cache is preserved (see the sketch after this list).
  2. Separate message block: keep the system prompt static and put memories in a separate user/system message (similar in spirit to the split shown above).
  3. Two-tier: static user profile in the system prompt plus dynamic recall via a tool.
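
A minimal sketch of option 1, under some assumptions: the local AsyncMemory client from mem0 with its default config (swap in AsyncMemoryClient for the hosted platform), the OpenAI chat-completions API with function tools, and a hypothetical tool name search_memory. The shape of memory.search() results varies between mem0 versions, so they are simply JSON-dumped into the tool result here:

import json

from mem0 import AsyncMemory
from openai import AsyncOpenAI

memory = AsyncMemory()
llm = AsyncOpenAI()

SEARCH_MEMORY_TOOL = {
    "type": "function",
    "function": {
        "name": "search_memory",
        "description": "Look up facts previously stored about the current user.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "What to look up about the user"}
            },
            "required": ["query"],
        },
    },
}

SYSTEM_PROMPT = "You are an assistant. Call search_memory when you need context about the user."

async def answer(user_id: str, user_message: str) -> str:
    # The system prompt and tool schema are byte-identical on every request,
    # so the cacheable prefix stays stable across turns and across queries.
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]
    first = await llm.chat.completions.create(
        model="gpt-4o", messages=messages, tools=[SEARCH_MEMORY_TOOL]
    )
    msg = first.choices[0].message
    if not msg.tool_calls:
        return msg.content

    messages.append(msg)  # keep the assistant turn that contains the tool call(s)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        hits = await memory.search(query=args["query"], user_id=user_id)
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(hits, default=str)}
        )

    second = await llm.chat.completions.create(
        model="gpt-4o", messages=messages, tools=[SEARCH_MEMORY_TOOL]
    )
    return second.choices[0].message.content

The trade-off is an extra round trip whenever the model decides to retrieve (and it may skip retrieval when it would have helped), but the cached prefix covers the entire static system prompt plus the tool schema rather than only whatever happens to precede the memory block.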

What I'd like to know

  • Is there a recommended pattern from the mem0 team?
  • Has anyone benchmarked the cost/latency difference between these approaches?
  • Are there plans to support a tool-based retrieval mode natively?

Thanks!
