A-Evolve's evolution algorithm analyzes per-claim feedback, stratifies performance by task type, mines judge justifications for root causes, and uses meta-learning to make surgical, evidence-based mutations that target the specific weaknesses that matter most.
The agent workspace (prompts, skills, memory) is a directory on disk. The evolution engine reads observation logs from solve cycles, analyzes them across multiple layers, and mutates workspace files to improve the agent. Every accepted mutation is git-tagged for full reproducibility and rollback.
Each cycle runs eight phases:
```
Observation Logs (from solve cycles)
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Phase 1 — Base Analysis │
│ Aggregate pass/fail rates, tool error counts, │
│ hallucinated tool names, strategy issues. │
├─────────────────────────────────────────────────────────────┤
│ Phase 2 — Code Execution Analysis │
│ How often did the agent use execute_code? │
│ Pass rate with vs. without code. Missed opportunities. │
├─────────────────────────────────────────────────────────────┤
│ Phase 3 — Adaptive Analysis │
│ Per-claim breakdown: which requirement types fail? │
│ Per-task-type breakdown: which task categories are weak? │
│ Judge feedback mining: what do judges repeatedly say? │
│ Failure pattern detection: systematic issues. │
├─────────────────────────────────────────────────────────────┤
│ Phase 4 — Auto-Corrections │
│ Fix hallucinated tool names in skills/prompts. │
│ Prune memory to cap (default: 15 entries). │
├─────────────────────────────────────────────────────────────┤
│ Phase 5 — Auto-Seed Skills │
│ If failure patterns cross thresholds, inject targeted │
│ skills immediately (before the LLM runs). │
├─────────────────────────────────────────────────────────────┤
│ Phase 6 — LLM-Driven Evolution │
│ Evolver LLM receives the full analysis + evolution │
│ history. It reads/writes workspace files via bash tool. │
│ Graduated scope: intensity scales with performance. │
├─────────────────────────────────────────────────────────────┤
│ Phase 7 — Sanity Check │
│ Deterministic post-mutation fixes (prompt length, │
│ skill format, broken references). │
├─────────────────────────────────────────────────────────────┤
│ Phase 8 — Meta-Evolution │
│ Record what changed and its impact. Detect stagnation. │
│ Roll back to best-known state if no improvement for │
│ N cycles. │
└─────────────────────────────────────────────────────────────┘
│
▼
                  Mutated Workspace → Agent reloads → Next solve cycle
```
Suppose the agent scores 68% on a batch. The analysis reveals:
Claim-type breakdown:
- `provide_fact` — 85% pass rate ✓ strong
- `calculate` — 40% pass rate ✗ weak
- `entity_property` — 90% pass rate ✓ strong

Task-type breakdown:
- `single_fact` — 90% ✓
- `multi_requirement` — 55% ✗ weak
- `search_iteration` — 70% ~

Failure patterns detected:
- `multi_requirement_miss` — 5 tasks (score ~0.5, agent drops 2nd requirement)
- `wrong_entity_targeting` — 3 tasks (score 0.0, agent answered about wrong entity)

Judge feedback:
- `missing_data` — 8 occurrences ("agent did not provide the calculated difference")
- `wrong_entity` — 3 occurrences ("agent returned data for a different repository")
The engine responds with:
- Auto-seeds a `multi-requirement-handler` skill (threshold: ≥3 multi-req misses) and an `entity-verification` skill (threshold: ≥2 wrong-entity failures).
- Auto-seeds a `calculate-handler` skill because the `calculate` claim type is below 50%.
- Graduated scope selects "targeted" intensity (68% < 70%) → both prompt and skills are eligible for LLM mutation.
- Evolver LLM receives the full analysis, sees the evolution history ("last cycle we added X and it helped by +3%"), and makes targeted edits.
- Meta-evolution records the changes. If the next 5 cycles show no improvement, the engine rolls back to the best git-tagged state.
Instead of just pass/fail per task, the analyzer breaks benchmark feedback into individual claims and classifies each one:
| Claim Type | Keywords | Example |
|---|---|---|
| `provide_fact` | provide, what is, get, return | "What is the creation date of repo X?" |
| `calculate` | difference, sum, calculate, how many | "How many years between X and Y?" |
| `compare` | compare, difference between, versus | "Compare stars of repo A vs repo B" |
| `aggregate` | total, all, list all, every | "List all open issues" |
| `identify_entity` | identify, find, which, who | "Which user created this repo?" |
| `entity_property` | status, date, name, owner | "What is the status of issue #42?" |
| `chain` | then, after, using, next | "Get X, then use it to find Y" |
| `conditional` | if, when, where, in case | "If repo is public, get its stars" |
Each claim type gets its own pass rate. The weakest types drive evolution priorities.
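To make this concrete, here is a minimal sketch of keyword-based claim classification and per-type pass-rate tracking. The keyword lists mirror the table above; `classify_claim`, `pass_rates_by_type`, and the data shapes are illustrative assumptions, not the library's actual API.

```python
from collections import defaultdict

# Keyword lists mirror the table above. Insertion order matters:
# more specific types (chain, conditional) are checked before
# generic ones (provide_fact).
CLAIM_KEYWORDS = {
    "chain": ["then", "after", "using", "next"],
    "conditional": ["if", "when", "where", "in case"],
    "calculate": ["difference", "sum", "calculate", "how many"],
    "compare": ["compare", "difference between", "versus"],
    "aggregate": ["total", "all", "list all", "every"],
    "identify_entity": ["identify", "find", "which", "who"],
    "entity_property": ["status", "date", "name", "owner"],
    "provide_fact": ["provide", "what is", "get", "return"],
}

def classify_claim(text: str) -> str:
    """Return the first claim type whose keywords appear in the text."""
    lowered = text.lower()
    for claim_type, keywords in CLAIM_KEYWORDS.items():
        if any(kw in lowered for kw in keywords):
            return claim_type
    return "provide_fact"  # fallback for unmatched claims

def pass_rates_by_type(claims):
    """claims: iterable of (claim_text, passed) pairs."""
    totals, passes = defaultdict(int), defaultdict(int)
    for text, passed in claims:
        ctype = classify_claim(text)
        totals[ctype] += 1
        passes[ctype] += int(passed)
    return {t: passes[t] / totals[t] for t in totals}
```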
Tasks are classified by input text patterns so the engine knows which categories need help:
| Task Type | Signals | Complexity |
|---|---|---|
| `single_fact` | "what is", "when was", "who is" | Low |
| `multi_requirement` | " and ", " also ", bullet points | Medium |
| `search_iteration` | "find", "search for", "all", "list" | Medium-High |
| `comparison` | "compare", "difference between", "vs" | Medium |
| `action` | "create", "update", "delete", "send" | Medium |
| `calculation` | "calculate", "compute", "sum", "total" | Low-Medium |
LLM judge justifications are scanned for recurring patterns:
- `missing_data` — "not mention", "missing", "does not include"
- `wrong_entity` — "wrong", "different", "incorrect", "not the"
- `partial_answer` — "partial", "incomplete", "only provides"
- `calculation_error` — "incorrect calculation", "wrong number"
- `wrong_format` — "format", "structure", "not formatted"
High-frequency patterns become explicit evolution targets.
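A minimal sketch of the mining step, assuming the phrase lists above and simple substring counting (the function and dictionary names are illustrative):

```python
from collections import Counter

# Phrase lists mirror the bullets above.
FEEDBACK_PATTERNS = {
    "missing_data": ["not mention", "missing", "does not include"],
    "wrong_entity": ["wrong", "different", "incorrect", "not the"],
    "partial_answer": ["partial", "incomplete", "only provides"],
    "calculation_error": ["incorrect calculation", "wrong number"],
    "wrong_format": ["format", "structure", "not formatted"],
}

def mine_judge_feedback(justifications):
    """Count how often each failure pattern appears across judge texts."""
    counts = Counter()
    for text in justifications:
        lowered = text.lower()
        for pattern, phrases in FEEDBACK_PATTERNS.items():
            if any(p in lowered for p in phrases):
                counts[pattern] += 1
    return counts.most_common()  # high-frequency patterns first
```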
The engine detects four systematic failure patterns:
| Pattern | Signal | Suggested Fix |
|---|---|---|
| `multi_requirement_miss` | Score ~0.5 on tasks with "and"/"also" | Structured requirement extraction protocol |
| `wrong_entity_targeting` | Score 0.0 despite long output | Early entity verification checkpoint |
| `near_miss` | Score ~0.7 (most claims pass, one missed) | Strengthen final verification |
| `missed_code_opportunity` | 15+ tool calls, no code execution, failed | Lower code execution threshold |
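As a rough sketch, detection might look like the following; the exact thresholds and record fields are assumptions based on the signals in the table:

```python
def detect_failure_patterns(task_results):
    """task_results: list of dicts with 'text', 'score', 'output',
    'tool_calls', 'used_code'. Thresholds approximate the table above."""
    patterns = []
    for t in task_results:
        text = t["text"].lower()
        multi = " and " in text or " also " in text
        if multi and 0.4 <= t["score"] <= 0.6:
            patterns.append(("multi_requirement_miss", t))
        elif t["score"] == 0.0 and len(t.get("output", "")) > 500:
            patterns.append(("wrong_entity_targeting", t))
        elif 0.65 <= t["score"] < 1.0:
            patterns.append(("near_miss", t))
        if t["tool_calls"] >= 15 and not t["used_code"] and t["score"] < 1.0:
            patterns.append(("missed_code_opportunity", t))
    return patterns
```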
When patterns cross thresholds, skills are injected before the LLM runs:
- ≥3 multi-requirement misses → seeds `multi-requirement-handler` (requirement extraction + per-requirement tracking protocol)
- ≥2 wrong-entity failures → seeds `entity-verification` (mandatory identity checkpoint after first tool call)
- Claim type below 50% pass rate → seeds `{type}-handler` with type-specific guidance (e.g., `calculate-handler` teaches the agent to show explicit calculations); see the sketch below
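A sketch of this seeding logic, assuming skills live at `skills/<name>/SKILL.md` inside the workspace directory; the skill bodies and function signature are illustrative:

```python
from pathlib import Path

def auto_seed_skills(workspace: Path, pattern_counts, claim_pass_rates):
    """pattern_counts: failure-pattern name -> occurrence count.
    claim_pass_rates: claim type -> pass rate. Thresholds from the list above."""
    def seed(name: str, body: str):
        skill_dir = workspace / "skills" / name
        skill_dir.mkdir(parents=True, exist_ok=True)
        (skill_dir / "SKILL.md").write_text(body)

    if pattern_counts.get("multi_requirement_miss", 0) >= 3:
        seed("multi-requirement-handler",
             "# Multi-requirement handler\nExtract every requirement, "
             "track each one, and verify all are answered.\n")
    if pattern_counts.get("wrong_entity_targeting", 0) >= 2:
        seed("entity-verification",
             "# Entity verification\nAfter the first tool call, confirm "
             "the entity matches the task before proceeding.\n")
    for claim_type, rate in claim_pass_rates.items():
        if rate < 0.50:
            seed(f"{claim_type}-handler",
                 f"# {claim_type} handler\nType-specific guidance for "
                 f"'{claim_type}' claims (e.g. show explicit calculations).\n")
```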
The engine scales mutation intensity to current performance:
| Pass Rate | Intensity | What Changes |
|---|---|---|
| ≥90% + stable (2+ cycles) | None | Skip LLM evolution entirely |
| ≥90% (unstable) | Minimal | Skills only |
| ≥85% | Minimal | Skills only if a specific claim type is below 60% |
| ≥70% | Targeted | Skills + prompt (only if failure patterns require it) |
| <70% | Comprehensive | Prompt + skills + memory |
This prevents over-mutation when the agent is already performing well.
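A direct transcription of the table into a selection function might look like this (the function name and parameters are illustrative):

```python
def select_scope(pass_rate: float, stable: bool, weak_claim_type: bool) -> str:
    """Map batch pass rate to mutation intensity, mirroring the table above.
    `stable` means >=90% held for 2+ cycles; `weak_claim_type` means some
    claim type's pass rate is below 60%."""
    if pass_rate >= 0.90:
        return "none" if stable else "minimal"           # skip, or skills only
    if pass_rate >= 0.85:
        return "minimal" if weak_claim_type else "none"  # skills only if needed
    if pass_rate >= 0.70:
        return "targeted"        # skills + prompt
    return "comprehensive"       # prompt + skills + memory
```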
The engine maintains a rolling history of the last 10 evolution cycles, recording what changed and its impact. This history is fed to the evolver LLM so it can:
- Build on changes that improved performance
- Avoid repeating changes that hurt performance
- Understand the trajectory of improvement
If no improvement ≥ `improvement_threshold` (default: 2%) occurs for `stagnation_window` cycles (default: 5), and either:
- The agent has degraded >5% from its best, or
- The best score was below 90%

…the engine rolls back to the best-known git-tagged state and resets the counter.
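A rough sketch of that rollback condition; the exact bookkeeping and git invocation are assumptions:

```python
import subprocess

def maybe_rollback(history, best_tag: str, threshold=0.02, window=5):
    """history: per-cycle pass rates, oldest first. Roll back when the last
    `window` cycles improved on the prior best by less than `threshold`,
    and either the agent degraded >5% or the best score was below 90%."""
    if len(history) <= window:
        return False
    prior_best = max(history[:-window])
    stagnant = max(history[-window:]) < prior_best + threshold
    degraded = history[-1] < prior_best - 0.05
    if stagnant and (degraded or prior_best < 0.90):
        # Restore workspace files from the best-known git tag.
        subprocess.run(["git", "checkout", best_tag, "--", "."], check=True)
        return True
    return False
```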
The solver agent has access to an `execute_code` tool — a sandboxed Python environment that can call any MCP tool programmatically via `call_tool(name, args)`. This is central to how the agent handles complex tasks, and the evolution engine actively monitors and optimizes its usage.

`execute_code` accepts a code string and runs it in a restricted Python sandbox. Inside the sandbox the agent can:
- Call any MCP tool via `call_tool(name, args)` → returns a string result
- Use `print()` to return output to the LLM context
- Import a whitelisted set of modules: `json`, `re`, `math`, `datetime`, `collections`, `itertools`, `functools`
The sandbox enforces a 200-call safety limit and a 120-second timeout. Output is truncated to 8,000 characters.
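An illustrative approximation of the sandbox contract, not the real implementation (which also restricts builtins and enforces the 120-second timeout, something plain `exec` does not do):

```python
import io, contextlib, json, re, math, datetime, collections, itertools, functools

MAX_CALLS, MAX_OUTPUT = 200, 8000  # safety limit and output truncation

def run_sandboxed(code: str, call_tool_impl):
    """Execute agent code with a capped call_tool and captured print output."""
    calls = 0
    def call_tool(name: str, args: dict) -> str:
        nonlocal calls
        calls += 1
        if calls > MAX_CALLS:
            raise RuntimeError("call_tool safety limit (200) exceeded")
        return call_tool_impl(name, args)

    # Whitelisted modules plus call_tool are the only names injected.
    env = {"call_tool": call_tool, "json": json, "re": re, "math": math,
           "datetime": datetime, "collections": collections,
           "itertools": itertools, "functools": functools}
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, env)  # real sandbox also enforces the 120 s timeout
    return buf.getvalue()[:MAX_OUTPUT]  # truncated to 8,000 characters
```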
Without `execute_code`, every tool call flows through the LLM context window — the agent calls a tool, reads the result, reasons about it, calls the next tool, and so on. For tasks that require searching, iterating, or chaining many calls, this is slow and eats context.
With `execute_code`, the agent writes a Python loop that calls tools directly. Intermediate results stay in the execution environment and never enter the LLM context. Only the final `print()` output comes back.
Without `execute_code` (15 round-trips):

```
LLM → tool_call → result → LLM → tool_call → result → LLM → ... → answer
```

With `execute_code` (1 round-trip):

```
LLM → execute_code("""
for candidate in candidates:
    result = call_tool("search", {"query": candidate})
    if "found" in result:
        print(result)
        break
""") → answer
```
| Scenario | Use `execute_code`? |
|---|---|
| Task needs 1–2 simple tool calls with known parameters | No — direct calls |
| Task requires searching/iterating over multiple values | Yes |
| Task chains 3+ tool calls where output feeds into the next | Yes |
| Task needs to filter or aggregate large result sets | Yes |
| A tool call fails and the agent wants to retry with variations | Yes |
| Agent needs to reason carefully about each intermediate result | No — direct calls |
The adaptive analyzer tracks code execution usage across every batch:
- `tasks_used_code` / `tasks_no_code` — how many tasks used `execute_code`
- `code_pass_rate` vs `no_code_pass_rate` — pass rate with and without code execution
- `missed_opportunities` — tasks that made 15+ direct tool calls, failed, and never used `execute_code` (a strong signal the agent should have written a loop)
- `effective_patterns` — tasks where code execution was used and the agent passed
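A minimal sketch of how these metrics could be computed from per-task records (field names mirror the metrics above but are otherwise assumptions):

```python
def code_usage_stats(tasks):
    """tasks: dicts with 'used_code', 'passed', 'tool_calls' fields."""
    used = [t for t in tasks if t["used_code"]]
    no_code = [t for t in tasks if not t["used_code"]]
    rate = lambda ts: sum(t["passed"] for t in ts) / len(ts) if ts else 0.0
    return {
        "tasks_used_code": len(used),
        "tasks_no_code": len(no_code),
        "code_pass_rate": rate(used),
        "no_code_pass_rate": rate(no_code),
        # 15+ direct calls, failed, never used execute_code
        "missed_opportunities": [
            t for t in no_code if t["tool_calls"] >= 15 and not t["passed"]
        ],
        "effective_patterns": [t for t in used if t["passed"]],
    }
```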
When the engine detects missed opportunities, it:
- Flags the `missed_code_opportunity` failure pattern
- Reports it to the evolver LLM with specific task IDs
- On workspace preparation, patches the system prompt with code execution guidance and seeds a `code-execution-patterns` skill that teaches the agent when and how to use `execute_code`
When `AdaptiveEvolveEngine.prepare_workspace()` is called on a fresh workspace, it:
- Patches `prompts/system.md` with a code execution section (if not already present)
- Seeds `skills/code-execution-patterns/SKILL.md` with usage patterns (search loops, tool chaining, retry with fallbacks)
This ensures the agent knows about `execute_code` from the first solve cycle, before any evolution has occurred.
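A sketch of that preparation step, assuming the workspace layout described above; the patched text, marker string, and idempotency check are illustrative:

```python
from pathlib import Path

CODE_SECTION = ("## Code execution\n"
                "Use execute_code for search loops, tool chaining, "
                "and retries with fallbacks.\n")

def prepare_workspace(workspace: Path):
    """Patch the system prompt and seed the code-execution skill once."""
    system_md = workspace / "prompts" / "system.md"
    text = system_md.read_text() if system_md.exists() else ""
    if "## Code execution" not in text:  # only if not already present
        system_md.parent.mkdir(parents=True, exist_ok=True)
        system_md.write_text(text + "\n" + CODE_SECTION)

    skill = workspace / "skills" / "code-execution-patterns" / "SKILL.md"
    if not skill.exists():
        skill.parent.mkdir(parents=True, exist_ok=True)
        skill.write_text("# Code execution patterns\n"
                         "Search loops, tool chaining, retry with fallbacks.\n")
```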
| File | Contents |
|---|---|
| `analyzer.py` | `AdaptiveAnalyzer` — multi-layer analysis (claims, task types, judge feedback, failure patterns) |
| `prompts.py` | Prompt construction for the evolver LLM with claim/task breakdowns and evolution history |
| `__init__.py` | Public API exports |
The main engine implements both the `EvolutionEngine.step()` interface (for use inside the Evolver loop) and a standalone `evolve()` method.
```python
from agent_evolve.algorithms.adaptive_evolve import AdaptiveEvolveEngine

engine = AdaptiveEvolveEngine(
    config=config,
    llm=llm_provider,            # optional, auto-created from config
    improvement_threshold=0.02,  # minimum improvement to reset stagnation counter
    stagnation_window=5,         # cycles without improvement before rollback
    memory_cap=15,               # max memory entries
)
```

Composes four sub-analyzers into a single `AdaptiveAnalysisResult`:
- `ClaimAnalyzer` — classifies claims by type and tracks per-type pass rates
- `TaskTypeDetector` / `TaskTypePerformanceTracker` — detects and tracks task types
- `JudgeFeedbackMiner` — extracts recurring failure reasons from judge justifications
- `FailurePatternDetector` — identifies systematic failure patterns
```python
from agent_evolve.algorithms.adaptive_evolve import AdaptiveAnalyzer

analyzer = AdaptiveAnalyzer()
result = analyzer.analyze(observations, base_analysis, code_stats)

# result.weakest_claim_types → [("calculate", 0.40), ("compare", 0.55)]
# result.failure_patterns → [FailurePattern(pattern_name="multi_requirement_miss", ...)]
# result.evolution_recommendations → ["Create skill for 'calculate' claims", ...]
```

Controls prompt generation and evolver LLM constraints:
```python
from agent_evolve.algorithms.adaptive_evolve import AdaptivePromptConfig

cfg = AdaptivePromptConfig(
    prompt_max_chars=4000,           # system prompt length limit
    skill_max_chars=2000,            # per-skill length limit
    max_skills=15,                   # maximum number of skills
    include_claim_details=True,      # show per-claim breakdowns to evolver
    include_judge_patterns=True,     # show judge feedback patterns
    include_task_type_stats=True,    # show task-type performance
    include_evolution_history=True,  # show what worked/didn't in past cycles
)
```

Running inside the Evolver loop:

```python
import agent_evolve as ae

evolver = ae.Evolver(
    agent="mcp-atlas",
    benchmark="mcp-atlas",
    engine=AdaptiveEvolveEngine(config),
)
results = evolver.run(cycles=10)
```

Standalone evolution:

```python
engine = AdaptiveEvolveEngine(config)
result = engine.evolve(workspace, observation_logs, evo_number=3)

print(result["pass_rate"])            # 0.794
print(result["failure_patterns"])     # ["multi_requirement_miss", "near_miss"]
print(result["weakest_claim_types"])  # {"calculate": 0.40, "compare": 0.55}
print(result["rejected"])             # False (True if rolled back due to stagnation)
```
```bash
# Adaptive evolve (with evolution)
uv run python examples/mcp_examples/adaptive_evolve_all.py \
  --solver-model us.anthropic.claude-opus-4-6-v1 \
  --evolver-model us.anthropic.claude-opus-4-6-v1 \
  --judge-model us.anthropic.claude-opus-4-6-v1 \
  --region us-west-2 \
  --env-file .env \
  --docker-image ghcr.io/scaleapi/mcp-atlas:latest \
  --limit 500 \
  --batch-size 30

# Baseline (no evolution)
uv run python examples/mcp_examples/adaptive_evolve_baseline.py \
  --solver-model us.anthropic.claude-opus-4-6-v1 \
  --judge-model us.anthropic.claude-opus-4-6-v1 \
  --region us-west-2 \
  --env-file .env \
  --docker-image ghcr.io/scaleapi/mcp-atlas:latest \
  --limit 500 \
  --batch-size 30
```