# Releases: raphaelchristi/harness-evolver
## v6.4.2 — Rate-limit field fix (Codex review)

Fixes a regression found by Codex adversarial review in v6.4.1.

### Fixed

- Rate-limit detection checked the wrong field — `read_results.py` only checked `run.error` (the LangSmith run-level field, usually `None` for subprocess failures), but 429 errors from agent subprocesses go into `run.outputs["error"]`. Now checks both `run.error` and `outputs.error` with the `RATE_LIMIT_RE` regex.
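A minimal sketch of the dual-field check, assuming a LangSmith-style run object with `error` and `outputs` attributes; the pattern below is illustrative, not the exact `RATE_LIMIT_RE` from `read_results.py`:

```python
import re

# Hypothetical pattern; the real RATE_LIMIT_RE lives in read_results.py.
RATE_LIMIT_RE = re.compile(r"\b429\b|rate.?limit", re.IGNORECASE)

def is_rate_limited(run) -> bool:
    """True if either the run-level error or the subprocess error
    recorded in outputs matches the rate-limit pattern."""
    candidates = [
        run.error,                         # run-level: usually None for subprocess failures
        (run.outputs or {}).get("error"),  # where agent-subprocess 429s actually land
    ]
    return any(c and RATE_LIMIT_RE.search(str(c)) for c in candidates)
```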
Install: `npx harness-evolver@latest`
Full changelog: https://github.com/raphaelchristi/harness-evolver/blob/main/CHANGELOG.md
## v6.4.1 — Eval Reliability Fixes

Fixes evaluation reliability bugs discovered during real-world testing, where a reported `best_score: 1.0` was inflated from a true score of ~0.90.

### Fixed

- Rate-limit false positives — the bare `"rate"` keyword matched "curated", "hydrate", and "karate" in agent output. Now uses a specific regex and only checks error fields.
- Split-blind sampling — `--sample 10` left only 2-3 held_out examples for comparison. The new `--sample-split train` flag ensures all held_out examples are always evaluated.
- stderr head truncation — error tracebacks were invisible because `stderr[:500]` captured startup banners. Now captures `stderr[-500:]`.
- Certify canary — removed `--no-canary` so broken agents are caught immediately.
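The stderr truncation fix amounts to slicing from the tail instead of the head. A sketch (the helper name is hypothetical; the real code slices inline):

```python
def error_snippet(stderr: str, limit: int = 500) -> str:
    """Return the last `limit` characters of stderr. Python tracebacks
    print at the end of the stream, while the head is typically startup
    banners that hide the actual failure."""
    return stderr[-limit:]
```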
### Added

- Minimum N guard — warning + `low_confidence: true` when a comparison has < 5 scored examples.
- `--sample-split` flag for split-aware sampling in `run_eval.py`.
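A hedged sketch of the minimum-N guard; the field names are assumed from the release note, not read from the source:

```python
def summarize_comparison(scores: list[float], min_n: int = 5) -> dict:
    """Summarize a comparison run; flag low confidence when fewer than
    min_n examples were actually scored (field names assumed)."""
    result = {
        "mean": sum(scores) / len(scores),
        "num_scored": len(scores),
    }
    if len(scores) < min_n:
        # Too few examples to trust the mean; caller should warn the user.
        result["low_confidence"] = True
    return result
```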
Full changelog: https://github.com/raphaelchristi/harness-evolver/blob/main/CHANGELOG.md
## v6.4.0 — Compound Learning + Score Certification

Two competitor-inspired features that make evolution sessions permanent and scores verifiable.

### Added

- Compound Learning — `tools/promote_learnings.py` extracts proven insights (rec >= 5) from `evolution_memory.md` and appends them as permanent rules to CLAUDE.md. Each evolution permanently improves the project, not just the code.
- `/harness:certify` — runs the eval 3x and reports mean ± std with a STABLE/MARGINAL/UNSTABLE verdict. Verifies LLM-as-judge consistency before deploying.
- Consolidator Phase 5: Promote — flags high-recurrence insights as promotion candidates, with clear anchored vs. promoted terminology.
- Deploy "Promote learnings" — 4th option in `/harness:deploy`, with dry-run preview + user consent.
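The certify verdict could look roughly like the following; the thresholds here are illustrative assumptions, not the plugin's actual cutoffs:

```python
import statistics

def certify(scores: list[float], stable_std: float = 0.02,
            marginal_std: float = 0.05) -> tuple[float, float, str]:
    """Judge run-to-run consistency over repeated eval scores.
    Thresholds are hypothetical; /harness:certify may use different ones."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample std over the repeated runs
    if std <= stable_std:
        verdict = "STABLE"
    elif std <= marginal_std:
        verdict = "MARGINAL"
    else:
        verdict = "UNSTABLE"
    return mean, std, verdict
```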
Inspired by Compound Engineering, Self-Improving Agent, and PluginEval Monte Carlo.
Full changelog: https://github.com/raphaelchristi/harness-evolver/blob/main/CHANGELOG.md
## v6.3.2 — Fix CWD Resolution in Worktrees

### Fixed

- `--show-toplevel` → `--git-common-dir` in the evolve skill. `--show-toplevel` returns the worktree root (wrong when the CWD is inside a worktree); `dirname $(git rev-parse --git-common-dir)` returns the main repo root. Confirmed by end-to-end test.
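The difference is easy to see from inside a linked worktree. A self-contained demo in a throwaway repo (paths are illustrative):

```shell
# Create a main repo and a linked worktree, then compare the two commands.
tmp=$(mktemp -d)
git -C "$tmp" init -q main
git -C "$tmp/main" -c user.email=demo@x -c user.name=demo \
    commit -q --allow-empty -m init
git -C "$tmp/main" worktree add -q "$tmp/wt"
cd "$tmp/wt"

git rev-parse --show-toplevel                # the worktree root ($tmp/wt): wrong base
dirname "$(git rev-parse --git-common-dir)"  # the main repo root ($tmp/main): correct
```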
## v6.3.1 — CWD Drift Fix

### Fixed

- `$(pwd)` → `$(git rev-parse --show-toplevel)` in the evolve skill. Prevents double-nested worktree paths when the CWD drifts during proposer spawn. Found in the agno-deepknowledge end-to-end test.
## v6.3.0 — 5 Verdict Improvements

### What's New

Five improvements from the testing agent's verdict (P0+P1+P2).
#### P0: update_config.py

Replaces manual inline Python for config updates. Three actions:

- `--action backup` before merge
- `--action restore` after a merge overwrites the config
- `--action update --winner-experiment X --winner-score Y`
#### P0: cleanup_worktrees.py

Removes orphan worktrees after eval, preventing accumulation of 6+ worktrees. `--dry-run` to preview, `--keep <name>` to preserve specific ones.
#### P1: --retry-on-rate-limit in run_eval.py

When rate-limited and this flag is set, waits 60s and suggests a re-run.
#### P1: Evolve skill simplified

Merge + config update is now 4 tool calls instead of 10 lines of inline Python. Worktree cleanup added at the end of post-iteration.
#### P2: Rubric pinning

The evaluator includes the rubric text in its feedback comment (`RUBRIC: ... JUDGMENT: ...`), making scores reproducible and diagnosable across iterations.
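In essence, rubric pinning embeds the rubric next to the judgment. A sketch with the comment format assumed from the release note:

```python
def pinned_feedback(rubric: str, judgment: str) -> str:
    """Embed the rubric alongside the judgment so a score can be
    re-derived later, even if the rubric file changes between iterations.
    (Hypothetical helper; the real evaluator builds this string inline.)"""
    return f"RUBRIC: {rubric}\nJUDGMENT: {judgment}"
```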
16/16 tests passing.
🤖 Claude + Codex/GPT-5.4
## v6.2.0 — Evolution Tracing to LangSmith

### What's New

#### Evolution Tracing

Each iteration is logged as a LangSmith run with score, approach, lens, duration, and merge decision. This creates a persistent timeline in the LangSmith UI — no more losing evolution history when the terminal scrolls away.
```shell
# Start iteration (returns run_id + dotted_order)
python tools/log_iteration.py --config .evolver.json --action start --version v001

# End iteration (update with results)
python tools/log_iteration.py --action end --run-id <id> --score 0.85 --merged true
```

#### Proposer Trace Nesting

`CC_LANGSMITH_PARENT_DOTTED_ORDER` is passed to the proposer environment. With the langsmith-tracing companion, proposer tool calls (reads, edits, commits) nest hierarchically under iteration runs:
```text
iteration-v002 → Proposer 1 → Read strategy.md → Edit agent.py → Commit
               → Proposer 2 → Read trace_insights.json → Edit prompt.py → Commit
               → Eval (10 runs)
               → LLM Judge
               → Merge
```
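Passing the parent dotted order down to a spawned proposer is just environment plumbing. A sketch (the function name is hypothetical):

```python
import os
import subprocess

def spawn_proposer(cmd: list[str], parent_dotted_order: str) -> subprocess.CompletedProcess:
    """Spawn a proposer subprocess with the iteration's dotted order in its
    environment, so a tracing hook inside the child can nest its runs
    under the iteration run."""
    env = dict(os.environ, CC_LANGSMITH_PARENT_DOTTED_ORDER=parent_dotted_order)
    return subprocess.run(cmd, env=env, capture_output=True, text=True)
```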
#### Companion Plugin Recommended

The README + setup skill now recommend installing langsmith-tracing for full observability.
14/14 tests passing.
🤖 Claude + Codex/GPT-5.4
## v6.1.0 — Config Merge Protection + Rate-Limit Score Exclusion

### What's Fixed

From the first real multi-iteration run (agno-deepknowledge: baseline 0.575 → v002 0.950).

#### Config preserved across merges

`git merge` from worktrees silently overwrote `.evolver.json` with the stale copy. Fix: backup → merge → restore → update. Previously this caused `.evolver.json` to show `iterations: 0` after successful merges.
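The backup → merge → restore sequence can be sketched like this (hypothetical helper; the real flow also runs an update step afterwards):

```python
import subprocess
from pathlib import Path

def merge_with_config_guard(repo: Path, branch: str,
                            config_name: str = ".evolver.json") -> None:
    """Merge a worktree branch without letting its stale config clobber
    the main branch's copy: snapshot the file, merge, write it back."""
    cfg = repo / config_name
    saved = cfg.read_bytes()                                   # backup
    subprocess.run(["git", "-C", str(repo), "merge", "--no-edit", branch],
                   check=True)                                 # merge (may overwrite cfg)
    cfg.write_bytes(saved)                                     # restore
```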
#### Rate-limited runs excluded from scores

- Evaluator agent: skips 429 runs entirely (no feedback written)
- `read_results.py`: filters rate-limited runs from `combined_score`, reports `num_scored` separately

Before: 4/10 correct + 6/10 rate-limited = score 0.4. After: 4/4 scored (rate-limited excluded) = score 1.0.
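In scoring terms, the exclusion drops rate-limited runs from the denominator instead of counting them as zeros. A sketch with assumed field names:

```python
def combined_score(runs: list[dict]) -> dict:
    """Average only the runs that were actually scored; rate-limited
    runs no longer drag the mean down as zeros (field names assumed)."""
    scored = [r["score"] for r in runs if not r.get("rate_limited")]
    return {
        "combined_score": sum(scored) / len(scored) if scored else None,
        "num_scored": len(scored),
        "num_rate_limited": len(runs) - len(scored),
    }
```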
#### Namespace cleanup

The last `/evolver:` references in setup.py and the installer were renamed to `/harness:`.
🤖 Claude + Codex/GPT-5.4
## v6.0.2 — Fix Preflight Key Validation, Venv Warning, Namespace

### Fixed

- Preflight rejects dummy API keys — `check_api_key()` now validates format (30+ chars, no `lsv2_pt_test*`). A dummy key in the credentials file no longer passes [1/5] silently.
- Setup warns about missing project venv — detects when there is no `.venv/` and warns. Also warns if the entry_point uses `~/.evolver/venv` (tools-only).
- Remaining `/evolver:` → `/harness:` — 4 Python tools had stale namespace references in print messages.
🤖 Claude + Codex/GPT-5.4
## v6.0.1 — Validate API Keys + Venv Warning

### Fixed

- API key validation — `ensure_langsmith_api_key()` rejects dummy/test keys (< 30 chars or `lsv2_pt_test*`). Prints a warning and tries the next source instead of silently using an invalid key that causes 403s.
- Setup warns when there is no project venv — explicitly tells the user NOT to use `~/.evolver/venv` as the entry_point, and instructs them to create a project venv first.
🤖 Claude + Codex/GPT-5.4