Releases: raphaelchristi/harness-evolver

v6.4.2 — Rate-limit field fix (Codex review)

04 Apr 01:33

Fixes a regression found by Codex adversarial review in v6.4.1.

Fixed

  • Rate-limit detection checked the wrong field — read_results.py only checked run.error (the LangSmith run-level field, usually None for subprocess failures), but 429 errors from agent subprocesses land in run.outputs["error"]. Now checks both run.error and outputs.error with the RATE_LIMIT_RE regex.
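A minimal sketch of the two-field check (the regex below is an assumption; the actual RATE_LIMIT_RE in read_results.py may differ):

```python
import re

# Assumed pattern — specific enough not to match "curated" / "hydrate" / "karate".
RATE_LIMIT_RE = re.compile(r"(rate.?limit|429|too many requests)", re.IGNORECASE)

def is_rate_limited(run) -> bool:
    """True if either the run-level error or the subprocess error looks rate-limited."""
    candidates = [
        run.error,                         # LangSmith run-level error
        (run.outputs or {}).get("error"),  # agent subprocess error
    ]
    return any(c and RATE_LIMIT_RE.search(str(c)) for c in candidates)
```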

Install: npx harness-evolver@latest

Full changelog: https://github.com/raphaelchristi/harness-evolver/blob/main/CHANGELOG.md

v6.4.1 — Eval Reliability Fixes

04 Apr 01:19

Fixes evaluation reliability bugs discovered during real-world testing where best_score: 1.0 was inflated from a true score of ~0.90.

Fixed

  • Rate-limit false positives — Bare "rate" keyword matched "curated", "hydrate", "karate" in agent output. Now uses specific regex and only checks error fields.
  • Split-blind sampling — --sample 10 left only 2-3 held_out examples for comparison. The new --sample-split train flag ensures all held_out examples are always evaluated.
  • stderr head truncation — Error tracebacks were invisible because stderr[:500] captured startup banners. Now captures stderr[-500:].
  • Certify canary — Removed --no-canary so broken agents are caught immediately.

Added

  • Minimum N guard — Warning + low_confidence: true when comparison has < 5 scored examples.
  • --sample-split flag for split-aware sampling in run_eval.py.
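The minimum-N guard might look like this sketch (the low_confidence and num_scored field names come from the notes above; the surrounding structure is assumed):

```python
MIN_SCORED = 5  # threshold from these release notes

def summarize(scores: list[float]) -> dict:
    """Aggregate scores and flag low confidence when too few examples were scored."""
    result = {
        "num_scored": len(scores),
        "combined_score": sum(scores) / len(scores) if scores else None,
    }
    if len(scores) < MIN_SCORED:
        result["low_confidence"] = True
        print(f"WARNING: only {len(scores)} scored examples (< {MIN_SCORED})")
    return result
```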

Full changelog: https://github.com/raphaelchristi/harness-evolver/blob/main/CHANGELOG.md

v6.4.0 — Compound Learning + Score Certification

03 Apr 21:43

Two features, inspired by competing projects, that make evolution sessions permanent and scores verifiable.

Added

  • Compound Learning — tools/promote_learnings.py extracts proven insights (rec >= 5) from evolution_memory.md and appends them as permanent rules to CLAUDE.md. Each evolution permanently improves the project, not just the code.
  • /harness:certify — Runs eval 3x, reports mean ± std with STABLE/MARGINAL/UNSTABLE verdict. Verifies LLM-as-judge consistency before deploying.
  • Consolidator Phase 5: Promote — Flags high-recurrence insights as promotion candidates with clear anchored vs promoted terminology.
  • Deploy "Promote learnings" — 4th option in /harness:deploy with dry-run preview + user consent.

Inspired by Compound Engineering, Self-Improving Agent, and PluginEval Monte Carlo.

Full changelog: https://github.com/raphaelchristi/harness-evolver/blob/main/CHANGELOG.md

v6.3.2 — Fix CWD Resolution in Worktrees

03 Apr 20:38

Fixed

--show-toplevel → --git-common-dir in evolve skill. --show-toplevel returns the worktree root (wrong when CWD is inside a worktree); dirname $(git rev-parse --git-common-dir) returns the main repo root. Confirmed by end-to-end test.
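The same resolution can be sketched in Python (a hypothetical helper, not the skill's actual code):

```python
import os
import subprocess

def git(*args: str) -> str:
    return subprocess.check_output(["git", *args], text=True).strip()

def main_repo_root() -> str:
    """Main repo root, even when CWD is inside a linked worktree.

    `git rev-parse --show-toplevel` would return the *worktree* root here;
    `--git-common-dir` points at the main repo's .git directory, so its
    parent is the main repo root."""
    common = git("rev-parse", "--git-common-dir")
    return os.path.dirname(os.path.abspath(common))
```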

v6.3.1 — CWD Drift Fix

03 Apr 20:29

Fixed

$(pwd) → $(git rev-parse --show-toplevel) in evolve skill. Prevents double-nested worktree paths when CWD drifts during proposer spawn. Found in agno-deepknowledge end-to-end test.

v6.3.0 — 5 Verdict Improvements

03 Apr 20:05

What's New

5 improvements from the testing agent's verdict (P0+P1+P2).

P0: update_config.py

Replaces manual inline Python for config updates. Three actions:

  • --action backup before merge
  • --action restore after a merge overwrites it
  • --action update --winner-experiment X --winner-score Y

P0: cleanup_worktrees.py

Removes orphan worktrees after eval. Prevents 6+ worktree accumulation.
--dry-run to preview, --keep <name> to preserve specific ones.

P1: --retry-on-rate-limit in run_eval.py

When the flag is set and a run hits a rate limit, run_eval.py waits 60s and suggests a re-run.

P1: Evolve skill simplified

Merge + config update is now 4 tool calls instead of 10 lines of inline Python.
Worktree cleanup added at end of post-iteration.

P2: Rubric pinning

Evaluator includes rubric text in feedback comment (RUBRIC: ... JUDGMENT: ...).
Makes scores reproducible and diagnosable across iterations.

16/16 tests passing.

🤖 Claude + Codex/GPT-5.4

v6.2.0 — Evolution Tracing to LangSmith

03 Apr 19:57

What's New

Evolution Tracing

Each iteration logged as a LangSmith run with score, approach, lens, duration, and merge decision. Creates a persistent timeline in LangSmith UI — no more losing evolution history when the terminal scrolls away.

# Start iteration (returns run_id + dotted_order)
python tools/log_iteration.py --config .evolver.json --action start --version v001

# End iteration (update with results)
python tools/log_iteration.py --action end --run-id <id> --score 0.85 --merged true

Proposer Trace Nesting

CC_LANGSMITH_PARENT_DOTTED_ORDER passed to proposer environment. With the langsmith-tracing companion, proposer tool calls (reads, edits, commits) nest hierarchically under iteration runs:

iteration-v002 → Proposer 1 → Read strategy.md → Edit agent.py → Commit
              → Proposer 2 → Read trace_insights.json → Edit prompt.py → Commit
              → Eval (10 runs)
              → LLM Judge
              → Merge

Companion Plugin Recommended

README + setup skill now recommend installing langsmith-tracing for full observability.

14/14 tests passing.

🤖 Claude + Codex/GPT-5.4

v6.1.0 — Config Merge Protection + Rate-Limit Score Exclusion

03 Apr 19:17

What's Fixed

From first real multi-iteration run (agno-deepknowledge: baseline 0.575 → v002 0.950).

Config preserved across merges

git merge from worktrees silently overwrote .evolver.json with the stale copy. Fix: backup → merge → restore → update. Previously caused .evolver.json to show iterations: 0 after successful merges.
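The backup → merge → restore → update sequence, sketched with the update_config.py actions introduced in v6.3.0 (the winner arguments shown are illustrative):

```python
import subprocess

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

def merge_with_config_protection(branch: str, score: str) -> None:
    """Merge a worktree branch without letting git clobber .evolver.json."""
    run("python", "tools/update_config.py", "--action", "backup")
    run("git", "merge", "--no-edit", branch)
    run("python", "tools/update_config.py", "--action", "restore")
    run("python", "tools/update_config.py", "--action", "update",
        "--winner-experiment", branch, "--winner-score", score)
```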

Rate-limited runs excluded from scores

  • Evaluator agent: skips 429 runs entirely (no feedback written)
  • read_results.py: filters rate-limited runs from combined_score, reports num_scored separately

Before: 4/10 correct + 6/10 rate-limited = score 0.4
After: 4/4 scored (rate-limited excluded) = score 1.0
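A sketch of the exclusion arithmetic (the regex and result-dict shape are assumptions; read_results.py may differ):

```python
import re

RATE_LIMIT_RE = re.compile(r"(rate.?limit|429)", re.IGNORECASE)  # assumed pattern

def combined_score(runs: list[dict]) -> dict:
    """Mean over scored runs only; rate-limited runs are excluded, not zeroed."""
    scored = [r for r in runs
              if not RATE_LIMIT_RE.search(str(r.get("error") or ""))]
    return {
        "combined_score": sum(r["score"] for r in scored) / len(scored) if scored else None,
        "num_scored": len(scored),
        "num_rate_limited": len(runs) - len(scored),
    }
```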

Namespace cleanup

Last /evolver: references in setup.py and installer renamed to /harness:.

🤖 Claude + Codex/GPT-5.4

v6.0.2 — Fix Preflight Key Validation, Venv Warning, Namespace

03 Apr 17:59

Fixed

  • Preflight rejects dummy API keys — check_api_key() now validates format (30+ chars, no lsv2_pt_test*). A dummy key in the credentials file no longer passes [1/5] silently.
  • Setup warns about missing project venv — Detects no .venv/ and warns. Also warns if entry_point uses ~/.evolver/venv (tools-only).
  • Remaining /evolver: → /harness: — 4 Python tools had stale namespace references in print messages.

🤖 Claude + Codex/GPT-5.4

v6.0.1 — Validate API Keys + Venv Warning

03 Apr 17:52

Fixed

  • API key validation — ensure_langsmith_api_key() rejects dummy/test keys (< 30 chars or lsv2_pt_test*). Prints a warning and tries the next source instead of silently using an invalid key that causes a 403.
  • Setup warns when no project venv — Explicitly tells user NOT to use ~/.evolver/venv as entry_point. Instructs to create project venv first.

🤖 Claude + Codex/GPT-5.4