Releases: raphaelchristi/harness-evolver

v6.4.2 — Rate-limit field fix (Codex review)

04 Apr 01:33

Fixes a regression found by Codex adversarial review in v6.4.1.

Fixed

  • Rate-limit detection checked the wrong field — read_results.py only checked run.error (the LangSmith run-level field, usually None for subprocess failures), but 429 errors from agent subprocesses land in run.outputs["error"]. Now checks both run.error and outputs.error with the RATE_LIMIT_RE regex.
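A minimal sketch of the two-field check (the regex below is an assumption; the actual RATE_LIMIT_RE in read_results.py may differ):

```python
import re

# Assumed pattern — specific enough not to match "curated" / "hydrate" / "karate".
RATE_LIMIT_RE = re.compile(r"(rate.?limit|429|too many requests)", re.IGNORECASE)

def is_rate_limited(run) -> bool:
    """True if either the run-level error or the subprocess error looks rate-limited."""
    candidates = [
        run.error,                         # LangSmith run-level error
        (run.outputs or {}).get("error"),  # agent subprocess error
    ]
    return any(c and RATE_LIMIT_RE.search(str(c)) for c in candidates)
```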

Install: npx harness-evolver@latest

Full changelog: https://github.com/raphaelchristi/harness-evolver/blob/main/CHANGELOG.md

v6.4.1 — Eval Reliability Fixes

04 Apr 01:19

Fixes evaluation reliability bugs discovered during real-world testing where best_score: 1.0 was inflated from a true score of ~0.90.

Fixed

  • Rate-limit false positives — Bare "rate" keyword matched "curated", "hydrate", "karate" in agent output. Now uses specific regex and only checks error fields.
  • Split-blind sampling — --sample 10 left only 2-3 held_out examples for comparison. The new --sample-split train flag ensures all held_out examples are always evaluated.
  • stderr head truncation — Error tracebacks were invisible because stderr[:500] captured startup banners. Now captures stderr[-500:].
  • Certify canary — Removed --no-canary so broken agents are caught immediately.

Added

  • Minimum N guard — Warning + low_confidence: true when comparison has < 5 scored examples.
  • --sample-split flag for split-aware sampling in run_eval.py.
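The minimum-N guard might look like this sketch (the low_confidence and num_scored field names come from the notes above; the surrounding structure is assumed):

```python
MIN_SCORED = 5  # threshold from these release notes

def summarize(scores: list[float]) -> dict:
    """Aggregate scores and flag low confidence when too few examples were scored."""
    result = {
        "num_scored": len(scores),
        "combined_score": sum(scores) / len(scores) if scores else None,
    }
    if len(scores) < MIN_SCORED:
        result["low_confidence"] = True
        print(f"WARNING: only {len(scores)} scored examples (< {MIN_SCORED})")
    return result
```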

Full changelog: https://github.com/raphaelchristi/harness-evolver/blob/main/CHANGELOG.md

v6.4.0 — Compound Learning + Score Certification

03 Apr 21:43

Two features, inspired by competing projects, that make evolution sessions permanent and scores verifiable.

Added

  • Compound Learning — tools/promote_learnings.py extracts proven insights (rec >= 5) from evolution_memory.md and appends them as permanent rules to CLAUDE.md. Each evolution permanently improves the project, not just the code.
  • /harness:certify — Runs eval 3x, reports mean ± std with STABLE/MARGINAL/UNSTABLE verdict. Verifies LLM-as-judge consistency before deploying.
  • Consolidator Phase 5: Promote — Flags high-recurrence insights as promotion candidates with clear anchored vs promoted terminology.
  • Deploy "Promote learnings" — 4th option in /harness:deploy with dry-run preview + user consent.

Inspired by Compound Engineering, Self-Improving Agent, and PluginEval Monte Carlo.

Full changelog: https://github.com/raphaelchristi/harness-evolver/blob/main/CHANGELOG.md

v6.3.2 — Fix CWD Resolution in Worktrees

03 Apr 20:38

Fixed

--show-toplevel → --git-common-dir in evolve skill. --show-toplevel returns the worktree root (wrong when CWD is inside a worktree); dirname $(git rev-parse --git-common-dir) returns the main repo root. Confirmed by end-to-end test.
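The same resolution can be sketched in Python (a hypothetical helper, not the skill's actual code):

```python
import os
import subprocess

def git(*args: str) -> str:
    return subprocess.check_output(["git", *args], text=True).strip()

def main_repo_root() -> str:
    """Main repo root, even when CWD is inside a linked worktree.

    `git rev-parse --show-toplevel` would return the *worktree* root here;
    `--git-common-dir` points at the main repo's .git directory, so its
    parent is the main repo root."""
    common = git("rev-parse", "--git-common-dir")
    return os.path.dirname(os.path.abspath(common))
```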

v6.3.1 — CWD Drift Fix

03 Apr 20:29

Fixed

$(pwd) → $(git rev-parse --show-toplevel) in evolve skill. Prevents double-nested worktree paths when CWD drifts during proposer spawn. Found in agno-deepknowledge end-to-end test.

v6.3.0 — 5 Verdict Improvements

03 Apr 20:05

What's New

5 improvements from the testing agent's verdict (P0+P1+P2).

P0: update_config.py

Replaces manual inline Python for config updates. Three actions:

  • --action backup before merge
  • --action restore after a merge overwrites it
  • --action update --winner-experiment X --winner-score Y

P0: cleanup_worktrees.py

Removes orphan worktrees after eval. Prevents 6+ worktree accumulation.
--dry-run to preview, --keep <name> to preserve specific ones.

P1: --retry-on-rate-limit in run_eval.py

When the flag is set and a run hits a rate limit, run_eval.py waits 60s and suggests a re-run.

P1: Evolve skill simplified

Merge + config update is now 4 tool calls instead of 10 lines of inline Python.
Worktree cleanup added at end of post-iteration.

P2: Rubric pinning

Evaluator includes rubric text in feedback comment (RUBRIC: ... JUDGMENT: ...).
Makes scores reproducible and diagnosable across iterations.

16/16 tests passing.

🤖 Claude + Codex/GPT-5.4

v6.2.0 — Evolution Tracing to LangSmith

03 Apr 19:57

What's New

Evolution Tracing

Each iteration logged as a LangSmith run with score, approach, lens, duration, and merge decision. Creates a persistent timeline in LangSmith UI — no more losing evolution history when the terminal scrolls away.

# Start iteration (returns run_id + dotted_order)
python tools/log_iteration.py --config .evolver.json --action start --version v001

# End iteration (update with results)
python tools/log_iteration.py --action end --run-id <id> --score 0.85 --merged true

Proposer Trace Nesting

CC_LANGSMITH_PARENT_DOTTED_ORDER passed to proposer environment. With the langsmith-tracing companion, proposer tool calls (reads, edits, commits) nest hierarchically under iteration runs:

iteration-v002 → Proposer 1 → Read strategy.md → Edit agent.py → Commit
              → Proposer 2 → Read trace_insights.json → Edit prompt.py → Commit
              → Eval (10 runs)
              → LLM Judge
              → Merge

Companion Plugin Recommended

README + setup skill now recommend installing langsmith-tracing for full observability.

14/14 tests passing.

🤖 Claude + Codex/GPT-5.4

v6.1.0 — Config Merge Protection + Rate-Limit Score Exclusion

03 Apr 19:17

What's Fixed

From first real multi-iteration run (agno-deepknowledge: baseline 0.575 → v002 0.950).

Config preserved across merges

git merge from worktrees silently overwrote .evolver.json with the stale copy. Fix: backup → merge → restore → update. Previously caused .evolver.json to show iterations: 0 after successful merges.
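The backup → merge → restore → update sequence, sketched with the update_config.py actions introduced in v6.3.0 (the winner arguments shown are illustrative):

```python
import subprocess

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

def merge_with_config_protection(branch: str, score: str) -> None:
    """Merge a worktree branch without letting git clobber .evolver.json."""
    run("python", "tools/update_config.py", "--action", "backup")
    run("git", "merge", "--no-edit", branch)
    run("python", "tools/update_config.py", "--action", "restore")
    run("python", "tools/update_config.py", "--action", "update",
        "--winner-experiment", branch, "--winner-score", score)
```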

Rate-limited runs excluded from scores

  • Evaluator agent: skips 429 runs entirely (no feedback written)
  • read_results.py: filters rate-limited runs from combined_score, reports num_scored separately

Before: 4/10 correct + 6/10 rate-limited = score 0.4
After: 4/4 scored (rate-limited excluded) = score 1.0
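A sketch of the exclusion arithmetic (the regex and result-dict shape are assumptions; read_results.py may differ):

```python
import re

RATE_LIMIT_RE = re.compile(r"(rate.?limit|429)", re.IGNORECASE)  # assumed pattern

def combined_score(runs: list[dict]) -> dict:
    """Mean over scored runs only; rate-limited runs are excluded, not zeroed."""
    scored = [r for r in runs
              if not RATE_LIMIT_RE.search(str(r.get("error") or ""))]
    return {
        "combined_score": sum(r["score"] for r in scored) / len(scored) if scored else None,
        "num_scored": len(scored),
        "num_rate_limited": len(runs) - len(scored),
    }
```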

Namespace cleanup

Last /evolver: references in setup.py and installer renamed to /harness:.

🤖 Claude + Codex/GPT-5.4

v6.0.2 — Fix Preflight Key Validation, Venv Warning, Namespace

03 Apr 17:59

Fixed

  • Preflight rejects dummy API keys — check_api_key() now validates format (30+ chars, no lsv2_pt_test*). A dummy key in the credentials file no longer passes [1/5] silently.
  • Setup warns about missing project venv — Detects no .venv/ and warns. Also warns if entry_point uses ~/.evolver/venv (tools-only).
  • Remaining /evolver: → /harness: — 4 Python tools had stale namespace references in print messages.

🤖 Claude + Codex/GPT-5.4

v6.0.1 — Validate API Keys + Venv Warning

03 Apr 17:52

Fixed

  • API key validation — ensure_langsmith_api_key() rejects dummy/test keys (< 30 chars or lsv2_pt_test*). Prints a warning and tries the next source instead of silently using an invalid key that causes a 403.
  • Setup warns when no project venv — Explicitly tells user NOT to use ~/.evolver/venv as entry_point. Instructs to create project venv first.

🤖 Claude + Codex/GPT-5.4