feat: add cast eval command for agent evaluation pipelines #9
Summary
This PR introduces `cast eval`, an evaluation pipeline that lets developers run a test suite against their agent before deploying, and gate CI on the results. It also adds `cast eval:init` to scaffold a starter `castari.eval.json` in the current directory.

Evaluation pipelines are now a standard layer in AI agent frameworks. Without this, there is no way to catch regressions between deploys: an agent that worked yesterday could silently break today.
New Commands
- `cast eval`
- `cast eval:init`
- `castari.eval.json` starter file

How It Works
Define test cases in `castari.eval.json` at your project root:

```json
{
  "name": "My Agent Suite",
  "cases": [
    {
      "name": "returns valid Python",
      "input": "write a function that adds two numbers",
      "assert": [
        { "type": "contains", "expected": "def " },
        { "type": "regex", "pattern": "def \\w+\\(" },
        { "type": "not-contains", "expected": "SyntaxError" }
      ]
    }
  ]
}
```

Then run:
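```bash
cast eval
```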
Grader Types
All four graders are deterministic and require no API keys:

- `exact`
- `contains`
- `not-contains`
- `regex`

Plus an optional `llm-judge` grader for semantic scoring (see below).
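As a rough illustration of why no keys are needed (a sketch, not the PR's actual implementation), each deterministic grader reduces to a pure string check:

```ts
// Illustrative sketch only; the PR's real grader code may differ.
type Assertion =
  | { type: "exact"; expected: string }
  | { type: "contains"; expected: string }
  | { type: "not-contains"; expected: string }
  | { type: "regex"; pattern: string };

function grade(output: string, a: Assertion): boolean {
  switch (a.type) {
    case "exact":        return output === a.expected;
    case "contains":     return output.includes(a.expected);
    case "not-contains": return !output.includes(a.expected);
    case "regex":        return new RegExp(a.pattern).test(output);
  }
}
```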
Files Added

Tests
All tests pass (`pnpm test`).

- `graders.test.ts`: all grader types, case sensitivity, edge cases
- `loader.test.ts`: valid/invalid suite structures, every validation branch, defaults, error messages
- `runner.test.ts`: real agent scripts spawned in temp dirs, timeout handling, tag/filter filtering, concurrency, progress callbacks

Fixed race conditions in temporary-directory creation that caused flaky FS tests (sketched below).
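One standard way to avoid that class of race (a sketch under that assumption, not necessarily the PR's exact fix) is to let the OS mint a unique directory per test with `fs.mkdtemp` instead of hand-building paths:

```ts
import { mkdtemp, rm } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// mkdtemp atomically creates a directory with a unique random suffix,
// so concurrent tests can never collide on the same path.
async function withTempDir<T>(fn: (dir: string) => Promise<T>): Promise<T> {
  const dir = await mkdtemp(join(tmpdir(), "cast-eval-test-"));
  try {
    return await fn(dir);
  } finally {
    await rm(dir, { recursive: true, force: true });
  }
}
```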
`llm-judge` Grader (Optional)

For cases where string matching isn't enough, `llm-judge` scores the agent's response 0–1 against a plain-English rubric:

```json
{
  "type": "llm-judge",
  "rubric": "The agent should decline to answer and not fabricate numbers.",
  "threshold": 0.8
}
```

The grader is provider-agnostic: it accepts any OpenAI-compatible endpoint, so users aren't locked into a single paid provider. The default is Groq (free tier available):
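(Illustrative sketch only; the variable names below are hypothetical, not the PR's documented interface.)

```bash
# Hypothetical variable names, for illustration only.
export LLM_JUDGE_BASE_URL="https://api.groq.com/openai/v1"  # Groq's OpenAI-compatible endpoint
export LLM_JUDGE_API_KEY="gsk_..."                          # Groq API key (free tier available)
export LLM_JUDGE_MODEL="llama-3.1-8b-instant"
```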
To use Ollama locally instead (completely free, no key needed):
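(Again illustrative, with the same hypothetical variable names; Ollama serves an OpenAI-compatible API locally by default.)

```bash
# Ollama's OpenAI-compatible endpoint; no API key is required.
export LLM_JUDGE_BASE_URL="http://localhost:11434/v1"
export LLM_JUDGE_MODEL="llama3.1"
```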
The four deterministic graders (`exact`, `contains`, `not-contains`, `regex`) never require any key.

No Breaking Changes
Purely additive. The only existing file modified is `src/index.ts`, which registers the new commands.

How to Verify
- `pnpm test`: all tests pass
- `pnpm build && cd packages/cli && pnpm link --global`
- `cast eval:init`: generates `castari.eval.json` in the current directory
- `cast eval`: runs the suite against your agent with formatted output
- `cast eval --ci`: confirms exit code 1 on failure (for CI pipeline testing)