
feat: add cast eval command for agent evaluation pipelines #9

Open

Abhiram-Vengala wants to merge 1 commit into castari:main from Abhiram-Vengala:feat/cast-eval

Conversation

@Abhiram-Vengala

## Summary

This PR introduces `cast eval`, an evaluation pipeline that lets developers run a test suite against their agent before deploying and gate CI on the results. It also adds `cast eval:init` to scaffold a starter `castari.eval.json` in the current directory.

Evaluation pipelines are now a standard layer in AI agent frameworks. Without one, there's no way to catch regressions between deploys: an agent that worked yesterday could silently break today.


## New Commands

| Command | Description |
| --- | --- |
| `cast eval` | Runs the evaluation suite against your agent |
| `cast eval:init` | Scaffolds a `castari.eval.json` starter file |

## How It Works

Define test cases in `castari.eval.json` at your project root:

```json
{
  "name": "My Agent Suite",
  "cases": [
    {
      "name": "returns valid Python",
      "input": "write a function that adds two numbers",
      "assert": [
        { "type": "contains", "expected": "def " },
        { "type": "regex", "pattern": "def \\w+\\(" },
        { "type": "not-contains", "expected": "SyntaxError" }
      ]
    }
  ]
}
```

Then run:

```bash
cast eval                          # interactive, with spinner
cast eval --ci                     # CI mode, exits 1 on any failure
cast eval --tag smoke              # run only cases tagged "smoke"
cast eval --filter "greeting"      # run cases matching name substring
cast eval --concurrency 4          # run 4 cases in parallel
cast eval --output report.json     # write structured JSON report
```
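
For reference, here is a rough sketch of the suite shape the config above implies. This is illustrative, not the literal contents of `types.ts`; the `tags` and `timeoutMs` fields are assumptions inferred from the `--tag` flag and the runner's timeout handling:

```typescript
// Hedged sketch of the suite shape implied by castari.eval.json above.
// tags and timeoutMs are assumptions, not confirmed field names.
interface EvalSuite {
  name: string;
  cases: EvalCase[];
}

interface EvalCase {
  name: string;          // shown in the pass/fail summary
  input: string;         // prompt sent to the agent
  assert: Assertion[];   // all assertions must pass for the case to pass
  tags?: string[];       // assumption: matched by `cast eval --tag`
  timeoutMs?: number;    // assumption: per-case timeout enforced by the runner
}

type Assertion =
  | { type: "exact"; expected: string }
  | { type: "contains"; expected: string }
  | { type: "not-contains"; expected: string }
  | { type: "regex"; pattern: string }
  | { type: "llm-judge"; rubric: string; threshold: number };
```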

## Grader Types

All four graders are deterministic and require no API keys:

| Type | Description |
| --- | --- |
| `exact` | Response must equal the expected string |
| `contains` | Response must contain the expected string |
| `not-contains` | Response must NOT contain the expected string |
| `regex` | Response must match the pattern |

Plus an optional `llm-judge` grader for semantic scoring (see below).
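
To make the grading semantics concrete, here is a minimal sketch of how the four deterministic graders could be implemented (illustrative only; the actual logic lives in `packages/cli/src/eval/graders.ts`):

```typescript
// Hedged sketch of the deterministic graders; the assertion type is
// re-declared here so the example is self-contained.
type DeterministicAssertion =
  | { type: "exact"; expected: string }
  | { type: "contains"; expected: string }
  | { type: "not-contains"; expected: string }
  | { type: "regex"; pattern: string };

function grade(response: string, assertion: DeterministicAssertion): boolean {
  switch (assertion.type) {
    case "exact":
      return response === assertion.expected;
    case "contains":
      return response.includes(assertion.expected);
    case "not-contains":
      return !response.includes(assertion.expected);
    case "regex":
      // Patterns arrive via JSON, so backslashes are escaped there (e.g. "\\w+").
      return new RegExp(assertion.pattern).test(response);
  }
}
```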


## Files Added

```text
packages/cli/src/
├── commands/
│   └── eval.ts          ← cast eval + cast eval:init commands
└── eval/
    ├── types.ts         ← shared TypeScript types for test suites
    ├── graders.ts       ← grading logic (exact, contains, regex, llm-judge)
    ├── loader.ts        ← config loading, validation, default-filling
    ├── runner.ts        ← async worker queue, spawn, timeout, concurrency
    └── reporter.ts      ← TTY-aware spinner, pass/fail summary, JSON report
```
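
As a sketch of the runner's worker-queue approach (the helper and its signature are hypothetical; the real `runner.ts` also spawns the agent process and enforces per-case timeouts):

```typescript
// Hedged sketch of a bounded worker queue, in the spirit of runner.ts.
// runCase is a hypothetical stand-in for spawning the agent and grading output.
async function runWithConcurrency<T, R>(
  items: T[],
  limit: number,
  runCase: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // single-threaded event loop, so this claim is race-free
      results[i] = await runCase(items[i]);
    }
  }

  // Start up to `limit` workers; each pulls the next case as it finishes one.
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```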

## Tests

All tests pass (`pnpm test`).

- `graders.test.ts`: all grader types, case sensitivity, edge cases
- `loader.test.ts`: valid and invalid suite structures, every validation branch, defaults, error messages
- `runner.test.ts`: real agent scripts spawned in temp dirs, timeout handling, tag/filter filtering, concurrency, progress callbacks

This PR also fixes race conditions in temporary-directory creation that were causing flaky filesystem tests.
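
The fix follows the standard unique-temp-dir pattern, roughly like this (a sketch, not the literal test code):

```typescript
import { mkdtemp, rm } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Each test creates its own unique directory, so parallel tests can't
// collide on a shared fixed path.
async function withTempDir(run: (dir: string) => Promise<void>): Promise<void> {
  const dir = await mkdtemp(join(tmpdir(), "castari-eval-"));
  try {
    await run(dir);
  } finally {
    await rm(dir, { recursive: true, force: true });
  }
}
```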


## `llm-judge` Grader (Optional)

For cases where string matching isn't enough, `llm-judge` scores the agent's response 0–1 against a plain-English rubric:

```json
{
  "type": "llm-judge",
  "rubric": "The agent should decline to answer and not fabricate numbers.",
  "threshold": 0.8
}
```

The grader is provider-agnostic: it accepts any OpenAI-compatible endpoint, so users aren't locked into a single paid provider. The default is Groq (free tier available):

```bash
# Get a free key at https://console.groq.com
CASTARI_API_KEY=your-groq-key-here

# Optional overrides
CASTARI_LLM_BASE_URL=https://api.groq.com/openai/v1   # default
CASTARI_LLM_MODEL=llama-3.1-8b-instant                # default
```

To use Ollama locally instead (completely free, no key needed):

```bash
CASTARI_LLM_BASE_URL=http://localhost:11434/v1
CASTARI_LLM_MODEL=llama3
```

The four deterministic graders (`exact`, `contains`, `not-contains`, `regex`) never require a key.
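
For illustration, a judge call against an OpenAI-compatible endpoint might look roughly like this (a hedged sketch: the prompt wording and score parsing are assumptions; only the env variable names come from this PR):

```typescript
// Hedged sketch: score a response 0-1 against a rubric via any
// OpenAI-compatible /chat/completions endpoint (Groq, Ollama, etc.).
async function judge(response: string, rubric: string): Promise<number> {
  const baseUrl =
    process.env.CASTARI_LLM_BASE_URL ?? "https://api.groq.com/openai/v1";
  const model = process.env.CASTARI_LLM_MODEL ?? "llama-3.1-8b-instant";

  const res = await fetch(`${baseUrl}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.CASTARI_API_KEY ?? ""}`,
    },
    body: JSON.stringify({
      model,
      messages: [
        {
          role: "user",
          content:
            `Rubric: ${rubric}\n\nResponse: ${response}\n\n` +
            `Reply with only a score between 0 and 1.`,
        },
      ],
    }),
  });

  const data = await res.json();
  // Assumption: the judge model replies with a bare numeric score.
  return parseFloat(data.choices[0].message.content);
}
```

A case then passes when the returned score meets the configured `threshold` (e.g. `0.8` above).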


## No Breaking Changes

This change is purely additive. The only existing file modified is `src/index.ts`, which now registers the new commands.
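
For context, the `src/index.ts` change amounts to wiring the new commands into the CLI. Assuming a commander-style entry point (hypothetical names; `registerEvalCommands` is illustrative, not a confirmed export):

```typescript
// Hedged sketch, assuming a commander-style CLI entry point.
// registerEvalCommands is a hypothetical helper from commands/eval.ts.
import { Command } from "commander";
import { registerEvalCommands } from "./commands/eval";

const program = new Command("cast");
registerEvalCommands(program); // adds `cast eval` and `cast eval:init`
program.parse(process.argv);
```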


## How to Verify

1. `pnpm test`: all tests pass
2. `pnpm build && cd packages/cli && pnpm link --global`
3. `cast eval:init`: generates `castari.eval.json` in the current directory
4. `cast eval`: runs the suite against your agent with formatted output
5. `cast eval --ci`: confirms exit code 1 on failure (for CI pipeline testing)
