
feat: add cast eval command for agent evaluation pipelines #9

Open

Abhiram-Vengala wants to merge 1 commit into castari:main from Abhiram-Vengala:feat/cast-eval

Conversation

@Abhiram-Vengala

## Summary

This PR introduces `cast eval`, an evaluation pipeline that lets developers run a test suite against their agent before deploying and gate CI on the results. It also adds `cast eval:init` to scaffold a starter `castari.eval.json` in the current directory.

Evaluation pipelines are now a standard layer in AI agent frameworks. Without one, there's no way to catch regressions between deploys: an agent that worked yesterday could silently break today.


## New Commands

| Command | Description |
| --- | --- |
| `cast eval` | Runs the evaluation suite against your agent |
| `cast eval:init` | Scaffolds a `castari.eval.json` starter file |

## How It Works

Define test cases in `castari.eval.json` at your project root:

```json
{
  "name": "My Agent Suite",
  "cases": [
    {
      "name": "returns valid Python",
      "input": "write a function that adds two numbers",
      "assert": [
        { "type": "contains", "expected": "def " },
        { "type": "regex", "pattern": "def \\w+\\(" },
        { "type": "not-contains", "expected": "SyntaxError" }
      ]
    }
  ]
}
```

Then run:

```bash
cast eval                          # interactive, with spinner
cast eval --ci                     # CI mode, exits 1 on any failure
cast eval --tag smoke              # run only cases tagged "smoke"
cast eval --filter "greeting"      # run cases matching name substring
cast eval --concurrency 4          # run 4 cases in parallel
cast eval --output report.json     # write structured JSON report
```
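
For reference, here is a rough sketch of the suite shape the config above implies. This is illustrative, not the literal contents of `types.ts`; the `tags` and `timeoutMs` fields are assumptions inferred from the `--tag` flag and the runner's timeout handling:

```typescript
// Hedged sketch of the suite shape implied by castari.eval.json above.
// tags and timeoutMs are assumptions, not confirmed field names.
interface EvalSuite {
  name: string;
  cases: EvalCase[];
}

interface EvalCase {
  name: string;          // shown in the pass/fail summary
  input: string;         // prompt sent to the agent
  assert: Assertion[];   // all assertions must pass for the case to pass
  tags?: string[];       // assumption: matched by `cast eval --tag`
  timeoutMs?: number;    // assumption: per-case timeout enforced by the runner
}

type Assertion =
  | { type: "exact"; expected: string }
  | { type: "contains"; expected: string }
  | { type: "not-contains"; expected: string }
  | { type: "regex"; pattern: string }
  | { type: "llm-judge"; rubric: string; threshold: number };
```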

## Grader Types

All four graders are deterministic and require no API keys:

| Type | Description |
| --- | --- |
| `exact` | Response must equal the expected string |
| `contains` | Response must contain the expected string |
| `not-contains` | Response must NOT contain the expected string |
| `regex` | Response must match the pattern |

Plus an optional `llm-judge` grader for semantic scoring (see below).
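
To make the grading semantics concrete, here is a minimal sketch of how the four deterministic graders could be implemented (illustrative only; the actual logic lives in `packages/cli/src/eval/graders.ts`):

```typescript
// Hedged sketch of the deterministic graders; the assertion type is
// re-declared here so the example is self-contained.
type DeterministicAssertion =
  | { type: "exact"; expected: string }
  | { type: "contains"; expected: string }
  | { type: "not-contains"; expected: string }
  | { type: "regex"; pattern: string };

function grade(response: string, assertion: DeterministicAssertion): boolean {
  switch (assertion.type) {
    case "exact":
      return response === assertion.expected;
    case "contains":
      return response.includes(assertion.expected);
    case "not-contains":
      return !response.includes(assertion.expected);
    case "regex":
      // Patterns arrive via JSON, so backslashes are escaped there (e.g. "\\w+").
      return new RegExp(assertion.pattern).test(response);
  }
}
```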


## Files Added

```text
packages/cli/src/
├── commands/
│   └── eval.ts          ← cast eval + cast eval:init commands
└── eval/
    ├── types.ts         ← shared TypeScript types for test suites
    ├── graders.ts       ← grading logic (exact, contains, regex, llm-judge)
    ├── loader.ts        ← config loading, validation, default-filling
    ├── runner.ts        ← async worker queue, spawn, timeout, concurrency
    └── reporter.ts      ← TTY-aware spinner, pass/fail summary, JSON report
```
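
As a sketch of the runner's worker-queue approach (the helper and its signature are hypothetical; the real `runner.ts` also spawns the agent process and enforces per-case timeouts):

```typescript
// Hedged sketch of a bounded worker queue, in the spirit of runner.ts.
// runCase is a hypothetical stand-in for spawning the agent and grading output.
async function runWithConcurrency<T, R>(
  items: T[],
  limit: number,
  runCase: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // single-threaded event loop, so this claim is race-free
      results[i] = await runCase(items[i]);
    }
  }

  // Start up to `limit` workers; each pulls the next case as it finishes one.
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```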

## Tests

All tests pass (`pnpm test`).

- `graders.test.ts`: all grader types, case sensitivity, edge cases
- `loader.test.ts`: valid and invalid suite structures, every validation branch, defaults, error messages
- `runner.test.ts`: real agent scripts spawned in temp dirs, timeout handling, tag/filter filtering, concurrency, progress callbacks

This PR also fixes race conditions in temporary-directory creation that were causing flaky filesystem tests.
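
The fix follows the standard unique-temp-dir pattern, roughly like this (a sketch, not the literal test code):

```typescript
import { mkdtemp, rm } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Each test creates its own unique directory, so parallel tests can't
// collide on a shared fixed path.
async function withTempDir(run: (dir: string) => Promise<void>): Promise<void> {
  const dir = await mkdtemp(join(tmpdir(), "castari-eval-"));
  try {
    await run(dir);
  } finally {
    await rm(dir, { recursive: true, force: true });
  }
}
```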


## `llm-judge` Grader (Optional)

For cases where string matching isn't enough, `llm-judge` scores the agent's response 0–1 against a plain-English rubric:

```json
{
  "type": "llm-judge",
  "rubric": "The agent should decline to answer and not fabricate numbers.",
  "threshold": 0.8
}
```

The grader is provider-agnostic: it accepts any OpenAI-compatible endpoint, so users aren't locked into a single paid provider. The default is Groq (free tier available):

```bash
# Get a free key at https://console.groq.com
CASTARI_API_KEY=your-groq-key-here

# Optional overrides
CASTARI_LLM_BASE_URL=https://api.groq.com/openai/v1   # default
CASTARI_LLM_MODEL=llama-3.1-8b-instant                # default
```

To use Ollama locally instead (completely free, no key needed):

```bash
CASTARI_LLM_BASE_URL=http://localhost:11434/v1
CASTARI_LLM_MODEL=llama3
```

The four deterministic graders (`exact`, `contains`, `not-contains`, `regex`) never require a key.
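
For illustration, a judge call against an OpenAI-compatible endpoint might look roughly like this (a hedged sketch: the prompt wording and score parsing are assumptions; only the env variable names come from this PR):

```typescript
// Hedged sketch: score a response 0-1 against a rubric via any
// OpenAI-compatible /chat/completions endpoint (Groq, Ollama, etc.).
async function judge(response: string, rubric: string): Promise<number> {
  const baseUrl =
    process.env.CASTARI_LLM_BASE_URL ?? "https://api.groq.com/openai/v1";
  const model = process.env.CASTARI_LLM_MODEL ?? "llama-3.1-8b-instant";

  const res = await fetch(`${baseUrl}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.CASTARI_API_KEY ?? ""}`,
    },
    body: JSON.stringify({
      model,
      messages: [
        {
          role: "user",
          content:
            `Rubric: ${rubric}\n\nResponse: ${response}\n\n` +
            `Reply with only a score between 0 and 1.`,
        },
      ],
    }),
  });

  const data = await res.json();
  // Assumption: the judge model replies with a bare numeric score.
  return parseFloat(data.choices[0].message.content);
}
```

A case then passes when the returned score meets the configured `threshold` (e.g. `0.8` above).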


## No Breaking Changes

This change is purely additive. The only existing file modified is `src/index.ts`, which now registers the new commands.
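
For context, the `src/index.ts` change amounts to wiring the new commands into the CLI. Assuming a commander-style entry point (hypothetical names; `registerEvalCommands` is illustrative, not a confirmed export):

```typescript
// Hedged sketch, assuming a commander-style CLI entry point.
// registerEvalCommands is a hypothetical helper from commands/eval.ts.
import { Command } from "commander";
import { registerEvalCommands } from "./commands/eval";

const program = new Command("cast");
registerEvalCommands(program); // adds `cast eval` and `cast eval:init`
program.parse(process.argv);
```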


## How to Verify

1. `pnpm test`: all tests pass
2. `pnpm build && cd packages/cli && pnpm link --global`
3. `cast eval:init`: generates `castari.eval.json` in the current directory
4. `cast eval`: runs the suite against your agent with formatted output
5. `cast eval --ci`: confirms exit code 1 on failure (for CI pipeline testing)
