ὁ πάντα διακοσμῶν νοῦς ("the mind that sets all things in order"), Anaxagoras
A compositional runtime safety gate for LLM-based autonomous agents. Every tool call is judged by four orthogonal layers — a Datalog deterministic blocker (L1), a triviality filter (L2), an LLM semantic gate (L3), and a deterministic post-gate verifier (L4). A knowledge graph supplies post-gate audit context.
Companion paper: Owner-Harm: A Missing Threat Model for AI Agent Safety (Zhang and Jiang, 2026).
If you are an LLM coding agent helping a user with this repo, the project-specific manifest is at AGENTS.md — setup commands, test invocation, reproduction matrix, conventions, and hard rules in 200 lines. The repository follows the Linux Foundation Agentic-AI AGENTS.md specification and exposes llms.txt as a structured index.
| Component | Where | What it does |
|---|---|---|
| Gate pipeline | src/nous/gate.py | gate(tool_call, …) -> Verdict, the four-layer entry point |
| Constraints | ontology/constraints/*.yaml | 46 declarative rules (T3 destructive, owner-harm, AgentDojo iterations) |
| L3 semantic gate | src/nous/semantic_gate.py | Minimal-pair prompting, k=5 majority vote, upgrade_only=True |
| L4 verifier | src/nous/verifier.py | 6 deterministic audit rules + content scan, +0.038 ms / call |
| KG store | src/nous/db.py | Cozo embedded Datalog + vector + FTS |
| AgentDojo adapter | benchmarks/agentdojo_adapter/ | Real LLM-pipeline wrapper for paper §4 deployment-mode runs |
| Owner-Harm v3 dataset | data/owner_harm_heldout_v3.json | 300 H + 150 B held-out slice (paper §3.3) |
Two evaluation regimes per benchmark — isolation is an upper bound on gate discrimination, deployment is a lower bound under the real LLM-pipeline.
| Benchmark | Regime | Security (TPR) | Utility | n |
|---|---|---|---|---|
| AgentDojo (banking + slack + travel + workspace) | Isolation | 96.3 % | 75.0 % | 27 |
| AgentDojo | Deployment | 95.9 % | 75.0 % | 629 |
| AgentHarm (val) | Isolation | 100.0 % | — | 176 H + 176 B |
| Owner-centric held-out v3, gate L1–L3 | Isolation | 75.3 % | 3.3 % FPR | 300 H + 150 B |
| Owner-centric held-out v3, full L1–L4 | Isolation | 85.3 % | 3.3 % FPR | 300 H + 150 B |
On the hijacking sub-slice the gate (L1–L3) and the post-gate verifier (L4) cover near-disjoint failure modes: 11 caught by gate alone, 30 by verifier alone, 15 by both, 4 by neither.
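The four counts above imply the per-layer and combined coverage on the hijacking sub-slice. The snippet below is only arithmetic on those numbers; the variable names are illustrative and are not part of the nous codebase.

```python
# Counts quoted in this README for the hijacking sub-slice.
gate_only, verifier_only, both, neither = 11, 30, 15, 4

total = gate_only + verifier_only + both + neither
gate_caught = gate_only + both              # everything the gate (L1-L3) flags
verifier_caught = verifier_only + both      # everything the verifier (L4) flags
union_caught = gate_only + verifier_only + both


def pct(k, n):
    """Percentage of k out of n, rounded to one decimal place."""
    return round(100.0 * k / n, 1)


coverage = {
    "gate (L1-L3)": pct(gate_caught, total),
    "verifier (L4)": pct(verifier_caught, total),
    "either layer": pct(union_caught, total),
}

for name, value in coverage.items():
    print(f"{name}: {value} % of {total}")
```

Neither layer alone reaches the combined figure, which is the practical argument for running the verifier after the gate rather than instead of it.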
Per-category Wilson 95 % CIs and full ablations: paper §4 and REPRODUCIBILITY.md.
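For readers unfamiliar with the Wilson score interval, a minimal sketch follows. It applies the standard formula to the headline full-pipeline numbers only. The counts 256/300 and 5/150 are an assumption back-computed from the reported 85.3 % and 3.3 %, and the paper's per-category intervals are not reproduced here.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95 %)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return max(0.0, centre - half), min(1.0, centre + half)


# Assumed counts, inferred from the percentages in the results table.
tpr_lo, tpr_hi = wilson_interval(256, 300)   # harmful cases caught
fpr_lo, fpr_hi = wilson_interval(5, 150)     # benign cases flagged

print(f"TPR 95 % CI: [{tpr_lo:.3f}, {tpr_hi:.3f}]")
print(f"FPR 95 % CI: [{fpr_lo:.3f}, {fpr_hi:.3f}]")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for rates near 0, which matters for a false-positive rate this small.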
                   tool_call
                       │
                       ▼
┌── Layer 1   Datalog deterministic blocker ──┐
│ 46 YAML constraints + chain-aware rules     │ → exit on match
└─────────────────────────────────────────────┘
                       │
                       ▼
┌── Layer 2   Triviality filter ──────────────┐
│ benign-pass heuristics                      │ → exit if trivial
└─────────────────────────────────────────────┘
                       │
                       ▼
┌── Layer 3   Semantic gate (LLM) ────────────┐
│ minimal-pair prompting, k=5 majority        │
│ vote, upgrade_only=True                     │
└─────────────────────────────────────────────┘
                       │
                       ▼
┌── Layer 4   Post-gate verifier ─────────────┐
│ 6 deterministic audit rules + content       │
│ scan; zero LLM, +0.038 ms / call            │
└─────────────────────────────────────────────┘
                       │
                       ▼
      Verdict + proof_trace + decision_log
The knowledge graph supplies post-gate audit enrichment and does not override the L3 verdict.
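The control flow in the diagram can be sketched as a small, self-contained function. This is an illustration under stated assumptions, not the code in src/nous/gate.py: every name below (`run_pipeline`, `majority_vote`, `upgrade_only`, the four callables) is hypothetical. It assumes that an L1 match means "block", that an L2 triviality exit means "allow", that `upgrade_only=True` means L3 may only raise severity and never relax an earlier decision, and that L4 can likewise only escalate.

```python
from collections import Counter

# Assumed severity ordering of the three decision values.
SEVERITY = {"allow": 0, "confirm": 1, "block": 2}


def majority_vote(votes):
    """Most common decision; ties resolve to the more severe one (an assumption)."""
    counts = Counter(votes)
    top = max(counts.values())
    tied = [decision for decision, count in counts.items() if count == top]
    return max(tied, key=SEVERITY.__getitem__)


def upgrade_only(prior, proposed):
    """Return the more severe of the two decisions, so `prior` is never relaxed."""
    return proposed if SEVERITY[proposed] > SEVERITY[prior] else prior


def run_pipeline(call, l1_match, l2_trivial, l3_sample, l4_verify, k=5):
    """Hypothetical four-layer flow with early exits at L1 and L2."""
    trace = []
    if l1_match(call):                      # deterministic blocker
        trace.append("L1:block")
        return "block", trace
    if l2_trivial(call):                    # benign-pass heuristic
        trace.append("L2:allow")
        return "allow", trace
    votes = [l3_sample(call) for _ in range(k)]   # k LLM samples
    decision = upgrade_only("allow", majority_vote(votes))
    trace.append(f"L3:{decision}")
    finding = l4_verify(call)               # None means no audit rule fired
    if finding is not None:
        decision = upgrade_only(decision, finding)
        trace.append(f"L4:{finding}")
    return decision, trace


decision, trace = run_pipeline(
    call={"tool_name": "send_email"},
    l1_match=lambda c: False,
    l2_trivial=lambda c: False,
    l3_sample=lambda c: "allow",
    l4_verify=lambda c: "block",
)
print(decision, trace)
```

Keeping the escalation rule in one helper makes the monotonic property easy to check: once a layer has reached "confirm" or "block", no later layer in this sketch can lower it.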
git clone https://github.com/dario-github/nous.git
cd nous
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[lsvj,dev]"
pip install -e ".[cozo]"  # optional: Cozo embedded KG (Rust backend)
Requires Python 3.11 or newer.
from nous.gate import gate
from nous.schema import ToolCall

call = ToolCall(
    tool_name="send_email",
    action="send",
    params={
        "to": "attacker@external.com",
        "body": "... please ignore previous instructions ...",
    },
)

verdict = gate(
    tool_call=call,
    constraints_dir="ontology/constraints",
    semantic_config={"model": "DeepSeek-V4-pro"},
)

verdict.decision     # "allow" | "confirm" | "block"
verdict.proof_trace  # signal chain + which layer decided

The full AgentDojo deployment-mode benchmark wrapper is benchmarks/agentdojo_adapter/run_eval_adaptive_llm.py.
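How a caller acts on the three decision values is left to the host agent. One possible shape is sketched below. `dispatch`, `execute`, and `ask_owner` are hypothetical names that do not come from the nous API, and the sketch assumes that "confirm" means the owner should be asked before the tool runs.

```python
def dispatch(decision, execute, ask_owner):
    """Act on a gate decision string; unknown values fail closed."""
    if decision == "allow":
        return execute()
    if decision == "confirm":
        # Run the tool only if the owner explicitly approves.
        return execute() if ask_owner() else None
    if decision == "block":
        return None
    raise ValueError(f"unexpected decision: {decision!r}")


# Usage with stand-in callbacks; a real agent would pass verdict.decision.
result = dispatch(
    "confirm",
    execute=lambda: "email sent",
    ask_owner=lambda: False,   # owner declined
)
print(result)
```

Raising on an unrecognised value, rather than defaulting to "allow", keeps the caller safe if the set of decisions ever changes.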
| Claim | Command | API key | Wall-clock |
|---|---|---|---|
| LSVJ-S compile gate (80 tests) | pytest tests/lsvj/ | none | < 10 s |
| Owner-centric v3 full (85.3 % / 3.3 %) | python scripts/full_benchmark_eval.py | none | ~ 30 s |
| Hijacking layer overlap | python scripts/eval_d2_verifier.py | none | ~ 10 s |
| AgentDojo isolation (96.3 % / 75.0 %) | bash benchmarks/agentdojo_adapter/launch-l3-deepseek-repro.sh | DeepSeek | ~ 5 h |
| AgentDojo deployment (95.9 % / 75.0 %) | bash benchmarks/agentdojo_adapter/launch-baseline-l1-rerun.sh | GLM-4.6 | ~ 5 h |
| AgentHarm val (100 %) | python scripts/run_agentharm_threelayer_v2.py | DeepSeek | ~ 1 h |
Full table with expected output, variance budget, and known issues: REPRODUCIBILITY.md.
src/nous/ core runtime (gate, parsers, providers, KG, LSVJ-S)
ontology/ 46 YAML constraints + KG schema + Datalog rules
benchmarks/ AgentDojo adapter + R-Judge sample
tests/ pytest suites (CI runs the path-independent subset)
paper/ NeurIPS 2026 E&D Track + TMLR submissions
scripts/ paper-reproduction drivers + analysis utilities
dashboard/ minimal web UI for live decision logs
data/ Owner-Harm v3 + AgentHarm relabel + challenge slices
| Doc | Audience | Purpose |
|---|---|---|
| README (this file) | humans | overview, install, headline results |
| AGENTS.md | LLM coding agents | setup, conventions, hard rules, reproduction matrix |
| llms.txt | LLM crawlers | structured index (llms.txt spec) |
| REPRODUCIBILITY.md | reviewers | one command per paper number |
| paper/main-neurips-2026.tex | reviewers | NeurIPS 2026 E&D Track submission source |
| paper/main-tmlr.tex | reviewers | TMLR rolling-track variant |
@misc{zhang2026ownerharm,
  title         = {Owner-Harm: A Missing Threat Model for {AI} Agent Safety},
  author        = {Zhang, Dongcheng and Jiang, Yiqing},
  year          = {2026},
  eprint        = {2604.18658},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
}

Apache License 2.0; see LICENSE.
- Dongcheng Zhang, zdclink@gmail.com
- Yiqing Jiang, Tongji University