
GPU Kill - Detailed Documentation

Table of Contents

  • Architecture
  • Installation
  • Command Reference
  • Enhanced Features
  • Audit System
  • Suspicious Usage Detection
  • Cluster Management
  • Remote Operations
  • Dashboard
  • MCP Server
  • Output Formats
  • Configuration
  • Safety Features
  • Exit Codes
  • Troubleshooting
  • Guard Mode
  • Development
  • Acknowledgments

Architecture

GPU Kill is built with Rust and supports NVIDIA, AMD, Intel, and Apple Silicon GPUs through vendor-specific interfaces. The tool is designed with safety, usability, multi-vendor support, and distributed cluster management in mind.

Core Components

  • CLI Parser: Uses clap for robust argument parsing and validation
  • Vendor Abstraction: Multi-vendor GPU support (NVIDIA, AMD, Intel, Apple Silicon)
  • NVML Wrapper: Interfaces with NVIDIA's management library
  • ROCm Interface: AMD GPU management via rocm-smi
  • Intel GPU Tools: Intel GPU management via intel_gpu_top
  • Apple Silicon Interface: Apple Silicon GPU management via system_profiler and system APIs
  • Enhanced Process Manager: Advanced process filtering and batch operations
  • Container Detection: Container-aware process management
  • Audit System: Automatic GPU usage tracking and historical analysis
  • Renderer: Formats output as tables or JSON
  • Configuration: Supports file and environment-based configuration
  • Coordinator API: RESTful API server for cluster management
  • SSH Remote Manager: Secure remote GPU management via SSH
  • MCP Server: Model Context Protocol server for AI assistant integration

Dependencies

  • clap: Command-line argument parsing
  • nvml-wrapper: NVIDIA GPU management
  • sysinfo: System process information
  • regex: Process filtering with regex patterns
  • tabled: Table formatting
  • serde: JSON serialization
  • serde_json: JSON parsing and generation
  • chrono: Date and time handling for audit timestamps
  • dirs: Cross-platform data directory management
  • tracing: Structured logging
  • color-eyre: Error handling
  • axum: HTTP server framework for coordinator API
  • tower: HTTP middleware and services
  • tower-http: HTTP middleware (CORS, tracing)
  • uuid: Unique identifier generation for nodes
  • jsonrpc-core: JSON-RPC protocol implementation for MCP server
  • jsonrpc-ws-server: WebSocket JSON-RPC server for MCP
  • futures-util: Async utilities for WebSocket handling
  • tokio: Async runtime for HTTP server and WebSocket

Installation

Prerequisites

  • NVIDIA GPU Support:
    • NVIDIA GPU with supported drivers
    • NVIDIA Management Library (NVML) - included with NVIDIA drivers
  • AMD GPU Support:
    • AMD GPU with ROCm drivers installed
    • rocm-smi command-line tool available
  • Intel GPU Support:
    • Intel GPU with intel-gpu-tools package installed
    • intel_gpu_top command-line tool available
  • Apple Silicon GPU Support:
    • macOS with Apple Silicon (M1, M2, M3, M4)
    • system_profiler command-line tool available
  • General Requirements:
    • Rust 1.70+ (for building from source)
    • Linux, macOS, or Windows

Install from Cargo

cargo install gpukill

Build from Source

git clone https://github.com/treadiehq/gpu-kill.git
cd gpu-kill
cargo build --release

Cross-compilation

The project supports cross-compilation for different platforms:

# For Linux from macOS
cargo build --release --target x86_64-unknown-linux-gnu

# For Windows from Linux
cargo build --release --target x86_64-pc-windows-gnu
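
Targets must be installed before cross-building; a minimal sketch using rustup (a cross linker such as MinGW-w64 may also be needed for the Windows target):

# Install the cross-compilation targets once
rustup target add x86_64-unknown-linux-gnu
rustup target add x86_64-pc-windows-gnu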

Command Reference

Global Options

| Option | Description | Default |
|--------|-------------|---------|
| --log-level <LEVEL> | Set logging level | info |
| --config <PATH> | Configuration file path | None |
| --remote <HOST> | Remote host to connect to via SSH | None |
| --ssh-user <USER> | SSH username (requires --remote) | Current user |
| --ssh-port <PORT> | SSH port (requires --remote) | 22 |
| --ssh-key <PATH> | SSH private key path (requires --remote) | None |
| --ssh-password <PASSWORD> | SSH password (requires --remote) | Interactive prompt |
| --ssh-timeout <SECONDS> | SSH connection timeout (requires --remote) | 30 |
| --register-node <URL> | Register this node with a coordinator | None |
| --help | Show help information | - |
| --version | Show version information | - |

List Operation

gpukill --list [OPTIONS]

Options:

  • --details: Show detailed per-process information
  • --watch: Refresh output every 2 seconds until Ctrl-C
  • --output <FORMAT>: Output format (table or json)
  • --vendor <VENDOR>: Filter by GPU vendor (nvidia, amd, intel, apple, all)

Examples:

# Basic listing
gpukill --list

# With process details
gpukill --list --details

# Watch mode
gpukill --list --watch

# JSON output
gpukill --list --output json

# Combined options
gpukill --list --details --watch --output json

# Filter by vendor
gpukill --list --vendor nvidia
gpukill --list --vendor amd --details
gpukill --list --vendor apple --watch
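
The JSON output pairs well with standard tooling; a small sketch using jq, assuming the field names documented under Output Formats below:

# Print each GPU's name and utilization from the JSON snapshot
gpukill --list --output json | jq -r '.gpus[] | "\(.name): \(.util_pct)%"'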

Kill Operation

gpukill --kill (--pid <PID> | --filter <PATTERN>) [OPTIONS]

Required (one of):

  • --pid <PID>: Process ID to terminate
  • --filter <PATTERN>: Filter processes by name pattern (supports regex)

Options:

  • --timeout-secs <SECONDS>: Timeout before escalation (default: 5)
  • --force: Escalate to SIGKILL after timeout
  • --batch: Kill multiple processes matching the filter (requires --filter)

Examples:

# Graceful termination of a single process
gpukill --kill --pid 12345

# Custom timeout for a single process
gpukill --kill --pid 12345 --timeout-secs 10

# Force escalation for a single process
gpukill --kill --pid 12345 --force

# Kill processes matching a pattern
gpukill --kill --filter "python.*"

# Batch kill all processes matching a pattern
gpukill --kill --filter "python.*" --batch --force
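
Because gpukill returns a non-zero exit code on failure (see Exit Codes below), scripts can try a graceful kill first and escalate only if needed; a minimal sketch:

# Attempt a graceful kill; escalate to SIGKILL only if it fails
if ! gpukill --kill --pid 12345 --timeout-secs 10; then
  echo "graceful kill failed, escalating" >&2
  gpukill --kill --pid 12345 --force
fi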

Reset Operation

gpukill --reset [--gpu <ID> | --all] [OPTIONS]

Required (one of):

  • --gpu <ID>: Specific GPU ID to reset
  • --all: Reset all GPUs

Options:

  • --force: Force reset even with active processes

Examples:

# Reset specific GPU
gpukill --reset --gpu 0

# Reset all GPUs
gpukill --reset --all

# Force reset
gpukill --reset --gpu 0 --force

Audit Operation

gpukill --audit [OPTIONS]

Options:

  • --audit-user <USER>: Filter by specific user
  • --audit-process <PATTERN>: Filter by process name pattern
  • --audit-hours <HOURS>: Show records from last N hours (default: 24)
  • --audit-summary: Show summary statistics instead of detailed records

Suspicious Usage Detection

gpukill --audit --rogue [OPTIONS]

Detection Options:

  • --rogue: Perform rogue activity detection
  • --rogue-config: Show current detection configuration
  • --rogue-memory-threshold <GB>: Set memory usage threshold
  • --rogue-utilization-threshold <PERCENT>: Set GPU utilization threshold
  • --rogue-duration-threshold <HOURS>: Set process duration threshold
  • --rogue-confidence-threshold <CONFIDENCE>: Set minimum confidence for detection

Whitelist Management:

  • --rogue-whitelist-process <NAME>: Add process to whitelist
  • --rogue-unwhitelist-process <NAME>: Remove process from whitelist
  • --rogue-whitelist-user <USERNAME>: Add user to whitelist
  • --rogue-unwhitelist-user <USERNAME>: Remove user from whitelist

Configuration Management:

  • --rogue-export-config: Export configuration to JSON
  • --rogue-import-config <FILE>: Import configuration from JSON file

Examples:

# Detect suspicious activity
gpukill --audit --rogue --audit-hours 48

# View current configuration
gpukill --audit --rogue-config

# Update detection thresholds
gpukill --audit --rogue-memory-threshold 15.0 --rogue-utilization-threshold 90.0

# Manage whitelists
gpukill --audit --rogue-whitelist-process "my-app"
gpukill --audit --rogue-whitelist-user "developer"

# Export/import configuration
gpukill --audit --rogue-export-config > config.json
gpukill --audit --rogue-import-config config.json

Server Operation

gpukill --server [OPTIONS]

Options:

  • --server-port <PORT>: Port for coordinator API (default: 8080)
  • --server-host <HOST>: Host to bind coordinator API (default: 0.0.0.0)

Description: Starts the GPU Kill coordinator server that provides:

  • RESTful API for cluster management
  • WebSocket server for real-time updates
  • Node registration and heartbeat management
  • Magic Moment contention analysis

Examples:

# Start coordinator on default port 8080
gpukill --server

# Start coordinator on custom port
gpukill --server --server-port 9000

# Start coordinator on all interfaces
gpukill --server --server-host 0.0.0.0

Node Registration Operation

gpukill --register-node <COORDINATOR_URL>

Description: Registers this node with a coordinator server for cluster management:

  • Generates unique node ID
  • Sends periodic GPU snapshots to coordinator
  • Maintains heartbeat for health monitoring
  • Enables cluster-wide monitoring and management

Examples:

# Register with default coordinator
gpukill --register-node http://coordinator:8080

# Register with custom coordinator
gpukill --register-node http://gpu-cluster:9000

# Register with HTTPS coordinator
gpukill --register-node https://secure-cluster:8443
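
Registration keeps running to send snapshots and heartbeats, so on a long-lived node it is typically backgrounded or run under a process supervisor; a minimal bash sketch, assuming the log path is writable:

# Run the node agent in the background and capture its output
nohup gpukill --register-node http://coordinator:8080 \
  > /var/log/gpukill-node.log 2>&1 &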

Enhanced Features

Multi-Vendor Support

gpukill automatically detects and utilizes available GPU vendors (NVIDIA, AMD, Intel, and Apple Silicon). You can also filter the displayed information by vendor.

Options:

  • --vendor <VENDOR>: Filter by GPU vendor.
    • nvidia: Show only NVIDIA GPUs.
    • amd: Show only AMD GPUs.
    • intel: Show only Intel GPUs.
    • apple: Show only Apple Silicon GPUs.
    • all: Show all detected GPUs (default if --vendor is not specified).

Examples:

# List only NVIDIA GPUs
gpukill --list --vendor nvidia

# List only AMD GPUs with details
gpukill --list --vendor amd --details

# List only Intel GPUs
gpukill --list --vendor intel

# Monitor all GPUs in watch mode
gpukill --list --vendor all --watch

Vendor Detection:

  • NVIDIA: Automatically detected if NVML is available
  • AMD: Automatically detected if rocm-smi is available
  • Intel: Automatically detected if intel_gpu_top is available
  • Apple Silicon: Automatically detected if running on macOS with Apple Silicon
  • Mixed Systems: Supports systems with multiple GPU vendors

Advanced Process Filtering

The --kill command now supports filtering processes by name using regular expressions, enabling powerful batch operations.

Options:

  • --filter <PATTERN>: A regular expression pattern to match against process names.
  • --batch: When used with --filter, all matching processes will be targeted for termination. Without --batch, gpukill will list matching processes and warn you to use --batch to proceed with killing.

Examples:

# List processes matching "python" (case-sensitive)
gpukill --list --details --filter "python"

# Kill all processes whose names start with "tensor"
gpukill --kill --filter "^tensor" --batch --force

# Find and kill all processes related to "jupyter"
gpukill --kill --filter "jupyter" --batch

Container-Aware Process Detection

gpukill can now attempt to identify if a process is running within a container environment.

Options:

  • --containers: When used with --list, an additional column or field will indicate if a process is running in a container (e.g., Docker, LXC, Kubernetes).

Examples:

# List GPUs and show container info for processes
gpukill --list --details --containers

# Watch containerized processes on NVIDIA GPUs
gpukill --list --watch --containers --vendor nvidia

Apple Silicon Specific Features

Apple Silicon GPUs use unified memory architecture, which provides unique capabilities:

# Monitor Apple Silicon GPU
gpukill --list --vendor apple

# Monitor with details
gpukill --list --vendor apple --details

# Watch Apple Silicon GPU usage
gpukill --list --vendor apple --watch

Apple Silicon Characteristics:

  • Unified Memory: GPU and CPU share the same memory pool
  • Memory Estimation: GPU memory usage is estimated from active system memory
  • Process Detection: Identifies Metal, OpenGL, and ML framework processes
  • No Temperature/Power: These metrics are not available via system APIs
  • No Reset Support: GPU reset requires kernel-level operations

Audit System

The audit system automatically tracks GPU usage history whenever you run gpukill --list. This provides valuable insights into GPU utilization patterns, resource planning, and troubleshooting.

How It Works

Automatic Data Collection:

  • Every gpukill --list command automatically logs GPU usage data
  • Data is stored in JSON Lines format for easy processing (a jq query sketch follows below)
  • No additional configuration required - works out of the box

Data Storage:

  • Linux: ~/.local/share/gpukill/audit.jsonl
  • macOS: ~/Library/Application Support/gpukill/audit.jsonl
  • Windows: %APPDATA%\gpukill\audit.jsonl

Data Captured:

  • Timestamp of each GPU check
  • GPU information (index, name, memory usage, utilization, temperature, power)
  • Process information (when processes are using GPU)
  • Container information (when available)
  • User information (when processes are detected)
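
Because each line is an independent JSON object, the log can be queried directly with jq; a sketch using the Linux path and the fields described under Audit Data Structure:

# Show the five records with the highest memory usage (Linux path shown)
jq -s 'sort_by(-.memory_used_mb) | .[:5] | .[] | {gpu_name, memory_used_mb, process_name}' \
  ~/.local/share/gpukill/audit.jsonl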

Audit Commands

Basic Audit Queries:

# Show last 24 hours of GPU usage
gpukill --audit

# Show last 6 hours
gpukill --audit --audit-hours 6

# Show last 3 days
gpukill --audit --audit-hours 72

Filtered Queries:

# Show only specific user's GPU usage
gpukill --audit --audit-user john

# Show only specific process types
gpukill --audit --audit-process python
gpukill --audit --audit-process tensorflow

# Combine filters
gpukill --audit --audit-user alice --audit-process pytorch --audit-hours 12

Summary Reports:

# Get usage summary for last 24 hours
gpukill --audit --audit-summary

# Get summary for last week
gpukill --audit --audit-summary --audit-hours 168

Suspicious Usage Detection

The suspicious usage detection system provides comprehensive security monitoring for GPU resources, detecting crypto miners, suspicious processes, and resource abuse patterns.

Detection Capabilities

Crypto Miner Detection:

  • Identifies known mining software (xmrig, ccminer, ethminer, etc.)
  • Detects mining patterns in process names and behavior
  • Analyzes high GPU utilization and sustained usage
  • Provides confidence-based scoring for mining activity

Suspicious Process Detection:

  • Flags unusual process names and patterns
  • Detects excessive resource usage
  • Identifies processes from unusual users
  • Analyzes process behavior over time

Resource Abuse Detection:

  • Memory hogs consuming excessive GPU memory
  • Long-running processes that may be stuck
  • Excessive GPU utilization patterns
  • Unauthorized access attempts

Risk Assessment:

  • Confidence-based threat scoring (0.0 - 1.0)
  • Risk level classification (Low, Medium, High, Critical)
  • Weighted scoring for different threat types
  • Actionable recommendations for each threat

Configuration System

Configuration File:

  • Location: ~/.config/gpukill/rogue_config.toml
  • Format: TOML with comprehensive detection rules
  • Auto-creation: Default configuration created on first use
  • Version tracking: Metadata includes version and modification timestamps

Detection Thresholds:

[detection]
max_memory_usage_gb = 20.0        # Maximum memory usage threshold
max_utilization_pct = 95.0        # Maximum GPU utilization threshold
max_duration_hours = 24.0         # Maximum process duration threshold
min_confidence_threshold = 0.7    # Minimum confidence for detection

Pattern Matching:

[patterns]
crypto_miner_patterns = ["cuda", "opencl", "miner", "hash"]
suspicious_process_names = ["xmrig", "ccminer", "ethminer"]
user_whitelist = ["root", "admin", "system"]
process_whitelist = ["python", "jupyter", "tensorflow"]

Risk Scoring:

[scoring.threat_weights]
crypto_miner = 0.8
suspicious_process = 0.6
resource_abuser = 0.3
data_exfiltrator = 0.9

[scoring.risk_thresholds]
critical = 0.9
high = 0.7
medium = 0.5
low = 0.3

Detection Examples

Basic Detection:

# Scan for suspicious activity in last 24 hours
gpukill --audit --rogue

# Scan last 48 hours with JSON output
gpukill --audit --rogue --audit-hours 48 --output json

Configuration Management:

# View current configuration
gpukill --audit --rogue-config

# Update thresholds
gpukill --audit --rogue-memory-threshold 15.0
gpukill --audit --rogue-utilization-threshold 90.0

# Manage whitelists
gpukill --audit --rogue-whitelist-process "my-app"
gpukill --audit --rogue-whitelist-user "developer"

Configuration Export/Import:

# Export configuration
gpukill --audit --rogue-export-config > security-config.json

# Import configuration
gpukill --audit --rogue-import-config security-config.json

JSON Output:

# Export audit data as JSON for external processing
gpukill --audit --output json

# Export filtered data
gpukill --audit --audit-user john --output json > john_gpu_usage.json

Dashboard

The suspicious usage detection is fully integrated with the dashboard; see the Kill Suite website.

Audit Data Structure

Each audit record contains:

{
  "id": 1758236888745,
  "timestamp": "2025-09-18T23:08:08.745114Z",
  "gpu_index": 0,
  "gpu_name": "Apple M3 Max",
  "pid": null,
  "user": null,
  "process_name": null,
  "memory_used_mb": 3216,
  "utilization_pct": 0.0,
  "temperature_c": 0,
  "power_w": 0.0,
  "container": null
}

Field Descriptions:

  • id: Unique identifier (timestamp + process ID)
  • timestamp: ISO 8601 timestamp of the measurement
  • gpu_index: GPU device index
  • gpu_name: Human-readable GPU name
  • pid: Process ID (null for GPU-level records)
  • user: Username (null for GPU-level records)
  • process_name: Process name (null for GPU-level records)
  • memory_used_mb: Memory usage in megabytes
  • utilization_pct: GPU utilization percentage
  • temperature_c: GPU temperature in Celsius
  • power_w: GPU power consumption in watts
  • container: Container name (null if not in container)

Use Cases

Resource Planning:

# Analyze peak usage patterns
gpukill --audit --audit-summary --audit-hours 168

# Find heavy users
gpukill --audit --audit-summary | grep "Top Users"

Troubleshooting:

# Check what was running when GPU crashed
gpukill --audit --audit-hours 1

# Find processes that used most memory
gpukill --audit --audit-summary | grep "Top Processes"

Compliance and Billing:

# Generate usage report for specific user
gpukill --audit --audit-user alice --output json > alice_usage.json

# Export all data for analysis
gpukill --audit --output json > gpu_usage_export.json

Performance Analysis:

# Check hourly usage patterns
gpukill --audit --audit-summary --audit-hours 24

# Monitor specific application usage
gpukill --audit --audit-process tensorflow --audit-hours 48

Data Management

File Size:

  • Each audit record is approximately 200-300 bytes
  • 1000 records ≈ 250KB
  • 10,000 records ≈ 2.5MB
  • Automatic cleanup recommended for long-term usage

Retention:

  • No automatic cleanup by default
  • Manual cleanup: Delete old records from audit.jsonl (see the pruning sketch below)
  • Recommended: Keep 30-90 days of data depending on needs
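
A hedged pruning sketch, assuming GNU date and the Linux data path; it rewrites the log keeping only records from the last 90 days:

# Keep only the last 90 days of audit records (GNU date assumed)
cutoff=$(date -u -d '90 days ago' +%Y-%m-%dT%H:%M:%S)
jq -c --arg cutoff "$cutoff" 'select(.timestamp >= $cutoff)' \
  ~/.local/share/gpukill/audit.jsonl > /tmp/audit.pruned.jsonl \
  && mv /tmp/audit.pruned.jsonl ~/.local/share/gpukill/audit.jsonl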

Backup:

# Backup audit data
cp ~/.local/share/gpukill/audit.jsonl gpu_audit_backup.jsonl

# Restore audit data
cp gpu_audit_backup.jsonl ~/.local/share/gpukill/audit.jsonl

Cluster Management

GPU Kill includes a powerful cluster management system that allows you to monitor and manage multiple GPU nodes from a central coordinator.

Coordinator API

The coordinator is a RESTful API server that aggregates data from multiple GPU nodes and provides real-time cluster monitoring.

Starting the Coordinator

# Start coordinator on default port 8080
gpukill --server

# Start on custom port
gpukill --server --server-port 9000

# Start on all interfaces
gpukill --server --server-host 0.0.0.0

API Endpoints

  • GET /api/nodes - List all registered nodes
  • POST /api/nodes/:id/register - Register a new node
  • POST /api/nodes/:id/snapshot - Update node snapshot
  • GET /api/cluster/snapshot - Get cluster-wide snapshot
  • GET /api/cluster/contention - Get GPU contention analysis
  • WS /ws - WebSocket for real-time updates
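
A quick smoke test of these endpoints from the command line, assuming a coordinator running locally on the default port:

# List registered nodes
curl -s http://localhost:8080/api/nodes

# Cluster-wide snapshot and contention analysis
curl -s http://localhost:8080/api/cluster/snapshot
curl -s http://localhost:8080/api/cluster/contention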

Node Registration

Nodes register themselves automatically when started with --register-node <COORDINATOR_URL>. Each node:

  • Generates a unique UUID
  • Reports hostname and IP address
  • Sends periodic snapshots of GPU and process data
  • Maintains heartbeat for health monitoring

Magic Moment Analysis

The "Magic Moment" feature provides instant identification of GPU contention and resource blocking:

  • Blocked GPUs: GPUs with high utilization that are blocking other users
  • Top Users: Users ranked by GPU memory usage and utilization
  • Contention Recommendations: Suggestions for optimizing GPU allocation
  • Real-time Updates: Live updates via WebSocket connections

Remote Operations

GPU Kill supports SSH-based remote management, allowing you to control GPUs across distributed systems.

SSH Configuration

# Basic remote connection
gpukill --remote staging-server --list

# With custom SSH options
gpukill --remote server --ssh-user admin --ssh-port 2222 --list
gpukill --remote server --ssh-key ~/.ssh/id_rsa --list
gpukill --remote server --ssh-password mypassword --list

Remote Authentication

  • SSH Keys: Preferred method for automated operations
  • Password: Interactive or provided via command line
  • SSH Agent: Uses system SSH agent for key management
  • Custom Ports: Support for non-standard SSH ports

Remote Requirements

  • SSH access to remote host
  • gpukill installed on remote host
  • Proper SSH key permissions (chmod 600 ~/.ssh/id_rsa)
  • Network connectivity to remote host

Remote Operations

All local operations work remotely:

# Remote monitoring
gpukill --remote server --list --details --watch

# Remote process management
gpukill --remote server --kill --pid 1234
gpukill --remote server --kill --filter "python.*" --batch

# Remote GPU control
gpukill --remote server --reset --gpu 0
gpukill --remote server --reset --all

# Remote auditing
gpukill --remote server --audit --audit-summary

Dashboard

The GPU Kill dashboard is a modern web interface built with Nuxt.js and Tailwind CSS for real-time cluster monitoring.

Check the Kill Suite website.

MCP Server

GPU Kill includes an MCP (Model Context Protocol) server that enables AI assistants and other tools to interact with GPU management functionality through a standardized interface.

Overview

The MCP server provides a JSON-RPC interface that allows AI assistants to:

  • Monitor GPU health and performance
  • Kill problematic processes
  • Reset crashed GPUs
  • Scan for security threats
  • Manage resource policies
  • Automate GPU operations

Architecture

The MCP server is built as a separate crate (gpukill-mcp) that integrates with the main GPU Kill functionality:

  • HTTP Server: Runs on port 3001 (configurable via MCP_PORT)
  • JSON-RPC Protocol: Standard MCP protocol for AI integration
  • Resource Handler: Provides read-only access to GPU data
  • Tool Handler: Executes GPU management actions
  • Cross-platform: Works on macOS, Linux, and Windows

Resources

The MCP server exposes the following resources for AI assistants to read:

gpu://list

Current GPU status and utilization data including:

  • GPU ID, name, and vendor
  • Memory usage and total capacity
  • Utilization percentage
  • Temperature and power usage
  • Active processes

gpu://processes

Currently running GPU processes with:

  • Process ID and name
  • Memory usage
  • User information
  • GPU assignment

gpu://audit

Historical GPU usage data including:

  • Usage patterns over time
  • Process execution history
  • Resource utilization trends
  • User activity logs

gpu://policies

Current Guard Mode policies with:

  • User-specific limits
  • Group policies
  • GPU-specific restrictions
  • Time-based overrides

gpu://rogue-detection

Security scan results including:

  • Suspicious processes
  • Crypto miner detection
  • Resource abuse patterns
  • Data exfiltration attempts

Tools

The MCP server provides the following tools for AI assistants to execute:

kill_gpu_process

Kill a specific GPU process by PID:

{
  "name": "kill_gpu_process",
  "arguments": {
    "pid": 12345,
    "force": false
  }
}

reset_gpu

Reset a specific GPU:

{
  "name": "reset_gpu",
  "arguments": {
    "gpu_id": 0,
    "force": false
  }
}

scan_rogue_activity

Scan for suspicious GPU activity:

{
  "name": "scan_rogue_activity",
  "arguments": {
    "hours": 24
  }
}

create_user_policy

Create a user policy for Guard Mode:

{
  "name": "create_user_policy",
  "arguments": {
    "username": "developer",
    "memory_limit_gb": 8.0,
    "utilization_limit_pct": 70.0,
    "process_limit": 3
  }
}

get_gpu_status

Get detailed status of a specific GPU:

{
  "name": "get_gpu_status",
  "arguments": {
    "gpu_id": 0
  }
}

kill_processes_by_name

Kill all processes matching a name pattern:

{
  "name": "kill_processes_by_name",
  "arguments": {
    "pattern": "python.*train",
    "force": false
  }
}

API Endpoints

HTTP Interface

  • POST /mcp - Main MCP JSON-RPC endpoint
  • GET /health - Health check endpoint

MCP Methods

  • initialize - Initialize the MCP connection
  • resources/list - List available resources
  • resources/read - Read resource contents
  • tools/list - List available tools
  • tools/call - Execute a tool
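
Tools are invoked by sending tools/call to the /mcp endpoint; a sketch calling get_gpu_status, assuming params take the {name, arguments} shape shown in the Tools section:

# Invoke the get_gpu_status tool via JSON-RPC
curl -X POST http://localhost:3001/mcp \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":"3","method":"tools/call","params":{"name":"get_gpu_status","arguments":{"gpu_id":0}}}'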

Configuration

The MCP server can be configured using environment variables:

  • MCP_HOST - Bind address (default: 127.0.0.1). Use 127.0.0.1 for local-only access. Set to 0.0.0.0 only if you need remote access and have other protections (e.g. firewall, auth).
  • MCP_PORT - Port to listen on (default: 3001)
  • RUST_LOG - Logging level (default: info)

Usage Examples

Starting the Server

# Start the MCP server
cargo run --release -p gpukill-mcp

# Or with custom port
MCP_PORT=3001 cargo run --release -p gpukill-mcp

Testing the Server

# Health check
curl -X GET http://localhost:3001/health

# List available tools
curl -X POST http://localhost:3001/mcp \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":"1","method":"tools/list","params":{}}'

# Get GPU list
curl -X POST http://localhost:3001/mcp \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":"2","method":"resources/read","params":{"uri":"gpu://list"}}'

Development

# Run in development mode
cargo run -p gpukill-mcp

# Run with debug logging
RUST_LOG=debug cargo run -p gpukill-mcp

# Build release version
cargo build --release -p gpukill-mcp

Natural Language Examples

Ask your AI assistant to use the MCP tools with natural language:

What GPUs do I have and what's their current usage?
Kill the Python process that's stuck on GPU 0
Kill all training processes that are using too much GPU memory
Show me GPU usage and kill any stuck processes
Scan for crypto miners and suspicious activity
Create a policy to limit user memory usage to 8GB
Reset GPU 1 because it's not responding
What processes are currently using my GPUs?

For detailed MCP server documentation, see mcp/README.md.

Output Formats

Table Format (Default)

The table format provides a clean, human-readable view of GPU information:

┌─────┬──────────────────────┬─────────────────┬──────────┬──────────┬─────────┬─────────────┬──────┬─────────────────────┐
│ GPU │ NAME                 │ MEM_USED/TOTAL  │ UTIL(%)  │ TEMP(°C) │ POWER(W)│ ECC(volatile)│ PIDS │ TOP_PROC            │
├─────┼──────────────────────┼─────────────────┼──────────┼──────────┼─────────┼─────────────┼──────┼─────────────────────┤
│ 0   │ NVIDIA GeForce RTX...│ 2.0/8.0 GiB     │ 45.2     │ 72       │ 150.3   │ 0           │ 2    │ python:12345:1024MB │
└─────┴──────────────────────┴─────────────────┴──────────┴──────────┴─────────┴─────────────┴──────┴─────────────────────┘

Columns:

  • GPU: GPU index
  • NAME: GPU model name
  • MEM_USED/TOTAL: Memory usage in GiB
  • UTIL(%): GPU utilization percentage
  • TEMP(°C): Current temperature
  • POWER(W): Current power consumption
  • ECC(volatile): ECC error count (if available)
  • PIDS: Number of processes using this GPU
  • TOP_PROC: Highest memory-using process (format: name:pid:memory)

Detailed Table Format

When using --details, additional process rows are shown:

┌─────┬───────┬───────────┬────────┬─────────┬────────────┬───────────┐
│ GPU │ PID   │ USER      │ PROC   │ VRAM_MB │ START_TIME │ CONTAINER │
├─────┼───────┼───────────┼────────┼─────────┼────────────┼───────────┤
│ 0   │ 12345 │ developer │ python │ 1024    │ 1h 30m     │ -         │
│ 0   │ 12346 │ developer │ python │ 512     │ 45m        │ -         │
└─────┴───────┴───────────┴────────┴─────────┴────────────┴───────────┘

JSON Format

JSON output provides structured data for scripting and automation:

{
  "host": "workstation",
  "ts": "2024-01-01T12:00:00.000Z",
  "gpus": [
    {
      "gpu_index": 0,
      "name": "NVIDIA GeForce RTX 4090",
      "mem_used_mb": 2048,
      "mem_total_mb": 8192,
      "util_pct": 45.2,
      "temp_c": 72,
      "power_w": 150.3,
      "ecc_volatile": 0,
      "pids": 2,
      "top_proc": {
        "gpu_index": 0,
        "pid": 12345,
        "user": "developer",
        "proc_name": "python",
        "used_mem_mb": 1024,
        "start_time": "1h 30m",
        "container": null
      }
    }
  ],
  "procs": [
    {
      "gpu_index": 0,
      "pid": 12345,
      "user": "developer",
      "proc_name": "python",
      "used_mem_mb": 1024,
      "start_time": "1h 30m",
      "container": null
    }
  ]
}
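
The structured fields also lend themselves to simple automation; a hedged polling sketch that flags GPUs above 90% utilization, assuming jq is installed:

# Poll every 30 seconds and warn on GPUs above 90% utilization
while true; do
  gpukill --list --output json \
    | jq -r '.gpus[] | select(.util_pct > 90) | "WARN: GPU \(.gpu_index) (\(.name)) at \(.util_pct)%"'
  sleep 30
done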

Configuration

Configuration File

Create a configuration file at ~/.config/gpukill/config.toml:

# Logging
log_level = "info"

# Output
output_format = "table"
use_colors = true
table_width = 120

# Process management
default_timeout_secs = 5
max_processes_summary = 10

# Watch mode
watch_interval_secs = 2

# Display options
show_details = false

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| GPUKILL_LOG_LEVEL | Log level (trace, debug, info, warn, error) | info |
| GPUKILL_OUTPUT_FORMAT | Output format (table, json) | table |
| GPUKILL_DEFAULT_TIMEOUT | Default timeout in seconds | 5 |
| GPUKILL_SHOW_DETAILS | Show detailed process information | false |
| GPUKILL_WATCH_INTERVAL | Watch mode refresh interval | 2 |
| GPUKILL_TABLE_WIDTH | Table width limit | 120 |
| GPUKILL_USE_COLORS | Enable/disable colored output | true |

Configuration Precedence

  1. Command-line arguments (highest priority)
  2. Environment variables
  3. Configuration file
  4. Default values (lowest priority)
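
A quick illustration of the precedence rules: the command-line flag below wins over the environment variable, so output is rendered as a table:

# CLI argument overrides the environment variable
GPUKILL_OUTPUT_FORMAT=json gpukill --list --output table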

Safety Features

Process Termination Safety

  • Existence Validation: Verifies the target process exists before attempting termination
  • GPU Usage Check: Confirms the process is actually using a GPU (unless --force is used)
  • Graceful Shutdown: Sends SIGTERM first for clean process termination
  • Escalation Control: Only escalates to SIGKILL with explicit --force flag
  • Timeout Protection: Prevents indefinite waiting with configurable timeouts

GPU Reset Safety

  • Process Detection: Lists all active processes before reset
  • Confirmation Required: Requires --force flag if active processes are detected
  • Index Validation: Verifies GPU index exists before reset attempt
  • Operation Support: Checks if reset is supported on the target GPU
  • Clear Messaging: Provides detailed error messages for unsupported operations

Error Handling

  • Actionable Messages: Clear, specific error messages with suggested solutions
  • Appropriate Exit Codes: Different exit codes for different failure modes
  • Graceful Degradation: Continues operation when non-critical components fail
  • NVML Fallback: Handles cases where NVML is unavailable with helpful messages

Exit Codes

| Code | Meaning | Description |
|------|---------|-------------|
| 0 | Success | Operation completed successfully |
| 1 | General Error | Unspecified error occurred |
| 2 | NVML Failure | NVML initialization failed |
| 3 | Invalid Arguments | Command-line argument validation failed |
| 4 | Permission Error | Insufficient permissions for operation |
| 5 | Unsupported Operation | Operation not supported on this system |
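
Scripts can branch on these codes; a minimal sketch:

# React to gpukill's documented exit codes
gpukill --reset --gpu 0
case $? in
  0) echo "reset succeeded" ;;
  4) echo "insufficient permissions, retry with sudo" >&2 ;;
  5) echo "reset not supported on this GPU" >&2 ;;
  *) echo "reset failed" >&2 ;;
esac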

Troubleshooting

Common Issues

NVML Initialization Failed

  • Cause: NVIDIA drivers are not installed, are outdated, or NVML library is not accessible.
  • Solution:
    • Ensure NVIDIA drivers are properly installed and up to date.
    • Verify that your GPU is recognized by the system (e.g., run nvidia-smi on Linux/Windows).
    • Check if you have the necessary permissions to access NVML (you might need to run gpukill with sudo for some operations).

AMD ROCm not installed

  • Cause: AMD ROCm drivers are not installed, or rocm-smi is not in your PATH.
  • Solution:
    • Install ROCm drivers for your AMD GPU
    • Ensure rocm-smi is accessible from your terminal
    • Check if you have the necessary permissions to access AMD GPU information

Intel GPU tools not available

  • Cause: Intel GPU tools are not installed, or intel_gpu_top is not in your PATH.
  • Solution:
    • Install intel-gpu-tools package for Intel GPU support
    • Ensure intel_gpu_top is accessible from your terminal
    • Check if you have the necessary permissions to access Intel GPU information

Permission Denied

  • Cause: The current user does not have the necessary privileges to perform the requested action (e.g., killing a process owned by another user, resetting a GPU).
  • Solution:
    • For process management, ensure you have rights to manage the target PID.
    • For GPU reset or other system-level operations, try running gpukill with sudo.
    • Consult your system's documentation for managing user permissions for NVIDIA/AMD devices.

GPU Not Found

  • Cause: The specified GPU index does not exist, or the GPU is not properly detected.
  • Solution:
    • Use gpukill --list to see available GPU indices.
    • Ensure your GPU is physically connected and powered on.
    • Verify that your GPU drivers are correctly installed and recognize the GPU.

No GPU vendors available

  • Cause: No supported GPU vendors (NVIDIA, AMD, Intel, or Apple Silicon) could be initialized or found on the system.
  • Solution:
    • Ensure at least one supported GPU vendor's drivers and management tools are correctly installed.
    • Check system logs for driver-related errors.

Process filtering not working

  • Cause: Invalid regex pattern or no processes match the filter.
  • Solution:
    • Verify your regex pattern is correct
    • Use gpukill --list --details to see available processes
    • Test your pattern with a simple filter first

Container detection not working

  • Cause: Container runtime not detected or process not in container.
  • Solution:
    • Ensure container runtime (Docker, Podman, etc.) is running
    • Check if the process is actually running in a container
    • Container detection is best-effort and may not work in all environments

Batch operations failing

  • Cause: Permission issues or processes not found.
  • Solution:
    • Ensure you have permission to kill the target processes
    • Use --force flag if processes are unresponsive
    • Check that the filter pattern matches existing processes

Guard Mode

Guard Mode provides soft policy enforcement to prevent GPU resource abuse with safe testing capabilities. It allows administrators to set policies for users, groups, and GPUs, with configurable enforcement modes and comprehensive monitoring.

Overview

Guard Mode is designed to:

  • Prevent Resource Abuse: Set limits on memory usage, GPU utilization, and concurrent processes
  • Safe Testing: Dry-run mode allows testing policies without affecting running processes
  • Flexible Enforcement: Choose between soft warnings and hard enforcement actions
  • Real-time Monitoring: Live policy violation detection and alerting

Configuration

Guard Mode configuration is stored in TOML format at:

  • Linux: ~/.local/share/gpukill/guard_mode_config.toml
  • macOS: ~/Library/Application Support/gpukill/guard_mode_config.toml
  • Windows: %APPDATA%\gpukill\guard_mode_config.toml

Policy Types

User Policies

Control resource usage per user:

[user_policies.developer]
username = "developer"
memory_limit_gb = 8.0
utilization_limit_pct = 70.0
duration_limit_hours = 12.0
max_concurrent_processes = 3
priority = 5
allowed_gpus = []
blocked_gpus = []
time_overrides = []

Group Policies

Control resource usage per group with member management:

[group_policies.researchers]
group_name = "researchers"
total_memory_limit_gb = 32.0
total_utilization_limit_pct = 80.0
max_concurrent_processes = 10
priority = 3
allowed_gpus = [0, 1]
blocked_gpus = []
members = ["alice", "bob", "charlie"]

Key Features:

  • Member Management: Specify which users belong to the group
  • Total Resource Limits: Set aggregate limits for all group members
  • CLI Support: Add members via --guard-group-members "user1,user2,user3"

GPU Policies

Control access to specific GPUs with user restrictions:

[gpu_policies."0"]
gpu_index = 0
max_memory_gb = 24.0
max_utilization_pct = 90.0
reserved_memory_gb = 2.0
allowed_users = ["alice", "bob"]
blocked_users = []
maintenance_window = null

Key Features:

  • User Access Control: Specify which users can access specific GPUs
  • Reserved Memory: Set aside memory that cannot be used by processes
  • CLI Support: Add allowed users via --guard-gpu-allowed-users "user1,user2,user3"
  • Flexible Access: Leave allowed_users empty to allow all users

Time Policies

Control resource usage during specific time periods:

[[time_policies]]
name = "business_hours"
start_time = "09:00"
end_time = "17:00"
days_of_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
memory_limit_gb = 16.0
utilization_limit_pct = 80.0
max_concurrent_processes = 5

Enforcement Modes

Dry-Run Mode

Safe testing without affecting running processes:

  • Simulation Only: All policy violations are simulated
  • No Actions Taken: Processes continue running normally
  • Detailed Logging: Shows exactly what would happen
  • Safe Testing: Perfect for policy validation

Soft Enforcement

Warnings and notifications before hard actions:

  • Warning Notifications: Send alerts for policy violations
  • Grace Period: Allow time for users to adjust
  • Escalation: Progress to hard enforcement if violations persist
  • Logging: Record all violations and warnings

Hard Enforcement

Immediate action on policy violations:

  • Process Termination: Kill processes that violate policies
  • Resource Limits: Enforce memory and utilization limits
  • Access Control: Block access to restricted GPUs
  • Immediate Action: No grace period for critical violations

CLI Commands

Basic Guard Mode Operations

# Enable Guard Mode
gpukill --guard --guard-enable

# Disable Guard Mode
gpukill --guard --guard-disable

# View current configuration
gpukill --guard --guard-config

# Set dry-run mode (safe testing)
gpukill --guard --guard-dry-run

# Set enforcement mode (live enforcement)
gpukill --guard --guard-enforce

Policy Management

User Policies:

# Add user policy
gpukill --guard --guard-add-user "developer" --guard-memory-limit 8.0 --guard-utilization-limit 70.0 --guard-process-limit 3

# Remove user policy
gpukill --guard --guard-remove-user "developer"

# Update policy limits
gpukill --guard --guard-memory-limit 16.0 --guard-utilization-limit 80.0 --guard-process-limit 5

Group Policies:

# Add group policy with members
gpukill --guard --guard-add-group "developers" --guard-group-members "alice,bob,charlie" --guard-group-memory-limit 32.0 --guard-group-utilization-limit 80.0 --guard-group-process-limit 15

# Add group policy without members (empty group)
gpukill --guard --guard-add-group "testers" --guard-group-memory-limit 16.0

# Remove group policy
gpukill --guard --guard-remove-group "developers"

GPU Policies:

# Add GPU policy with allowed users
gpukill --guard --guard-add-gpu 0 --guard-gpu-allowed-users "alice,bob" --guard-gpu-memory-limit 24.0 --guard-gpu-utilization-limit 90.0 --guard-gpu-reserved-memory 2.0

# Add GPU policy allowing all users
gpukill --guard --guard-add-gpu 1 --guard-gpu-memory-limit 16.0

# Remove GPU policy
gpukill --guard --guard-remove-gpu 0

Additional CLI Options:

  • --guard-group-members <MEMBERS>: Comma-separated list of group members
  • --guard-gpu-allowed-users <USERS>: Comma-separated list of allowed users for GPU
  • --guard-group-memory-limit <GB>: Group memory limit in GB
  • --guard-group-utilization-limit <PERCENT>: Group utilization limit percentage
  • --guard-group-process-limit <COUNT>: Group process limit count
  • --guard-gpu-memory-limit <GB>: GPU memory limit in GB
  • --guard-gpu-utilization-limit <PERCENT>: GPU utilization limit percentage
  • --guard-gpu-reserved-memory <GB>: GPU reserved memory in GB

Policy Testing

# Test policies in dry-run mode
gpukill --guard --guard-test-policies

# Toggle dry-run mode
gpukill --guard --guard-toggle-dry-run

Configuration Management

# Export configuration
gpukill --guard --guard-export-config > guard_config.json

# Import configuration
gpukill --guard --guard-import-config guard_config.json

API Endpoints

Configuration Management

# Get Guard Mode configuration
GET /api/guard/config

# Update Guard Mode configuration
POST /api/guard/config
Content-Type: application/json
{
  "global": {
    "enabled": true,
    "dry_run": true,
    "default_memory_limit_gb": 16.0,
    "default_utilization_limit_pct": 80.0
  }
}

Policy Management

# Get policies
GET /api/guard/policies

# Update policies
POST /api/guard/policies
Content-Type: application/json
{
  "user_policies": {
    "developer": {
      "username": "developer",
      "memory_limit_gb": 8.0,
      "utilization_limit_pct": 70.0,
      "max_concurrent_processes": 3
    }
  }
}

Status and Testing

# Get Guard Mode status
GET /api/guard/status

# Toggle dry-run mode
POST /api/guard/toggle-dry-run

# Test policies
POST /api/guard/test-policies

Violation Types

Memory Violations

  • Excessive Memory Usage: Process exceeds memory limit
  • Memory Hoarding: Long-running processes with high memory usage
  • Memory Leaks: Processes with continuously increasing memory usage

Utilization Violations

  • High GPU Utilization: Process exceeds utilization limit
  • Sustained High Usage: Long periods of high GPU utilization
  • Resource Waste: Processes with low efficiency

Process Violations

  • Too Many Processes: User exceeds concurrent process limit
  • Long-running Processes: Processes exceeding duration limits
  • Unauthorized Processes: Processes not allowed by policy

Access Violations

  • GPU Access: Attempting to use blocked GPUs
  • Time Restrictions: Using GPUs during restricted hours
  • User Restrictions: Unauthorized user access

Enforcement Actions

Warning Actions

  • Console Notifications: Display warnings in terminal
  • Log File Entries: Record warnings in log files
  • Email Alerts: Send email notifications (if configured)
  • Webhook Notifications: Send alerts to external systems

Hard Enforcement Actions

  • Process Termination: Kill violating processes
  • Resource Limits: Enforce memory and utilization limits
  • Access Blocking: Prevent access to restricted resources
  • User Notifications: Inform users of policy violations

Best Practices

Policy Design

  • Start Conservative: Begin with generous limits and tighten over time
  • Test Thoroughly: Use dry-run mode extensively before enabling enforcement
  • Monitor Closely: Watch for false positives and adjust policies accordingly
  • Document Policies: Keep clear records of policy decisions and changes

Implementation

  • Gradual Rollout: Enable policies for a subset of users first
  • User Communication: Inform users about new policies and limits
  • Training: Provide guidance on policy compliance
  • Feedback Loop: Collect user feedback and adjust policies

Monitoring

  • Regular Reviews: Periodically review policy effectiveness
  • Violation Analysis: Analyze patterns in policy violations
  • Performance Impact: Monitor system performance under policies
  • User Satisfaction: Track user satisfaction with policy enforcement

Troubleshooting

Common Issues

  • False Positives: Adjust policy thresholds if legitimate processes are flagged
  • Performance Impact: Monitor system performance under policy enforcement
  • User Complaints: Address user concerns about policy restrictions
  • Configuration Errors: Validate policy configuration syntax

Debugging

  • Enable Debug Logging: Use RUST_LOG=debug for detailed logs
  • Test Policies: Use dry-run mode to validate policy behavior
  • Check Configuration: Verify policy configuration files
  • Monitor Violations: Review violation logs for patterns

Development

Building

# Debug build (fastest, ~3 seconds)
cargo build

# Fast release build (recommended for development, ~28 seconds)
cargo build --profile release-fast

# Standard release build (production-ready, ~28 seconds)
cargo build --release

# Maximum optimization (slowest, best performance, ~60+ seconds)
cargo build --profile release-max

# Run tests
cargo test

# Run with logging
RUST_LOG=debug cargo run -- --list

Build Performance Optimization

The project includes multiple build profiles optimized for different use cases:

  • dev: Fast debug builds for development
  • release-fast: Optimized for development with good performance
  • release: Balanced optimization for production use
  • release-max: Maximum optimization for final releases

Performance improvements made:

  • Changed from fat LTO (lto = true) to thin LTO (lto = "thin")
  • Increased codegen units from 1 to 4 for parallel compilation
  • Added fast release profile for development workflows
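
A sketch of what the corresponding Cargo.toml profiles might look like, reconstructed from the settings above (the exact values in the repository may differ):

# Hypothetical reconstruction of the custom build profiles
[profile.release-fast]
inherits = "release"
lto = "thin"
codegen-units = 4

[profile.release-max]
inherits = "release"
lto = "fat"
codegen-units = 1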

Testing

The project includes comprehensive tests:

# Run all tests
cargo test

# Run specific test module
cargo test args::tests

# Run integration tests
cargo test --test integration_tests

# Run with mock NVML (for CI)
cargo test --features mock_nvml

Adding New Features

  1. Define CLI arguments in src/args.rs
  2. Implement core logic in appropriate module
  3. Add tests for new functionality
  4. Update documentation in this file
  5. Test on multiple platforms

Performance Considerations

  • NVML Caching: GPU information is cached during snapshot creation
  • Process Enumeration: Uses efficient system calls for process information
  • Memory Management: Minimizes allocations in hot paths
  • Error Handling: Fast-path for common success cases

Acknowledgments

  • Built with NVML Wrapper for NVIDIA GPU management
  • AMD GPU support via ROCm and rocm-smi
  • Intel GPU support via intel-gpu-tools and intel_gpu_top
  • Apple Silicon GPU support via macOS system_profiler and system APIs
  • Uses Clap for command-line argument parsing
  • Table rendering powered by Tabled
  • Error handling with Color Eyre
  • Process management via Sysinfo
  • Regex processing with Regex
  • Signal handling with Nix