- Architecture
- Installation
- Command Reference
- Enhanced Features
- Audit System
- Suspicious Usage Detection
- Guard Mode
- Cluster Management
- Remote Operations
- Dashboard
- MCP Server
- Output Formats
- Configuration
- Safety Features
- Exit Codes
- Troubleshooting
- Development
GPU Kill is built with Rust and supports NVIDIA, AMD, Intel, and Apple Silicon GPUs through vendor-specific interfaces. The tool is designed with safety, usability, multi-vendor support, and distributed cluster management in mind.
- CLI Parser: Uses clap for robust argument parsing and validation
- Vendor Abstraction: Multi-vendor GPU support (NVIDIA, AMD, Intel, Apple Silicon)
- NVML Wrapper: Interfaces with NVIDIA's management library
- ROCm Interface: AMD GPU management via rocm-smi
- Intel GPU Tools: Intel GPU management via intel_gpu_top
- Apple Silicon Interface: Apple Silicon GPU management via system_profiler and system APIs
- Enhanced Process Manager: Advanced process filtering and batch operations
- Container Detection: Container-aware process management
- Audit System: Automatic GPU usage tracking and historical analysis
- Renderer: Formats output as tables or JSON
- Configuration: Supports file and environment-based configuration
- Coordinator API: RESTful API server for cluster management
- SSH Remote Manager: Secure remote GPU management via SSH
- MCP Server: Model Context Protocol server for AI assistant integration
- clap: Command-line argument parsing
- nvml-wrapper: NVIDIA GPU management
- sysinfo: System process information
- regex: Process filtering with regex patterns
- tabled: Table formatting
- serde: JSON serialization
- serde_json: JSON parsing and generation
- chrono: Date and time handling for audit timestamps
- dirs: Cross-platform data directory management
- tracing: Structured logging
- color-eyre: Error handling
- axum: HTTP server framework for coordinator API
- tower: HTTP middleware and services
- tower-http: HTTP middleware (CORS, tracing)
- uuid: Unique identifier generation for nodes
- jsonrpc-core: JSON-RPC protocol implementation for MCP server
- jsonrpc-ws-server: WebSocket JSON-RPC server for MCP
- futures-util: Async utilities for WebSocket handling
- tokio: Async runtime for HTTP server and WebSocket
- NVIDIA GPU Support:
- NVIDIA GPU with supported drivers
- NVIDIA Management Library (NVML) - included with NVIDIA drivers
- AMD GPU Support:
- AMD GPU with ROCm drivers installed
- rocm-smi command-line tool available
- Intel GPU Support:
- Intel GPU with intel-gpu-tools package installed
- intel_gpu_top command-line tool available
- Apple Silicon GPU Support:
- macOS with Apple Silicon (M1, M2, M3, M4)
- system_profiler command-line tool available
- General Requirements:
- Rust 1.70+ (for building from source)
- Linux, macOS, or Windows
cargo install gpukill

git clone https://github.com/treadiehq/gpu-kill.git
cd gpukill
cargo build --release

The project supports cross-compilation for different platforms:
# For Linux from macOS
cargo build --release --target x86_64-unknown-linux-gnu
# For Windows from Linux
cargo build --release --target x86_64-pc-windows-gnu

| Option | Description | Default |
|---|---|---|
| --log-level <LEVEL> | Set logging level | info |
| --config <PATH> | Configuration file path | None |
| --remote <HOST> | Remote host to connect to via SSH | None |
| --ssh-user <USER> | SSH username (requires --remote) | Current user |
| --ssh-port <PORT> | SSH port (requires --remote) | 22 |
| --ssh-key <PATH> | SSH private key path (requires --remote) | None |
| --ssh-password <PASSWORD> | SSH password (requires --remote) | Interactive prompt |
| --ssh-timeout <SECONDS> | SSH connection timeout (requires --remote) | 30 |
| --register-node <URL> | Register this node with a coordinator | None |
| --help | Show help information | - |
| --version | Show version information | - |
gpukill --list [OPTIONS]

Options:
- --details: Show detailed per-process information
- --watch: Refresh output every 2 seconds until Ctrl-C
- --output <FORMAT>: Output format (table or json)
- --vendor <VENDOR>: Filter by GPU vendor (nvidia, amd, intel, apple, all)
Examples:
# Basic listing
gpukill --list
# With process details
gpukill --list --details
# Watch mode
gpukill --list --watch
# JSON output
gpukill --list --output json
# Combined options
gpukill --list --details --watch --output json
# Filter by vendor
gpukill --list --vendor nvidia
gpukill --list --vendor amd --details
gpukill --list --vendor apple --watch

gpukill --kill (--pid <PID> | --filter <PATTERN>) [OPTIONS]

Required (one of):
- --pid <PID>: Process ID to terminate
- --filter <PATTERN>: Filter processes by name pattern (supports regex)

Options:
- --timeout-secs <SECONDS>: Timeout before escalation (default: 5)
- --force: Escalate to SIGKILL after timeout
- --batch: Kill multiple processes matching the filter (requires --filter)
Examples:
# Graceful termination of a single process
gpukill --kill --pid 12345
# Custom timeout for a single process
gpukill --kill --pid 12345 --timeout-secs 10
# Force escalation for a single process
gpukill --kill --pid 12345 --force
# Kill processes matching a pattern
gpukill --kill --filter "python.*"
# Batch kill all processes matching a pattern
gpukill --kill --filter "python.*" --batch --force

gpukill --reset [--gpu <ID> | --all] [OPTIONS]

Required (one of):
- --gpu <ID>: Specific GPU ID to reset
- --all: Reset all GPUs

Options:
- --force: Force reset even with active processes
Examples:
# Reset specific GPU
gpukill --reset --gpu 0
# Reset all GPUs
gpukill --reset --all
# Force reset
gpukill --reset --gpu 0 --force

gpukill --audit [OPTIONS]

Options:
- --audit-user <USER>: Filter by specific user
- --audit-process <PATTERN>: Filter by process name pattern
- --audit-hours <HOURS>: Show records from last N hours (default: 24)
- --audit-summary: Show summary statistics instead of detailed records

gpukill --audit --rogue [OPTIONS]

Detection Options:
- --rogue: Perform rogue activity detection
- --rogue-config: Show current detection configuration
- --rogue-memory-threshold <GB>: Set memory usage threshold
- --rogue-utilization-threshold <PERCENT>: Set GPU utilization threshold
- --rogue-duration-threshold <HOURS>: Set process duration threshold
- --rogue-confidence-threshold <CONFIDENCE>: Set minimum confidence for detection

Whitelist Management:
- --rogue-whitelist-process <NAME>: Add process to whitelist
- --rogue-unwhitelist-process <NAME>: Remove process from whitelist
- --rogue-whitelist-user <USERNAME>: Add user to whitelist
- --rogue-unwhitelist-user <USERNAME>: Remove user from whitelist

Configuration Management:
- --rogue-export-config: Export configuration to JSON
- --rogue-import-config <FILE>: Import configuration from JSON file
Examples:
# Detect suspicious activity
gpukill --audit --rogue --audit-hours 48
# View current configuration
gpukill --audit --rogue-config
# Update detection thresholds
gpukill --audit --rogue-memory-threshold 15.0 --rogue-utilization-threshold 90.0
# Manage whitelists
gpukill --audit --rogue-whitelist-process "my-app"
gpukill --audit --rogue-whitelist-user "developer"
# Export/import configuration
gpukill --audit --rogue-export-config > config.json
gpukill --audit --rogue-import-config config.json

gpukill --server [OPTIONS]

Options:
- --server-port <PORT>: Port for coordinator API (default: 8080)
- --server-host <HOST>: Host to bind coordinator API (default: 0.0.0.0)
Description: Starts the GPU Kill coordinator server that provides:
- RESTful API for cluster management
- WebSocket server for real-time updates
- Node registration and heartbeat management
- Magic Moment contention analysis
Examples:
# Start coordinator on default port 8080
gpukill --server
# Start coordinator on custom port
gpukill --server --server-port 9000
# Start coordinator on all interfaces
gpukill --server --server-host 0.0.0.0

gpukill --register-node <COORDINATOR_URL>

Description: Registers this node with a coordinator server for cluster management:
- Generates unique node ID
- Sends periodic GPU snapshots to coordinator
- Maintains heartbeat for health monitoring
- Enables cluster-wide monitoring and management
Examples:
# Register with default coordinator
gpukill --register-node http://coordinator:8080
# Register with custom coordinator
gpukill --register-node http://gpu-cluster:9000
# Register with HTTPS coordinator
gpukill --register-node https://secure-cluster:8443

gpukill automatically detects and utilizes available GPU vendors (NVIDIA, AMD, Intel, Apple Silicon). You can also filter the displayed information by vendor.

Options:
- --vendor <VENDOR>: Filter by GPU vendor.
- nvidia: Show only NVIDIA GPUs.
- amd: Show only AMD GPUs.
- intel: Show only Intel GPUs.
- apple: Show only Apple Silicon GPUs.
- all: Show all detected GPUs (default if --vendor is not specified).
Examples:
# List only NVIDIA GPUs
gpukill --list --vendor nvidia
# List only AMD GPUs with details
gpukill --list --vendor amd --details
# List only Intel GPUs
gpukill --list --vendor intel
# Monitor all GPUs in watch mode
gpukill --list --vendor all --watch

Vendor Detection:
- NVIDIA: Automatically detected if NVML is available
- AMD: Automatically detected if rocm-smi is available
- Intel: Automatically detected if intel_gpu_top is available
- Apple Silicon: Automatically detected if running on macOS with Apple Silicon
- Mixed Systems: Supports systems with multiple GPU vendors
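As a rough illustration, detection along these lines can be approximated by probing for each vendor's management tool. This is a sketch only — gpukill itself links NVML and queries system APIs rather than merely checking the PATH:

```python
import platform
import shutil

def detect_vendors() -> list[str]:
    """Best-effort vendor probe, mirroring the detection rules above."""
    vendors = []
    if shutil.which("nvidia-smi"):        # NVML ships with the NVIDIA driver
        vendors.append("nvidia")
    if shutil.which("rocm-smi"):          # ROCm command-line tool
        vendors.append("amd")
    if shutil.which("intel_gpu_top"):     # intel-gpu-tools package
        vendors.append("intel")
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        vendors.append("apple")           # Apple Silicon Mac
    return vendors
```

On a mixed system the list can contain several entries, which is why gpukill treats vendors as a set rather than a single backend.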
The --kill command now supports filtering processes by name using regular expressions, enabling powerful batch operations.
Options:
- --filter <PATTERN>: A regular expression pattern to match against process names.
- --batch: When used with --filter, all matching processes will be targeted for termination. Without --batch, gpukill will list matching processes and warn you to use --batch to proceed with killing.
Examples:
# List processes matching "python" (case-sensitive)
gpukill --list --details --filter "python"
# Kill all processes whose names start with "tensor"
gpukill --kill --filter "^tensor" --batch --force
# Find and kill all processes related to "jupyter"
gpukill --kill --filter "jupyter" --batch

gpukill can now attempt to identify if a process is running within a container environment.

Options:
- --containers: When used with --list, an additional column or field will indicate if a process is running in a container (e.g., Docker, LXC, Kubernetes).
Examples:
# List GPUs and show container info for processes
gpukill --list --details --containers
# Watch containerized processes on NVIDIA GPUs
gpukill --list --watch --containers --vendor nvidia

Apple Silicon GPUs use unified memory architecture, which provides unique capabilities:
# Monitor Apple Silicon GPU
gpukill --list --vendor apple
# Monitor with details
gpukill --list --vendor apple --details
# Watch Apple Silicon GPU usage
gpukill --list --vendor apple --watch

Apple Silicon Characteristics:
- Unified Memory: GPU and CPU share the same memory pool
- Memory Estimation: GPU memory usage is estimated from active system memory
- Process Detection: Identifies Metal, OpenGL, and ML framework processes
- No Temperature/Power: These metrics are not available via system APIs
- No Reset Support: GPU reset requires kernel-level operations
The audit system automatically tracks GPU usage history whenever you run gpukill --list. This provides valuable insights into GPU utilization patterns, resource planning, and troubleshooting.
Automatic Data Collection:
- Every gpukill --list command automatically logs GPU usage data
- Data is stored in JSON Lines format for easy processing
- No additional configuration required - works out of the box
Data Storage:
- Linux: ~/.local/share/gpukill/audit.jsonl
- macOS: ~/Library/Application Support/gpukill/audit.jsonl
- Windows: %APPDATA%\gpukill\audit.jsonl
Data Captured:
- Timestamp of each GPU check
- GPU information (index, name, memory usage, utilization, temperature, power)
- Process information (when processes are using GPU)
- Container information (when available)
- User information (when processes are detected)
Basic Audit Queries:
# Show last 24 hours of GPU usage
gpukill --audit
# Show last 6 hours
gpukill --audit --audit-hours 6
# Show last 3 days
gpukill --audit --audit-hours 72

Filtered Queries:
# Show only specific user's GPU usage
gpukill --audit --audit-user john
# Show only specific process types
gpukill --audit --audit-process python
gpukill --audit --audit-process tensorflow
# Combine filters
gpukill --audit --audit-user alice --audit-process pytorch --audit-hours 12

Summary Reports:
# Get usage summary for last 24 hours
gpukill --audit --audit-summary
# Get summary for last week
gpukill --audit --audit-summary --audit-hours 168

The suspicious usage detection system provides comprehensive security monitoring for GPU resources, detecting crypto miners, suspicious processes, and resource abuse patterns.
Crypto Miner Detection:
- Identifies known mining software (xmrig, ccminer, ethminer, etc.)
- Detects mining patterns in process names and behavior
- Analyzes high GPU utilization and sustained usage
- Provides confidence-based scoring for mining activity
Suspicious Process Detection:
- Flags unusual process names and patterns
- Detects excessive resource usage
- Identifies processes from unusual users
- Analyzes process behavior over time
Resource Abuse Detection:
- Memory hogs consuming excessive GPU memory
- Long-running processes that may be stuck
- Excessive GPU utilization patterns
- Unauthorized access attempts
Risk Assessment:
- Confidence-based threat scoring (0.0 - 1.0)
- Risk level classification (Low, Medium, High, Critical)
- Weighted scoring for different threat types
- Actionable recommendations for each threat
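To make the scoring concrete, here is a hypothetical sketch of how the threat weights and risk thresholds from rogue_config.toml might combine. The aggregation gpukill actually uses may differ — this version simply takes the strongest weighted detection:

```python
# Default weights and thresholds as shown in the rogue_config.toml example.
THREAT_WEIGHTS = {
    "crypto_miner": 0.8,
    "suspicious_process": 0.6,
    "resource_abuser": 0.3,
    "data_exfiltrator": 0.9,
}
RISK_THRESHOLDS = [("critical", 0.9), ("high", 0.7), ("medium", 0.5), ("low", 0.3)]

def risk_level(detections: dict[str, float]) -> str:
    """detections maps threat type -> confidence (0.0 - 1.0).
    Score = strongest weighted detection, then map to a risk level."""
    score = max(
        (THREAT_WEIGHTS[t] * conf for t, conf in detections.items()),
        default=0.0,
    )
    for level, threshold in RISK_THRESHOLDS:
        if score >= threshold:
            return level
    return "none"
```

For example, a crypto miner detected at full confidence scores 0.8 and lands in the "high" band, while a resource abuser at 0.5 confidence scores only 0.15 and is below every threshold.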
Configuration File:
- Location: ~/.config/gpukill/rogue_config.toml
- Format: TOML with comprehensive detection rules
- Auto-creation: Default configuration created on first use
- Version tracking: Metadata includes version and modification timestamps
Detection Thresholds:
[detection]
max_memory_usage_gb = 20.0 # Maximum memory usage threshold
max_utilization_pct = 95.0 # Maximum GPU utilization threshold
max_duration_hours = 24.0 # Maximum process duration threshold
min_confidence_threshold = 0.7 # Minimum confidence for detection

Pattern Matching:
[patterns]
crypto_miner_patterns = ["cuda", "opencl", "miner", "hash"]
suspicious_process_names = ["xmrig", "ccminer", "ethminer"]
user_whitelist = ["root", "admin", "system"]
process_whitelist = ["python", "jupyter", "tensorflow"]

Risk Scoring:
[scoring.threat_weights]
crypto_miner = 0.8
suspicious_process = 0.6
resource_abuser = 0.3
data_exfiltrator = 0.9
[scoring.risk_thresholds]
critical = 0.9
high = 0.7
medium = 0.5
low = 0.3

Basic Detection:
# Scan for suspicious activity in last 24 hours
gpukill --audit --rogue
# Scan last 48 hours with JSON output
gpukill --audit --rogue --audit-hours 48 --output json

Configuration Management:
# View current configuration
gpukill --audit --rogue-config
# Update thresholds
gpukill --audit --rogue-memory-threshold 15.0
gpukill --audit --rogue-utilization-threshold 90.0
# Manage whitelists
gpukill --audit --rogue-whitelist-process "my-app"
gpukill --audit --rogue-whitelist-user "developer"

Configuration Export/Import:
# Export configuration
gpukill --audit --rogue-export-config > security-config.json
# Import configuration
gpukill --audit --rogue-import-config security-config.json

JSON Output:
# Export audit data as JSON for external processing
gpukill --audit --output json
# Export filtered data
gpukill --audit --audit-user john --output json > john_gpu_usage.json

The suspicious usage detection is fully integrated with the dashboard:
Check the Kill Suite website.
Each audit record contains:
{
"id": 1758236888745,
"timestamp": "2025-09-18T23:08:08.745114Z",
"gpu_index": 0,
"gpu_name": "Apple M3 Max",
"pid": null,
"user": null,
"process_name": null,
"memory_used_mb": 3216,
"utilization_pct": 0.0,
"temperature_c": 0,
"power_w": 0.0,
"container": null
}

Field Descriptions:
- id: Unique identifier (timestamp + process ID)
- timestamp: ISO 8601 timestamp of the measurement
- gpu_index: GPU device index
- gpu_name: Human-readable GPU name
- pid: Process ID (null for GPU-level records)
- user: Username (null for GPU-level records)
- process_name: Process name (null for GPU-level records)
- memory_used_mb: Memory usage in megabytes
- utilization_pct: GPU utilization percentage
- temperature_c: GPU temperature in Celsius
- power_w: GPU power consumption in watts
- container: Container name (null if not in container)
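Because each line is a standalone JSON object, the audit log is easy to post-process outside gpukill. A small sketch of the same filtering that --audit-user and --audit-hours perform, using the field names documented above:

```python
import json
from datetime import datetime, timedelta, timezone

def recent_records(lines, user=None, hours=24):
    """Yield audit records newer than the cutoff, optionally for one user."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
    for line in lines:
        rec = json.loads(line)
        # Timestamps use a trailing "Z"; normalize for fromisoformat.
        ts = datetime.fromisoformat(rec["timestamp"].replace("Z", "+00:00"))
        if ts >= cutoff and (user is None or rec.get("user") == user):
            yield rec
```

Pointing this at ~/.local/share/gpukill/audit.jsonl (or the macOS/Windows equivalent) gives you the same slicing the CLI flags provide, ready for further aggregation.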
Resource Planning:
# Analyze peak usage patterns
gpukill --audit --audit-summary --audit-hours 168
# Find heavy users
gpukill --audit --audit-summary | grep "Top Users"

Troubleshooting:
# Check what was running when GPU crashed
gpukill --audit --audit-hours 1
# Find processes that used most memory
gpukill --audit --audit-summary | grep "Top Processes"

Compliance and Billing:
# Generate usage report for specific user
gpukill --audit --audit-user alice --output json > alice_usage.json
# Export all data for analysis
gpukill --audit --output json > gpu_usage_export.json

Performance Analysis:
# Check hourly usage patterns
gpukill --audit --audit-summary --audit-hours 24
# Monitor specific application usage
gpukill --audit --audit-process tensorflow --audit-hours 48

File Size:
- Each audit record is approximately 200-300 bytes
- 1000 records ≈ 250KB
- 10,000 records ≈ 2.5MB
- Automatic cleanup recommended for long-term usage
Retention:
- No automatic cleanup by default
- Manual cleanup: Delete old records from audit.jsonl
- Recommended: Keep 30-90 days of data depending on needs
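One way to script the manual cleanup suggested above (a sketch, not a built-in command; prune_audit_log is a hypothetical helper):

```python
import json
import os
import tempfile
from datetime import datetime, timedelta, timezone

def prune_audit_log(path: str, keep_days: int = 30) -> int:
    """Rewrite the log keeping only records newer than keep_days; return kept count."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=keep_days)
    kept = 0
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as out, open(path) as src:
        for line in src:
            rec = json.loads(line)
            ts = datetime.fromisoformat(rec["timestamp"].replace("Z", "+00:00"))
            if ts >= cutoff:
                out.write(line)
                kept += 1
    os.replace(tmp, path)  # atomic swap: a crash never leaves a truncated log
    return kept
```

Run it against the audit path for your platform; the atomic replace means a backup copy (see below) is still advisable but not required for safety.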
Backup:
# Backup audit data
cp ~/.local/share/gpukill/audit.jsonl gpu_audit_backup.jsonl
# Restore audit data
cp gpu_audit_backup.jsonl ~/.local/share/gpukill/audit.jsonl

GPU Kill includes a powerful cluster management system that allows you to monitor and manage multiple GPU nodes from a central coordinator.
The coordinator is a RESTful API server that aggregates data from multiple GPU nodes and provides real-time cluster monitoring.
# Start coordinator on default port 8080
gpukill --server
# Start on custom port
gpukill --server --server-port 9000
# Start on all interfaces
gpukill --server --server-host 0.0.0.0

- GET /api/nodes - List all registered nodes
- POST /api/nodes/:id/register - Register a new node
- POST /api/nodes/:id/snapshot - Update node snapshot
- GET /api/cluster/snapshot - Get cluster-wide snapshot
- GET /api/cluster/contention - Get GPU contention analysis
- WS /ws - WebSocket for real-time updates
Nodes register themselves with the coordinator automatically when started with --register-node. Each node:
- Generates a unique UUID
- Reports hostname and IP address
- Sends periodic snapshots of GPU and process data
- Maintains heartbeat for health monitoring
The "Magic Moment" feature provides instant identification of GPU contention and resource blocking:
- Blocked GPUs: GPUs with high utilization that are blocking other users
- Top Users: Users ranked by GPU memory usage and utilization
- Contention Recommendations: Suggestions for optimizing GPU allocation
- Real-time Updates: Live updates via WebSocket connections
GPU Kill supports SSH-based remote management, allowing you to control GPUs across distributed systems.
# Basic remote connection
gpukill --remote staging-server --list
# With custom SSH options
gpukill --remote server --ssh-user admin --ssh-port 2222 --list
gpukill --remote server --ssh-key ~/.ssh/id_rsa --list
gpukill --remote server --ssh-password mypassword --list

- SSH Keys: Preferred method for automated operations
- Password: Interactive or provided via command line
- SSH Agent: Uses system SSH agent for key management
- Custom Ports: Support for non-standard SSH ports
- SSH access to remote host
- gpukill installed on remote host
- Proper SSH key permissions (chmod 600 ~/.ssh/id_rsa)
- Network connectivity to remote host
All local operations work remotely:
# Remote monitoring
gpukill --remote server --list --details --watch
# Remote process management
gpukill --remote server --kill --pid 1234
gpukill --remote server --kill --filter "python.*" --batch
# Remote GPU control
gpukill --remote server --reset --gpu 0
gpukill --remote server --reset --all
# Remote auditing
gpukill --remote server --audit --audit-summary

The GPU Kill dashboard is a modern web interface built with Nuxt.js and Tailwind CSS for real-time cluster monitoring.
Check the Kill Suite website.
GPU Kill includes an MCP server that enables AI assistants and other tools to interact with GPU management functionality through a standardized interface.
The MCP server provides a JSON-RPC interface that allows AI assistants to:
- Monitor GPU health and performance
- Kill problematic processes
- Reset crashed GPUs
- Scan for security threats
- Manage resource policies
- Automate GPU operations
The MCP server is built as a separate crate (gpukill-mcp) that integrates with the main GPU Kill functionality:
- HTTP Server: Runs on port 3001 (configurable via MCP_PORT)
- JSON-RPC Protocol: Standard MCP protocol for AI integration
- Resource Handler: Provides read-only access to GPU data
- Tool Handler: Executes GPU management actions
- Cross-platform: Works on macOS, Linux, and Windows
The MCP server exposes the following resources for AI assistants to read:
Current GPU status and utilization data including:
- GPU ID, name, and vendor
- Memory usage and total capacity
- Utilization percentage
- Temperature and power usage
- Active processes
Currently running GPU processes with:
- Process ID and name
- Memory usage
- User information
- GPU assignment
Historical GPU usage data including:
- Usage patterns over time
- Process execution history
- Resource utilization trends
- User activity logs
Current Guard Mode policies with:
- User-specific limits
- Group policies
- GPU-specific restrictions
- Time-based overrides
Security scan results including:
- Suspicious processes
- Crypto miner detection
- Resource abuse patterns
- Data exfiltration attempts
The MCP server provides the following tools for AI assistants to execute:
Kill a specific GPU process by PID:
{
"name": "kill_gpu_process",
"arguments": {
"pid": 12345,
"force": false
}
}

Reset a specific GPU:
{
"name": "reset_gpu",
"arguments": {
"gpu_id": 0,
"force": false
}
}

Scan for suspicious GPU activity:
{
"name": "scan_rogue_activity",
"arguments": {
"hours": 24
}
}

Create a user policy for Guard Mode:
{
"name": "create_user_policy",
"arguments": {
"username": "developer",
"memory_limit_gb": 8.0,
"utilization_limit_pct": 70.0,
"process_limit": 3
}
}

Get detailed status of a specific GPU:
{
"name": "get_gpu_status",
"arguments": {
"gpu_id": 0
}
}

Kill all processes matching a name pattern:
{
"name": "kill_processes_by_name",
"arguments": {
"pattern": "python.*train",
"force": false
}
}

- POST /mcp - Main MCP JSON-RPC endpoint
- GET /health - Health check endpoint
- initialize - Initialize the MCP connection
- resources/list - List available resources
- resources/read - Read resource contents
- tools/list - List available tools
- tools/call - Execute a tool
The MCP server can be configured using environment variables:
- MCP_HOST - Bind address (default: 127.0.0.1). Use 127.0.0.1 for local-only access. Set to 0.0.0.0 only if you need remote access and have other protections (e.g. firewall, auth).
- MCP_PORT - Port to listen on (default: 3001)
- RUST_LOG - Logging level (default: info)
# Start the MCP server
cargo run --release -p gpukill-mcp
# Or with custom port
MCP_PORT=3001 cargo run --release -p gpukill-mcp

# Health check
curl -X GET http://localhost:3001/health
# List available tools
curl -X POST http://localhost:3001/mcp \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":"1","method":"tools/list","params":{}}'
# Get GPU list
curl -X POST http://localhost:3001/mcp \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":"2","method":"resources/read","params":{"uri":"gpu://list"}}'

# Run in development mode
cargo run -p gpukill-mcp
# Run with debug logging
RUST_LOG=debug cargo run -p gpukill-mcp
# Build release version
cargo build --release -p gpukill-mcp

Ask your AI assistant to use the MCP tools with natural language:
What GPUs do I have and what's their current usage?
Kill the Python process that's stuck on GPU 0
Kill all training processes that are using too much GPU memory
Show me GPU usage and kill any stuck processes
Scan for crypto miners and suspicious activity
Create a policy to limit user memory usage to 8GB
Reset GPU 1 because it's not responding
What processes are currently using my GPUs?
For detailed MCP server documentation, see mcp/README.md.
The table format provides a clean, human-readable view of GPU information:
┌─────┬──────────────────────┬─────────────────┬──────────┬──────────┬─────────┬─────────────┬──────┬─────────────────────┐
│ GPU │ NAME │ MEM_USED/TOTAL │ UTIL(%) │ TEMP(°C) │ POWER(W)│ ECC(volatile)│ PIDS │ TOP_PROC │
├─────┼──────────────────────┼─────────────────┼──────────┼──────────┼─────────┼─────────────┼──────┼─────────────────────┤
│ 0 │ NVIDIA GeForce RTX...│ 2.0/8.0 GiB │ 45.2 │ 72 │ 150.3 │ 0 │ 2 │ python:12345:1024MB │
└─────┴──────────────────────┴─────────────────┴──────────┴──────────┴─────────┴─────────────┴──────┴─────────────────────┘
Columns:
- GPU: GPU index
- NAME: GPU model name
- MEM_USED/TOTAL: Memory usage in GiB
- UTIL(%): GPU utilization percentage
- TEMP(°C): Current temperature
- POWER(W): Current power consumption
- ECC(volatile): ECC error count (if available)
- PIDS: Number of processes using this GPU
- TOP_PROC: Highest memory-using process (format: name:pid:memory)
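Since TOP_PROC packs three values into one cell, scripts scraping the table need to split it. A tiny sketch (parse_top_proc is a hypothetical helper):

```python
def parse_top_proc(cell: str) -> tuple[str, int, int]:
    """Split a TOP_PROC cell such as 'python:12345:1024MB' into (name, pid, MB)."""
    name, pid, mem = cell.rsplit(":", 2)   # rsplit tolerates ':' inside the name
    return name, int(pid), int(mem.removesuffix("MB"))
```

For anything beyond a quick one-liner, prefer --output json, which exposes the same data as structured fields.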
When using --details, additional process rows are shown:
┌─────┬───────┬───────────┬─────────┬─────────┬─────────────┬─────────────┐
│ GPU │ PID   │ USER      │ PROC    │ VRAM_MB │ START_TIME  │ CONTAINER   │
├─────┼───────┼───────────┼─────────┼─────────┼─────────────┼─────────────┤
│ 0   │ 12345 │ developer │ python  │ 1024    │ 1h 30m      │ -           │
│ 0   │ 12346 │ developer │ python  │ 512     │ 45m         │ -           │
└─────┴───────┴───────────┴─────────┴─────────┴─────────────┴─────────────┘
JSON output provides structured data for scripting and automation:
{
"host": "workstation",
"ts": "2024-01-01T12:00:00.000Z",
"gpus": [
{
"gpu_index": 0,
"name": "NVIDIA GeForce RTX 4090",
"mem_used_mb": 2048,
"mem_total_mb": 8192,
"util_pct": 45.2,
"temp_c": 72,
"power_w": 150.3,
"ecc_volatile": 0,
"pids": 2,
"top_proc": {
"gpu_index": 0,
"pid": 12345,
"user": "developer",
"proc_name": "python",
"used_mem_mb": 1024,
"start_time": "1h 30m",
"container": null
}
}
],
"procs": [
{
"gpu_index": 0,
"pid": 12345,
"user": "developer",
"proc_name": "python",
"used_mem_mb": 1024,
"start_time": "1h 30m",
"container": null
}
]
}

Create a configuration file at ~/.config/gpukill/config.toml:
# Logging
log_level = "info"
# Output
output_format = "table"
use_colors = true
table_width = 120
# Process management
default_timeout_secs = 5
max_processes_summary = 10
# Watch mode
watch_interval_secs = 2
# Display options
show_details = false

| Variable | Description | Default |
|---|---|---|
| GPUKILL_LOG_LEVEL | Log level (trace, debug, info, warn, error) | info |
| GPUKILL_OUTPUT_FORMAT | Output format (table, json) | table |
| GPUKILL_DEFAULT_TIMEOUT | Default timeout in seconds | 5 |
| GPUKILL_SHOW_DETAILS | Show detailed process information | false |
| GPUKILL_WATCH_INTERVAL | Watch mode refresh interval | 2 |
| GPUKILL_TABLE_WIDTH | Table width limit | 120 |
| GPUKILL_USE_COLORS | Enable/disable colored output | true |
- Command-line arguments (highest priority)
- Environment variables
- Configuration file
- Default values (lowest priority)
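The precedence order can be sketched as a simple lookup chain. This is illustrative only — resolve is a hypothetical helper, and the env-var mapping here is simplified to GPUKILL_ plus the upper-cased key:

```python
import os

DEFAULTS = {"log_level": "info", "output_format": "table"}

def resolve(key: str, cli_args: dict, file_config: dict) -> str:
    """CLI arguments beat environment variables, which beat the config
    file, which beats built-in defaults."""
    if key in cli_args:
        return cli_args[key]
    env_value = os.environ.get("GPUKILL_" + key.upper())
    if env_value is not None:
        return env_value
    if key in file_config:
        return file_config[key]
    return DEFAULTS[key]
```

So with GPUKILL_LOG_LEVEL=debug exported and log_level = "warn" in the config file, gpukill logs at debug unless --log-level is passed explicitly.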
- Existence Validation: Verifies the target process exists before attempting termination
- GPU Usage Check: Confirms the process is actually using a GPU (unless --force is used)
- Graceful Shutdown: Sends SIGTERM first for clean process termination
- Escalation Control: Only escalates to SIGKILL with explicit --force flag
- Timeout Protection: Prevents indefinite waiting with configurable timeouts
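The SIGTERM-first behavior with --force escalation can be sketched in Python (POSIX semantics; the child below deliberately ignores SIGTERM so the escalation path is exercised):

```python
import subprocess
import sys

def kill_with_escalation(proc: subprocess.Popen, timeout_secs: float, force: bool) -> str:
    proc.terminate()                      # graceful shutdown: SIGTERM
    try:
        proc.wait(timeout=timeout_secs)
        return "terminated"
    except subprocess.TimeoutExpired:
        if not force:
            return "still running"        # without --force, never escalate
        proc.kill()                       # escalate: SIGKILL
        proc.wait()
        return "killed"

# A stubborn child that ignores SIGTERM, standing in for a hung GPU process.
child = subprocess.Popen(
    [sys.executable, "-u", "-c",
     "import signal, time\n"
     "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
     "print('ready', flush=True)\n"
     "time.sleep(60)"],
    stdout=subprocess.PIPE, text=True)
child.stdout.readline()                   # wait until the SIGTERM handler is installed
outcome = kill_with_escalation(child, timeout_secs=1.0, force=True)
```

This mirrors the --timeout-secs and --force flags: without force, a process that survives SIGTERM is left running and reported rather than killed.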
- Process Detection: Lists all active processes before reset
- Confirmation Required: Requires --force flag if active processes are detected
- Index Validation: Verifies GPU index exists before reset attempt
- Operation Support: Checks if reset is supported on the target GPU
- Clear Messaging: Provides detailed error messages for unsupported operations
- Actionable Messages: Clear, specific error messages with suggested solutions
- Appropriate Exit Codes: Different exit codes for different failure modes
- Graceful Degradation: Continues operation when non-critical components fail
- NVML Fallback: Handles cases where NVML is unavailable with helpful messages
| Code | Meaning | Description |
|---|---|---|
| 0 | Success | Operation completed successfully |
| 1 | General Error | Unspecified error occurred |
| 2 | NVML Failure | NVML initialization failed |
| 3 | Invalid Arguments | Command-line argument validation failed |
| 4 | Permission Error | Insufficient permissions for operation |
| 5 | Unsupported Operation | Operation not supported on this system |
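Scripts wrapping gpukill can branch on these codes; for example (a sketch with hypothetical helpers):

```python
EXIT_MEANINGS = {
    0: "Success",
    1: "General Error",
    2: "NVML Failure",
    3: "Invalid Arguments",
    4: "Permission Error",
    5: "Unsupported Operation",
}

def describe_exit(code: int) -> str:
    """Human-readable meaning for a gpukill exit code."""
    return EXIT_MEANINGS.get(code, f"Unknown exit code {code}")

def should_retry_with_sudo(code: int) -> bool:
    """A wrapper might retry only when the failure was permission-related."""
    return code == 4
```

A shell wrapper would apply the same logic to `$?` after invoking gpukill, retrying under sudo only on code 4 rather than on every non-zero exit.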
- Cause: NVIDIA drivers are not installed, are outdated, or the NVML library is not accessible.
- Solution:
- Ensure NVIDIA drivers are properly installed and up to date.
- Verify that your GPU is recognized by the system (e.g., run nvidia-smi on Linux/Windows).
- Check if you have the necessary permissions to access NVML (you might need to run gpukill with sudo for some operations).
- Cause: AMD ROCm drivers are not installed, or rocm-smi is not in your PATH.
- Solution:
- Install ROCm drivers for your AMD GPU
- Ensure rocm-smi is accessible from your terminal
- Check if you have the necessary permissions to access AMD GPU information
- Cause: Intel GPU tools are not installed, or intel_gpu_top is not in your PATH.
- Solution:
- Install intel-gpu-tools package for Intel GPU support
- Ensure intel_gpu_top is accessible from your terminal
- Check if you have the necessary permissions to access Intel GPU information
- Cause: The current user does not have the necessary privileges to perform the requested action (e.g., killing a process owned by another user, resetting a GPU).
- Solution:
- For process management, ensure you have rights to manage the target PID.
- For GPU reset or other system-level operations, try running gpukill with sudo.
- Consult your system's documentation for managing user permissions for NVIDIA/AMD devices.
- Cause: The specified GPU index does not exist, or the GPU is not properly detected.
- Solution:
- Use gpukill --list to see available GPU indices.
- Ensure your GPU is physically connected and powered on.
- Verify that your GPU drivers are correctly installed and recognize the GPU.
- Cause: No supported GPU vendors (NVIDIA, AMD, Intel, or Apple Silicon) could be initialized or found on the system.
- Solution:
- Ensure at least one supported GPU vendor's drivers and management tools are correctly installed.
- Check system logs for driver-related errors.
- Cause: Invalid regex pattern or no processes match the filter.
- Solution:
- Verify your regex pattern is correct
- Use gpukill --list --details to see available processes
- Test your pattern with a simple filter first
- Cause: Container runtime not detected or process not in container.
- Solution:
- Ensure container runtime (Docker, Podman, etc.) is running
- Check if the process is actually running in a container
- Container detection is best-effort and may not work in all environments
- Cause: Permission issues or processes not found.
- Solution:
- Ensure you have permission to kill the target processes
- Use the --force flag if processes are unresponsive
- Check that the filter pattern matches existing processes
Guard Mode provides soft policy enforcement to prevent GPU resource abuse with safe testing capabilities. It allows administrators to set policies for users, groups, and GPUs, with configurable enforcement modes and comprehensive monitoring.
Guard Mode is designed to:
- Prevent Resource Abuse: Set limits on memory usage, GPU utilization, and concurrent processes
- Safe Testing: Dry-run mode allows testing policies without affecting running processes
- Flexible Enforcement: Choose between soft warnings and hard enforcement actions
- Real-time Monitoring: Live policy violation detection and alerting
Guard Mode configuration is stored in TOML format at:
- Linux: `~/.local/share/gpukill/guard_mode_config.toml`
- macOS: `~/Library/Application Support/gpukill/guard_mode_config.toml`
- Windows: `%APPDATA%\gpukill\guard_mode_config.toml`
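The per-platform paths above can be resolved programmatically along these lines. This is a hedged sketch, not gpukill's own code; the function and its `system`/`home` parameters are invented for illustration:

```python
import os

def guard_config_path(system: str, home: str) -> str:
    """Resolve the Guard Mode config path for a platform name
    ("Linux", "Darwin", or "Windows"), mirroring the locations listed above."""
    if system == "Linux":
        return os.path.join(home, ".local/share/gpukill/guard_mode_config.toml")
    if system == "Darwin":
        return os.path.join(home, "Library/Application Support/gpukill/guard_mode_config.toml")
    if system == "Windows":
        appdata = os.environ.get("APPDATA", "")
        return os.path.join(appdata, "gpukill", "guard_mode_config.toml")
    raise ValueError(f"unsupported platform: {system}")

print(guard_config_path("Linux", "/home/alice"))
# /home/alice/.local/share/gpukill/guard_mode_config.toml
```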
Control resource usage per user:
```toml
[user_policies.developer]
username = "developer"
memory_limit_gb = 8.0
utilization_limit_pct = 70.0
duration_limit_hours = 12.0
max_concurrent_processes = 3
priority = 5
allowed_gpus = []
blocked_gpus = []
time_overrides = []
```

Control resource usage per group with member management:
```toml
[group_policies.researchers]
group_name = "researchers"
total_memory_limit_gb = 32.0
total_utilization_limit_pct = 80.0
max_concurrent_processes = 10
priority = 3
allowed_gpus = [0, 1]
blocked_gpus = []
members = ["alice", "bob", "charlie"]
```

Key Features:
- Member Management: Specify which users belong to the group
- Total Resource Limits: Set aggregate limits for all group members
- CLI Support: Add members via `--guard-group-members "user1,user2,user3"`
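The aggregate group limit described above can be evaluated roughly as follows. This is an illustrative sketch; the function, field names, and sample data are assumptions, not gpukill's actual enforcement code:

```python
def group_violations(policy, usage_by_user):
    """Check a group policy against per-user GPU memory usage (GB).
    Only usage by group members counts toward the aggregate limit."""
    members = set(policy["members"])
    total = sum(gb for user, gb in usage_by_user.items() if user in members)
    if total > policy["total_memory_limit_gb"]:
        return [f"group {policy['group_name']}: {total:.1f} GB exceeds "
                f"limit {policy['total_memory_limit_gb']:.1f} GB"]
    return []

policy = {"group_name": "researchers",
          "members": ["alice", "bob", "charlie"],
          "total_memory_limit_gb": 32.0}
usage = {"alice": 20.0, "bob": 15.0, "dave": 40.0}  # dave is not a member

print(group_violations(policy, usage))  # one violation: 35.0 GB > 32.0 GB limit
```

Note that dave's usage is ignored: only members listed in the policy contribute to the group total.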
Control access to specific GPUs with user restrictions:
```toml
[gpu_policies."0"]
gpu_index = 0
max_memory_gb = 24.0
max_utilization_pct = 90.0
reserved_memory_gb = 2.0
allowed_users = ["alice", "bob"]
blocked_users = []
# maintenance_window is optional; omit the key when no window is configured
```

Key Features:
- User Access Control: Specify which users can access specific GPUs
- Reserved Memory: Set aside memory that cannot be used by processes
- CLI Support: Add allowed users via `--guard-gpu-allowed-users "user1,user2,user3"`
- Flexible Access: Leave `allowed_users` empty to allow all users
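The access rules above (block list takes precedence; an empty allow list means open access) can be sketched as a small predicate. The function and policy dictionaries here are illustrative assumptions, not gpukill internals:

```python
def can_access_gpu(policy, user):
    """blocked_users always wins; an empty allowed_users list means open access."""
    if user in policy["blocked_users"]:
        return False
    allowed = policy["allowed_users"]
    return not allowed or user in allowed

open_gpu = {"allowed_users": [], "blocked_users": ["mallory"]}
restricted = {"allowed_users": ["alice", "bob"], "blocked_users": []}

print(can_access_gpu(open_gpu, "carol"))    # True  (empty allow list)
print(can_access_gpu(open_gpu, "mallory"))  # False (explicitly blocked)
print(can_access_gpu(restricted, "carol"))  # False (not in allow list)
```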
Control resource usage during specific time periods:
```toml
[[time_policies]]
name = "business_hours"
start_time = "09:00"
end_time = "17:00"
days_of_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
memory_limit_gb = 16.0
utilization_limit_pct = 80.0
max_concurrent_processes = 5
```

Safe testing without affecting running processes:
- Simulation Only: All policy violations are simulated
- No Actions Taken: Processes continue running normally
- Detailed Logging: Shows exactly what would happen
- Safe Testing: Perfect for policy validation
Warnings and notifications before hard actions:
- Warning Notifications: Send alerts for policy violations
- Grace Period: Allow time for users to adjust
- Escalation: Progress to hard enforcement if violations persist
- Logging: Record all violations and warnings
Immediate action on policy violations:
- Process Termination: Kill processes that violate policies
- Resource Limits: Enforce memory and utilization limits
- Access Control: Block access to restricted GPUs
- Immediate Action: No grace period for critical violations
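As a rough sketch of how a time policy like `business_hours` above could be matched against the current time, consider the following. This is an illustrative assumption (it also assumes `start_time < end_time`, i.e. no overnight windows), not gpukill's actual scheduler:

```python
from datetime import datetime

def policy_active(policy, now):
    """True if `now` falls inside the policy's day-of-week and HH:MM window.
    Assumes start_time < end_time (no overnight windows)."""
    if now.strftime("%A") not in policy["days_of_week"]:
        return False
    # Zero-padded HH:MM strings compare correctly as strings.
    return policy["start_time"] <= now.strftime("%H:%M") < policy["end_time"]

business_hours = {
    "start_time": "09:00", "end_time": "17:00",
    "days_of_week": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
}

print(policy_active(business_hours, datetime(2024, 1, 8, 10, 30)))  # Monday morning: True
print(policy_active(business_hours, datetime(2024, 1, 6, 10, 30)))  # Saturday: False
```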
```bash
# Enable Guard Mode
gpukill --guard --guard-enable

# Disable Guard Mode
gpukill --guard --guard-disable

# View current configuration
gpukill --guard --guard-config

# Set dry-run mode (safe testing)
gpukill --guard --guard-dry-run

# Set enforcement mode (live enforcement)
gpukill --guard --guard-enforce
```

User Policies:
```bash
# Add user policy
gpukill --guard --guard-add-user "developer" --guard-memory-limit 8.0 --guard-utilization-limit 70.0 --guard-process-limit 3

# Remove user policy
gpukill --guard --guard-remove-user "developer"

# Update policy limits
gpukill --guard --guard-memory-limit 16.0 --guard-utilization-limit 80.0 --guard-process-limit 5
```

Group Policies:
```bash
# Add group policy with members
gpukill --guard --guard-add-group "developers" --guard-group-members "alice,bob,charlie" --guard-group-memory-limit 32.0 --guard-group-utilization-limit 80.0 --guard-group-process-limit 15

# Add group policy without members (empty group)
gpukill --guard --guard-add-group "testers" --guard-group-memory-limit 16.0

# Remove group policy
gpukill --guard --guard-remove-group "developers"
```

GPU Policies:
```bash
# Add GPU policy with allowed users
gpukill --guard --guard-add-gpu 0 --guard-gpu-allowed-users "alice,bob" --guard-gpu-memory-limit 24.0 --guard-gpu-utilization-limit 90.0 --guard-gpu-reserved-memory 2.0

# Add GPU policy allowing all users
gpukill --guard --guard-add-gpu 1 --guard-gpu-memory-limit 16.0

# Remove GPU policy
gpukill --guard --guard-remove-gpu 0
```

Additional CLI Options:
- `--guard-group-members <MEMBERS>`: Comma-separated list of group members
- `--guard-gpu-allowed-users <USERS>`: Comma-separated list of allowed users for a GPU
- `--guard-group-memory-limit <GB>`: Group memory limit in GB
- `--guard-group-utilization-limit <PERCENT>`: Group utilization limit percentage
- `--guard-group-process-limit <COUNT>`: Group process limit count
- `--guard-gpu-memory-limit <GB>`: GPU memory limit in GB
- `--guard-gpu-utilization-limit <PERCENT>`: GPU utilization limit percentage
- `--guard-gpu-reserved-memory <GB>`: GPU reserved memory in GB
```bash
# Test policies in dry-run mode
gpukill --guard --guard-test-policies

# Toggle dry-run mode
gpukill --guard --guard-toggle-dry-run
```

```bash
# Export configuration
gpukill --guard --guard-export-config > guard_config.json

# Import configuration
gpukill --guard --guard-import-config guard_config.json
```

```http
# Get Guard Mode configuration
GET /api/guard/config

# Update Guard Mode configuration
POST /api/guard/config
Content-Type: application/json

{
  "global": {
    "enabled": true,
    "dry_run": true,
    "default_memory_limit_gb": 16.0,
    "default_utilization_limit_pct": 80.0
  }
}
```

```http
# Get policies
GET /api/guard/policies

# Update policies
POST /api/guard/policies
Content-Type: application/json

{
  "user_policies": {
    "developer": {
      "username": "developer",
      "memory_limit_gb": 8.0,
      "utilization_limit_pct": 70.0,
      "max_concurrent_processes": 3
    }
  }
}
```

```http
# Get Guard Mode status
GET /api/guard/status

# Toggle dry-run mode
POST /api/guard/toggle-dry-run

# Test policies
POST /api/guard/test-policies
```

- Excessive Memory Usage: Process exceeds memory limit
- Memory Hoarding: Long-running processes with high memory usage
- Memory Leaks: Processes with continuously increasing memory usage
- High GPU Utilization: Process exceeds utilization limit
- Sustained High Usage: Long periods of high GPU utilization
- Resource Waste: Processes with low efficiency
- Too Many Processes: User exceeds concurrent process limit
- Long-running Processes: Processes exceeding duration limits
- Unauthorized Processes: Processes not allowed by policy
- GPU Access: Attempting to use blocked GPUs
- Time Restrictions: Using GPUs during restricted hours
- User Restrictions: Unauthorized user access
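The memory-leak signal listed above ("continuously increasing memory usage") can be approximated by checking for steady growth across usage samples. The heuristic below, including its threshold and sampling scheme, is an illustrative assumption rather than gpukill's actual detector:

```python
def looks_like_leak(samples_mb, min_growth_mb=100.0):
    """Flag a process whose memory usage rises monotonically across
    samples with at least `min_growth_mb` of total growth."""
    if len(samples_mb) < 3:
        return False  # too few samples to judge a trend
    rising = all(b >= a for a, b in zip(samples_mb, samples_mb[1:]))
    return rising and (samples_mb[-1] - samples_mb[0]) >= min_growth_mb

print(looks_like_leak([1000, 1150, 1320, 1500]))  # True: monotonic, +500 MB
print(looks_like_leak([1000, 1150, 900, 1500]))   # False: usage dipped
```

In practice such a detector would need smoothing and a longer observation window to avoid flagging legitimate workloads that ramp up allocation.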
- Console Notifications: Display warnings in terminal
- Log File Entries: Record warnings in log files
- Email Alerts: Send email notifications (if configured)
- Webhook Notifications: Send alerts to external systems
- Process Termination: Kill violating processes
- Resource Limits: Enforce memory and utilization limits
- Access Blocking: Prevent access to restricted resources
- User Notifications: Inform users of policy violations
- Start Conservative: Begin with generous limits and tighten over time
- Test Thoroughly: Use dry-run mode extensively before enabling enforcement
- Monitor Closely: Watch for false positives and adjust policies accordingly
- Document Policies: Keep clear records of policy decisions and changes
- Gradual Rollout: Enable policies for a subset of users first
- User Communication: Inform users about new policies and limits
- Training: Provide guidance on policy compliance
- Feedback Loop: Collect user feedback and adjust policies
- Regular Reviews: Periodically review policy effectiveness
- Violation Analysis: Analyze patterns in policy violations
- Performance Impact: Monitor system performance under policies
- User Satisfaction: Track user satisfaction with policy enforcement
- False Positives: Adjust policy thresholds if legitimate processes are flagged
- Performance Impact: Monitor system performance under policy enforcement
- User Complaints: Address user concerns about policy restrictions
- Configuration Errors: Validate policy configuration syntax
- Enable Debug Logging: Use `RUST_LOG=debug` for detailed logs
- Test Policies: Use dry-run mode to validate policy behavior
- Check Configuration: Verify policy configuration files
- Monitor Violations: Review violation logs for patterns
```bash
# Debug build (fastest, ~3 seconds)
cargo build

# Fast release build (recommended for development, ~28 seconds)
cargo build --profile release-fast

# Standard release build (production-ready, ~28 seconds)
cargo build --release

# Maximum optimization (slowest, best performance, ~60+ seconds)
cargo build --profile release-max

# Run tests
cargo test

# Run with logging
RUST_LOG=debug cargo run -- --list
```

The project includes multiple build profiles optimized for different use cases:
- `dev`: Fast debug builds for development
- `release-fast`: Optimized for development with good performance
- `release`: Balanced optimization for production use
- `release-max`: Maximum optimization for final releases
Performance improvements made:
- Changed from fat LTO (`lto = true`) to thin LTO (`lto = "thin"`)
- Increased codegen units from 1 to 4 for parallel compilation
- Added fast release profile for development workflows
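The profile changes above correspond to Cargo settings along these lines. This is a sketch of the idea; the exact values in the project's `Cargo.toml` may differ:

```toml
# Faster release builds: thin LTO instead of fat LTO,
# and more codegen units to allow parallel compilation.
[profile.release]
lto = "thin"
codegen-units = 4

# Development-oriented release profile inheriting from release.
[profile.release-fast]
inherits = "release"
opt-level = 2
```

Thin LTO keeps most of the cross-crate optimization benefit of fat LTO while allowing the linker-time work to parallelize, which is where most of the build-time savings come from.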
The project includes comprehensive tests:
```bash
# Run all tests
cargo test

# Run specific test module
cargo test args::tests

# Run integration tests
cargo test --test integration_tests

# Run with mock NVML (for CI)
cargo test --features mock_nvml
```

- Define CLI arguments in `src/args.rs`
- Implement core logic in the appropriate module
- Add tests for new functionality
- Update documentation in this file
- Test on multiple platforms
- NVML Caching: GPU information is cached during snapshot creation
- Process Enumeration: Uses efficient system calls for process information
- Memory Management: Minimizes allocations in hot paths
- Error Handling: Fast-path for common success cases
- Built with NVML Wrapper for NVIDIA GPU management
- AMD GPU support via ROCm and rocm-smi
- Intel GPU support via intel-gpu-tools and intel_gpu_top
- Apple Silicon GPU support via macOS system_profiler and system APIs
- Uses Clap for command-line argument parsing
- Table rendering powered by Tabled
- Error handling with Color Eyre
- Process management via Sysinfo
- Regex processing with Regex
- Signal handling with Nix