feat(amp-worker-gc): standalone GC job for control-plane scheduling#1989
Closed
feat(amp-worker-gc): standalone GC job for control-plane scheduling#1989
Conversation
c571586 to
74eb098
Compare
c95850e to
ec68363
Compare
…eduling Extract garbage collection from the worker's compaction task into a standalone job type managed by the controller via the job ledger. This decouples GC from compaction so it can run independently regardless of whether materialization jobs are active. New crate: amp-worker-gc with job descriptor, idempotency key, and collection algorithm. Controller schedules GC jobs per active physical table revision on a 60s interval. Workers execute them using the same stream-expired → delete- metadata → delete-files algorithm as the existing Collector.
Add GcSchedulingConfig with `enabled` (default false) and `interval` (default 60s) fields so GC scheduling can be toggled without code changes. The controller only spawns the GC scheduling task when enabled, preventing unintended GC job creation on deployment.
…egration tests - Add GcMetrics (expired_files_found, metadata_entries_deleted, files_deleted, files_not_found) with OpenTelemetry counters keyed by location_id - Add last_success_at check in schedule_gc_jobs() to respect the configured interval between GC runs per location (RFC compliance) - Add 3 integration tests in tests/src/tests/it_gc.rs verifying the full collection algorithm against real Postgres + filesystem - Update config.sample.toml with [gc_scheduling] section
The workspace_crates_match_amp_crates_list test validates that the hardcoded AMP_CRATES list matches actual workspace members. Adding the new amp-worker-gc crate to the workspace requires updating this list.
…e_ref tracing - Add GcSchedulerMetrics to controller scheduler with gc_jobs_dispatched_total, gc_jobs_skipped_in_flight_total, and gc_jobs_skipped_too_recent_total counters - Add table_ref field to GC job execution tracing span, recorded from the revision path after lookup - Pass Meter to Scheduler for metrics initialization
… entries Instead of paginating through all active revisions and scheduling no-op GC jobs, query gc_manifest for distinct location_ids with expired entries. This reduces unnecessary job ledger writes and worker notifications. Also removes the active-only filter so deactivated revisions are GC-eligible. GC is only concerned with the state of the gc_manifest, not the revision's active status.
… entries Instead of paginating through all active revisions and scheduling no-op GC jobs, query gc_manifest for distinct location_ids with expired entries. This reduces unnecessary job ledger writes and worker notifications. Also removes the active-only filter so deactivated revisions are GC-eligible. GC is only concerned with the state of the gc_manifest, not the revision's active status. Add gc::locations_with_expired_entries() query to metadata-db and 3 integration tests verifying pre-filter behavior, empty results, and deactivated revisions.
…ests Replace raw pgtemp usage with the testlib MetadataDbFixture for consistency with other integration tests. Extract shared setup into a GcTestCtx wrapper struct with helper methods (register_revision, register_file, file_ids, gc_context, data_store). Remove pgtemp direct dependency from tests crate.
Add two integration tests covering the scheduler's in-flight skip and recently-completed cooldown logic in schedule_gc_jobs(), plus helper methods on GcTestCtx to reduce boilerplate.
821dc2d to
1edae45
Compare
LNSD
requested changes
Mar 23, 2026
Contributor
LNSD
left a comment
There was a problem hiding this comment.
Can you chunk this PR down into smaller pieces?
For example, one PR for the amp-worker-gc job crate; then another to wire it to the worker, and after that another to wire it to the scheduler.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Extracts garbage collection from the worker's compaction task into a standalone job type managed by the controller via the job ledger. Previously, GC only ran as a side effect of compaction — if no materialization jobs were active, no garbage was ever collected. This PR gives GC its own independent lifecycle with scheduling, retries, and observability.
Key design decisions
gc_manifestfor locations with expired files rather than scanning all revisions. This avoids creating no-op jobs for locations that don't need cleanup.enabled = false): Deploying this code changes nothing until the config toggle is flipped. This allows safe rollout — enable the new GC only after disabling the legacy worker-internal GC.gc_intervalcooldown).Changes
amp-worker-gc: Job descriptor, idempotency key, error types with retryability classification, OTel metrics, and collection algorithm (stream expired files → delete metadata → delete physical files)Gcvariant inJobDescriptorenum with execution wiring injob_impl.rsschedule_gc_jobs()method with pre-filter query, in-flight dedup, and completion cooldown. Background task spawned conditionally based on config.gc_jobs_dispatched_total,gc_jobs_skipped_in_flight_total,gc_jobs_skipped_too_recent_totalGcSchedulingConfigwithenabled(defaultfalse) andinterval(default60s)gc::locations_with_expired_entries()query for scheduler pre-filteringFollow-up
A follow-up PR will be required to remove the legacy worker-internal GC code from compaction, as per the phased rollout plan.
Test plan
LocationNotFoundfor invalid location IDs