Why
Right now we have one periodic job (`resolve_taxa`, scheduled via Cloud Scheduler → Cloud Run Job after #446 lands). That setup is the right fit for a single idempotent task. Once we have a second or third job, the per-cron Terraform / console wiring stops paying off and a job framework starts to.
Apalis with the Postgres backend is the leading candidate:
- No new infrastructure — reuses our existing Cloud SQL.
- Durable job state in the same DB as the data the jobs operate on.
- Built-in retries, dead-letter, scheduled / cron / one-off triggers.
- HTTP-triggerable jobs come for free (one of the main reasons we'd graduate from Cloud Scheduler).
When to do this
Trigger conditions, any one of which makes the migration worth doing:
- A second periodic job lands (likely candidates: Wikidata thumbnail refresh, IUCN conservation-status sync, GBIF backbone freshness check on hot `taxa` rows).
- We need an ad-hoc / API-triggered job ("re-resolve taxon X now", "backfill kingdom Y").
- Retry semantics beyond "next scheduled pass picks it up" are needed for any job.
Until at least one of these is true, the Cloud Scheduler + one-shot binary pattern stays the right call.
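To make the retry trigger concrete, here is a minimal sketch of what "retry semantics beyond the next scheduled pass" could look like: bounded attempts with exponential backoff, then a dead-letter state. It is plain Rust with no Apalis types. `Outcome`, `backoff`, `run_with_retries`, and the 1 s / 60 s bounds are illustrative assumptions, and in practice the framework's own retry support would play this role.

```rust
use std::time::Duration;

/// Final state of a job once the retry policy has run its course.
#[derive(Debug, PartialEq)]
enum Outcome<T> {
    /// The job succeeded on the given 1-based attempt.
    Done { value: T, attempt: u32 },
    /// Every attempt failed; the job is parked for manual inspection.
    DeadLettered { attempts: u32, last_error: String },
}

/// Delay before the next attempt: `base * 2^(attempt - 1)`, capped at `max`.
fn backoff(base: Duration, max: Duration, attempt: u32) -> Duration {
    let factor = 2u32.saturating_pow(attempt.saturating_sub(1));
    base.saturating_mul(factor).min(max)
}

/// Runs `job` up to `max_attempts` times, calling `sleep` between failures.
/// `sleep` is injected so the policy can be exercised without real waiting.
fn run_with_retries<T>(
    max_attempts: u32,
    mut job: impl FnMut(u32) -> Result<T, String>,
    mut sleep: impl FnMut(Duration),
) -> Outcome<T> {
    let mut last_error = String::from("no attempts were made");
    for attempt in 1..=max_attempts {
        match job(attempt) {
            Ok(value) => return Outcome::Done { value, attempt },
            Err(err) => last_error = err,
        }
        // Only wait when another attempt will actually follow.
        if attempt < max_attempts {
            sleep(backoff(Duration::from_secs(1), Duration::from_secs(60), attempt));
        }
    }
    Outcome::DeadLettered { attempts: max_attempts, last_error }
}
```

With the current Cloud Scheduler setup, the only equivalent of `DeadLettered` is a failed run in the logs; a job that needs this distinction tracked durably is a signal that the trigger condition has been met.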
Scope
- Add Apalis (`apalis` plus `apalis-sql` with its `postgres` feature) as a workspace dependency.
- New crate `observing-jobs` (or fold into an existing crate) hosting:
  - The Apalis runtime / worker entry point binary.
  - Job definitions as `apalis::Job` impls.
- Migration for the Apalis tables (separate sqlx migration, or via Apalis's own migrate fn — decide).
- Migrate `resolve_taxa` from a standalone binary to an Apalis job. Keep the same logic; cron triggers replace Cloud Scheduler.
- Replace the Cloud Run Job + Cloud Scheduler wiring with a single long-lived Cloud Run Service hosting the Apalis worker. Tear down the cron infra in IaC.
- Document how to add a new job (single example in CONTRIBUTING / docs).
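As a starting point for the "how to add a new job" doc, the payload side of a migrated `resolve_taxa` might look roughly like the sketch below. It is plain Rust and deliberately does not use any Apalis types or traits, since the exact API depends on the version we pin. `ResolveTaxa`, its variants, `NAME`, `describe`, and `parse_trigger` are all hypothetical names chosen for illustration.

```rust
/// Hypothetical payload for the taxon resolver. The variants mirror the
/// periodic pass plus the two ad-hoc triggers mentioned in this issue.
#[derive(Debug, Clone, PartialEq)]
enum ResolveTaxa {
    /// Periodic pass over everything pending (what a cron trigger would enqueue).
    Pending,
    /// Ad-hoc: "re-resolve taxon X now".
    Taxon(String),
    /// Ad-hoc: "backfill kingdom Y".
    Kingdom(String),
}

impl ResolveTaxa {
    /// Stable job name used as the queue key (our assumption, not an Apalis rule).
    const NAME: &'static str = "resolve_taxa";

    /// Placeholder for the handler body; the real job would call the
    /// existing resolver logic unchanged.
    fn describe(&self) -> String {
        match self {
            ResolveTaxa::Pending => "resolve all pending taxa".to_string(),
            ResolveTaxa::Taxon(id) => format!("re-resolve taxon {id}"),
            ResolveTaxa::Kingdom(name) => format!("backfill kingdom {name}"),
        }
    }
}

/// Maps a trigger (cron tick or HTTP request) to a payload, rejecting
/// anything that is not one of the supported shapes.
fn parse_trigger(kind: &str, arg: Option<&str>) -> Result<ResolveTaxa, String> {
    match (kind, arg) {
        ("pending", None) => Ok(ResolveTaxa::Pending),
        ("taxon", Some(id)) => Ok(ResolveTaxa::Taxon(id.to_string())),
        ("kingdom", Some(name)) => Ok(ResolveTaxa::Kingdom(name.to_string())),
        _ => Err(format!("unsupported trigger: {kind}")),
    }
}
```

Keeping the payload a plain data type means the one-shot CLI and the job worker could share it, whichever framework glue ends up wrapping it.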
Open questions
- Schema location: Apalis's tables in `apalis` schema vs `public` vs reuse of `appview` / `ingester`. Probably its own schema for clarity; revoke runtime-role write access on it from non-job services.
- Workers per service: one Cloud Run Service for all jobs, or one per job class? Start with one; split if a noisy neighbor emerges.
- Backwards compatibility: do we keep `resolve_taxa` as a one-shot CLI for ops use, or fully replace? Probably keep — it's useful for ad-hoc backfills outside the job system.
- Postgres connection pressure: Apalis polls the DB for jobs. Tune polling interval and worker count. Verify it doesn't move the needle on Cloud SQL load.
- Observability: how to surface job state in our existing tracing/Cloud Logging stack. Apalis has `apalis-prometheus` and tracing layers; pick one.
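For the connection-pressure question, a back-of-the-envelope estimate may be enough to pick initial settings before measuring. The sketch below assumes each worker issues one poll query per interval and ignores any batching or backoff the backend might do; `polls_per_second` and `polls_per_day` are hypothetical helpers, not Apalis functions.

```rust
use std::time::Duration;

/// Steady-state poll queries per second when each of `workers` polls the
/// database once per `interval`. `interval` must be non-zero.
fn polls_per_second(workers: u32, interval: Duration) -> f64 {
    f64::from(workers) / interval.as_secs_f64()
}

/// The same rate expressed per day, which is easier to compare against
/// Cloud SQL query metrics.
fn polls_per_day(workers: u32, interval: Duration) -> f64 {
    polls_per_second(workers, interval) * 86_400.0
}
```

For example, 4 workers polling every 500 ms come to 8 queries per second under these assumptions, while 2 workers polling every 5 s come to roughly 34,560 queries per day. The real numbers should still be verified against Cloud SQL load once a worker is running.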
Acceptance criteria