Migrate scheduled jobs to Apalis with Postgres backend #447

@frewsxcv

Why

Right now we have one periodic job (`resolve_taxa`, scheduled via Cloud Scheduler → Cloud Run Job after #446 lands). That setup is the right fit for a single idempotent task. Once we have a second or third job, the per-cron Terraform / console wiring stops paying off and a job framework starts to.

Apalis with the Postgres backend is the leading candidate:

  • No new infrastructure — reuses our existing Cloud SQL.
  • Durable job state in the same DB as the data the jobs operate on.
  • Built-in retries, dead-letter handling, and scheduled / cron / one-off triggers.
  • HTTP-triggerable jobs come for free (one of the main reasons we'd graduate from Cloud Scheduler).

When to do this

Trigger conditions, any one of which makes the migration worth doing:

  • A second periodic job lands (likely candidates: Wikidata thumbnail refresh, IUCN conservation-status sync, GBIF backbone freshness check on hot `taxa` rows).
  • We need an ad-hoc / API-triggered job ("re-resolve taxon X now", "backfill kingdom Y").
  • Retry semantics beyond "next scheduled pass picks it up" are needed for any job.

Until at least one of these is true, the Cloud Scheduler + one-shot binary pattern stays the right call.
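
To make the third trigger condition concrete, here is a minimal sketch of what retry semantics beyond "next scheduled pass picks it up" usually mean: exponential backoff with a cap, then dead-lettering once attempts run out. This is plain `std` Rust, not the Apalis API; `RetryPolicy`, `Outcome`, and `after_failure` are hypothetical names used only for illustration.

```rust
use std::time::Duration;

/// What to do after a failed attempt. Hypothetical type, not part of Apalis.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Try again after waiting this long.
    RetryAfter(Duration),
    /// Attempts exhausted: park the job for manual inspection.
    DeadLetter,
}

/// Exponential backoff: base_delay * 2^(attempt - 1), capped at max_delay.
struct RetryPolicy {
    max_attempts: u32,
    base_delay: Duration,
    max_delay: Duration,
}

impl RetryPolicy {
    /// `attempt` is the 1-based number of the attempt that just failed.
    fn after_failure(&self, attempt: u32) -> Outcome {
        if attempt >= self.max_attempts {
            return Outcome::DeadLetter;
        }
        // 2^(attempt - 1), saturating so large attempt counts cannot overflow.
        let factor = 2u32.saturating_pow(attempt.saturating_sub(1));
        let delay = self.base_delay.saturating_mul(factor).min(self.max_delay);
        Outcome::RetryAfter(delay)
    }
}
```

With `max_attempts: 5`, a 10 s base delay, and a 60 s cap, failures 1 through 4 are retried after 10 s, 20 s, 40 s, and 60 s, and the fifth failure dead-letters the job. The current Cloud Scheduler setup has none of this: a failed run simply waits for the next scheduled pass.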

Scope

  1. Add Apalis (`apalis` plus `apalis-sql` with its `postgres` feature) as a workspace dependency.
  2. New crate `observing-jobs` (or fold into an existing crate) hosting:
    • The Apalis runtime / worker entry point binary.
    • Job definitions as `apalis::Job` impls.
    • Migration for the Apalis tables (separate sqlx migration, or via Apalis's own migrate fn — decide).
  3. Migrate `resolve_taxa` from a standalone binary to an Apalis job. Keep the same logic; cron triggers replace Cloud Scheduler.
  4. Replace the Cloud Run Job + Cloud Scheduler wiring with a single long-lived Cloud Run Service hosting the Apalis worker. Tear down the cron infra in IaC.
  5. Document how to add a new job (single example in CONTRIBUTING / docs).

Open questions

  • Schema location: Apalis's tables in `apalis` schema vs `public` vs reuse of `appview` / `ingester`. Probably its own schema for clarity; revoke runtime-role write access on it from non-job services.
  • Workers per service: one Cloud Run Service for all jobs, or one per job class? Start with one; split if a noisy neighbor emerges.
  • Backwards compatibility: do we keep `resolve_taxa` as a one-shot CLI for ops use, or fully replace? Probably keep — it's useful for ad-hoc backfills outside the job system.
  • Postgres connection pressure: Apalis polls the DB for jobs. Tune polling interval and worker count. Verify it doesn't move the needle on Cloud SQL load.
  • Observability: how to surface job state in our existing tracing/Cloud Logging stack. Apalis has `apalis-prometheus` and tracing layers; pick one.
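
For the connection-pressure question, a back-of-the-envelope estimate may help before measuring anything. The sketch below assumes each worker issues one fetch query per polling interval, which should be checked against how the chosen Apalis version actually polls; `poll_queries_per_second` and its parameters are hypothetical names.

```rust
use std::time::Duration;

/// Rough estimate of idle polling queries per second hitting Postgres.
///
/// Assumption (not verified against Apalis): every worker issues exactly
/// one fetch query per `poll_interval`, even when the queue is empty.
/// A zero interval yields infinity, so callers should pass a positive one.
fn poll_queries_per_second(
    instances: u32,
    workers_per_instance: u32,
    poll_interval: Duration,
) -> f64 {
    let pollers = f64::from(instances) * f64::from(workers_per_instance);
    pollers / poll_interval.as_secs_f64()
}
```

Under that assumption, one Cloud Run instance with 4 workers polling every 500 ms comes to about 8 queries per second at idle, while the same workers polling every 5 s come to under 1. Comparing the estimate against current Cloud SQL query rates gives a first sense of whether polling could "move the needle" at all.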

Acceptance criteria

  • At least two distinct jobs running on Apalis (the trigger condition for this work).
  • `resolve_taxa` runs on the same cadence as today, via Apalis, with the same idempotency guarantees.
  • Cloud Scheduler trigger for `resolve_taxa` removed.
  • An HTTP endpoint (or admin CLI) exists that can ad-hoc-trigger any job.
  • Job retries / failures are visible in our existing logs.
  • Docs or a README section showing how to add a new job in <50 lines.
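
One possible shape for the ad-hoc trigger criterion is a name-to-handler registry that an HTTP endpoint or admin CLI dispatches into. The sketch below is framework-agnostic `std` Rust and does not use the Apalis API; `JobRegistry`, `Handler`, and the placeholder `resolve_taxa` handler are hypothetical. In the real system `trigger` would presumably enqueue a job rather than run it inline.

```rust
use std::collections::BTreeMap;

/// A handler takes a free-form argument and returns a status message or an error.
/// Hypothetical signature, chosen only to keep the sketch small.
type Handler = fn(&str) -> Result<String, String>;

/// Maps job names to handlers so any registered job can be triggered by name.
struct JobRegistry {
    jobs: BTreeMap<&'static str, Handler>,
}

impl JobRegistry {
    fn new() -> Self {
        Self { jobs: BTreeMap::new() }
    }

    /// Adding a new job is one `register` call plus the handler itself.
    fn register(&mut self, name: &'static str, handler: Handler) {
        self.jobs.insert(name, handler);
    }

    /// Entry point for an HTTP endpoint or admin CLI; unknown names are errors.
    fn trigger(&self, name: &str, arg: &str) -> Result<String, String> {
        match self.jobs.get(name) {
            Some(handler) => handler(arg),
            None => Err(format!("unknown job: {name}")),
        }
    }

    /// Registered job names in sorted order, e.g. for a `--list` flag.
    fn names(&self) -> Vec<&'static str> {
        self.jobs.keys().copied().collect()
    }
}

/// Placeholder standing in for the real `resolve_taxa` logic.
fn resolve_taxa(arg: &str) -> Result<String, String> {
    Ok(format!("resolve_taxa requested for {arg}"))
}
```

Usage would look like `registry.register("resolve_taxa", resolve_taxa)` at startup and `registry.trigger("resolve_taxa", "taxon X")` from the endpoint. Keeping registration to a single line per job is also one way to stay within the line budget in the last criterion.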
