Skip to content

Antalya-26.3 Added support for TTL EXPORT#1810

Open
mkmkme wants to merge 29 commits into
antalya-26.3from
mkmkme/antalya-26.3/ttl-export-partition
Open

Antalya-26.3 Added support for TTL EXPORT#1810
mkmkme wants to merge 29 commits into
antalya-26.3from
mkmkme/antalya-26.3/ttl-export-partition

Conversation

@mkmkme
Copy link
Copy Markdown
Collaborator

@mkmkme mkmkme commented May 19, 2026

Fixes #1793

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

ReplicatedMergeTree gains support for TTL EXPORT TO TABLE

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

@mkmkme mkmkme added antalya port-antalya PRs to be ported to all new Antalya releases antalya-26.3 labels May 19, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 19, 2026

Workflow [PR], commit [4abd8db]

@mkmkme
Copy link
Copy Markdown
Collaborator Author

mkmkme commented May 19, 2026

With the new commit 04206 (syntax check) is now passing

@mkmkme mkmkme force-pushed the mkmkme/antalya-26.3/ttl-export-partition branch from 0f90fd6 to 1d0c1a7 Compare May 19, 2026 11:14
mkmkme and others added 9 commits May 20, 2026 15:02
Tests describe the contract for the upcoming `TTL ... EXPORT TO db.table`
action. They are added before the C++ implementation so they double as the
acceptance criteria.

Stateless (tests/queries/0_stateless):
- 04206_ttl_export_partition_syntax: parser/metadata round-trip and rejection
  of (a) two EXPORT TTLs to the same destination and (b) EXPORT TTL on a
  table without a partition key.
- 04207_ttl_export_partition_basic: happy path, plus an in-line assertion
  that a future-dated partition is not exported.
- 04208_ttl_export_partition_skip_already_exported: re-triggering after a
  partition has been exported does not duplicate it.

Integration (tests/integration/test_ttl_export_partition):
- test_basic_to_iceberg, test_only_one_replica_submits,
  test_failure_and_backoff, test_serial_across_partitions,
  test_replica_restart_mid_export,
  test_modify_ttl_picks_up_with_materialize, test_disabled_replica,
  test_dedup_via_high_water_mark.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduce parser, AST, and metadata plumbing for `TTL ... EXPORT TO db.table`.
No background scheduler or per-part TTL info yet — those land in follow-up
commits. The clause is recognised, round-trips through `SHOW CREATE TABLE`,
and the resulting `TTLDescription` is collected into `TTLTableDescription`'s
new `export_ttl` list (exposed via `StorageInMemoryMetadata::getExportTTLs`).

Validation in `TTLTableDescription::getTTLForTableFromAST`:
  * reject two `EXPORT` clauses to the same destination,
  * reject `EXPORT` TTL on a table with no partition key.

The destination-specific override of `TTLDescription::result_column`
(`"_export_" + db + "." + table`) is required so that future per-part TTL
info (keyed by `result_column`) keeps separate clocks per destination.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stores per-part TTL info under `MergeTreeDataPartTTLInfos::export_ttl`,
keyed by `TTLDescription::result_column`. The map is:

  * populated at write time in `MergeTreeDataWriter` for every TTL
    returned by `getExportTTLs`,
  * recomputed during `MATERIALIZE TTL` and merge-time TTL recompute via
    `TTLCalcTransform` / `TTLTransform` (a new `TTLUpdateField::EXPORT_TTL`
    finalizes into the right map),
  * serialized in JSON under the `"export"` key (mirroring the
    `recompression` entry),
  * propagated across merges through the existing `update` aggregation,
  * surfaced through `hasAnyNonFinishedTTLs` and `checkAllTTLCalculated`
    so old parts that predate the TTL are flagged for `MATERIALIZE TTL`.

Adds the partition-wide helper `getPartitionExportTTLMax`: returns the
max `export_ttl.max` across all parts of a partition, or `nullopt` if
any part is missing the entry (with optional `missing_parts_out` for
the scheduler to log). Deliberate: no on-the-fly evaluation — the user
runs `ALTER TABLE ... MATERIALIZE TTL` to backfill, same UX as moves
and recompression TTLs.

Also pulls `export_ttl` into `hasAnyTableTTL` / `hasOnlyRowsTTL` and the
`getColumnDependencies` TTL-column-set walk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Align the TTL syntax with `ALTER ... EXPORT PARTITION TO TABLE`: the
keyword is now `EXPORT TO TABLE <db.table>` instead of `EXPORT TO
<db.table>`. Parser, formatter, exception messages, and tests updated to
match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an `ExportOrigin` enum (`alter` | `ttl`) to the manifest body so
manifests submitted manually (`ALTER ... EXPORT PARTITION`) can be told
apart from manifests submitted by the upcoming TTL scheduler. Surfaced
as `system.replicated_partition_exports.export_origin
Enum8('alter' = 0, 'ttl' = 1)`. Existing manifests in ZooKeeper that
don't carry the field read back as `alter` for backwards compatibility.

`ttl`-origin manifests are skipped by manifest-TTL eviction: the
background cleanup in `ExportPartitionManifestUpdatingTask` and the
overwrite path in `StorageReplicatedMergeTree::exportPartitionToTable`
both refuse to consider them expired. The existing
`export_merge_tree_partition_force_export` setting still overrides via
the unchanged gate.

The write site keeps the default `ExportOrigin::alter`; the TTL
submitter writes `ExportOrigin::ttl` in a follow-up commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new query-level setting `export_merge_tree_partition_mark_as_ttl`
(default false). When set on `ALTER ... EXPORT PARTITION`, the resulting
manifest is written with `export_origin = ttl` (same as what the TTL
scheduler will write in a follow-up). The TTL scheduler always sets this
implicitly when it submits.

Enforces the "at most one ttl-origin manifest per (src, dest)"
invariant at submission time: when a ttl-origin manifest is being
created, scan siblings under `<zk_root>/exports/` for an existing
ttl-origin marker at a different `partition_id`. If found at `P_old`,
reject the submission as a back-fill (`new < P_old`) unless
`export_merge_tree_partition_force_export` is set; otherwise
best-effort `tryRemoveRecursive` of the old marker before creating the
new one. Same-key collisions continue to be handled by the existing
block.

A plain `alter` over a ttl marker at a different partition is allowed
without friction — alter manifests coexist with the ttl marker, and
the TTL scheduler will filter by `export_origin = ttl` when reading
its own state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extracts the readable-vs-insertable column diff and the partition-key
AST compare from `StorageReplicatedMergeTree::exportPartitionToTable`
into `ExportPartitionUtils::verifyExportDestinationCompatibility`, and
calls it from `TTLTableDescription::getTTLForTableFromAST` for every
`TTLMode::EXPORT` clause when not attaching. The destination is resolved
through `DatabaseCatalog::getTable`, matching the manual
`ALTER ... EXPORT PARTITION` flow (throws `UNKNOWN_TABLE` if missing).

The check is skipped under `is_attach=true` because the destination
table may not yet be loaded at server startup; submission-time
validation in `exportPartitionToTable` still covers that path.

Iceberg destinations skip the partition-key AST compare here; the
existing `verifyIcebergPartitionCompatibility` runs against the runtime
iceberg metadata at submission time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces `TTLExportScheduler`, a per-`StorageReplicatedMergeTree`
background driver that submits partition exports for tables with
`TTL ... EXPORT TO TABLE db.table`. The scheduler is stateless across
restarts: it reads the latest `export_origin = ttl` manifest from
ZooKeeper on every tick and acts on its status — no manifest →
submit the smallest eligible partition; PENDING → wait; COMPLETED →
walk forward to `partition_id > completed`; FAILED → resubmit with
`force_export=1` after per-partition exponential backoff; KILLED →
idle with a `LOG_WARNING` carrying the recovery recipe.

`submit` classifies outcomes as `Submitted | Transient | Failure` so
ZK CAS races and `UNKNOWN_TABLE` (destination dropped post-DDL) do
not bump backoff, while genuine submission errors do.

Adds three table-level settings used by the scheduler:
`export_merge_tree_partition_ttl_poll_interval_seconds` (default 5),
`export_merge_tree_partition_ttl_min_backoff_seconds` (default 1),
`export_merge_tree_partition_ttl_max_backoff_seconds` (default 60).

The scheduler is not yet wired into the background task pool; that
follows in a separate commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Declare the scheduler and its background task next to the other
`export_merge_tree_partition_*` task holders. Under the
`allow_experimental_export_merge_tree_partition` server gate:

- Construct the scheduler and create the `TTLExport` task, logging any
  exceptions from `run` via `tryLogCurrentException`.
- `ReplicatedMergeTreeRestartingThread::tryStartup` activates the task
  alongside the other export tasks; `partialShutdown` deactivates it.
- `alter` calls `ttl_export_task->schedule` when any `MODIFY TTL`
  command is in the alter so newly added EXPORT TTLs take effect
  immediately.
- `TTLExportScheduler::run` reschedules itself with
  `export_merge_tree_partition_ttl_poll_interval_seconds * jitter25`
  on the polling path. Early returns (shutdown, readonly, no EXPORT
  TTL, experimental gate off) intentionally skip the reschedule —
  deactivation, the `alter` hook, and server-level config drive those
  paths instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mkmkme mkmkme force-pushed the mkmkme/antalya-26.3/ttl-export-partition branch from 1d0c1a7 to f7d8a27 Compare May 20, 2026 13:10
mkmkme and others added 4 commits May 20, 2026 18:31
In `exportPartitionToTable`, the cross-partition swap used to delete the
existing ttl-origin marker before the `tryMulti` that creates the new
manifest. Any throw between the delete and the multi (parts.empty,
pending mutations, iceberg compatibility, or a ZK error mid-multi) left
the scheduler with no ttl marker at all, so the next tick of
`TTLExportScheduler` would treat the table as fresh and restart from the
oldest expired partition.

Now:

  - The sibling cache walk collects every stale ttl-origin entry for the
    destination, not just the first one encountered. The freshest by
    `create_time` is used for the back-fill check; the others are kept
    for cleanup. This makes the walk deterministic (no break-on-first-
    match in unspecified iteration order) and self-healing for any
    stragglers a previous cleanup may have missed.
  - The `tryRemoveRecursive` of those stale markers runs after the
    `tryMulti` succeeds. Failures during validation or in the multi
    itself now leave the existing marker intact, so the scheduler keeps
    its high-water mark.
  - The post-multi cleanup is best-effort; a ZK error there at worst
    leaves dead nodes, which the next ttl submission will reap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `export_merge_tree_partition_mark_as_ttl` to the Antalya 26.3 session
block, and the three TTL EXPORT scheduler MergeTree settings
(`export_merge_tree_partition_ttl_poll_interval_seconds`,
`export_merge_tree_partition_ttl_min_backoff_seconds`,
`export_merge_tree_partition_ttl_max_backoff_seconds`) to the 26.3 MergeTree
block, so `02995_new_settings_history` passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mkmkme mkmkme marked this pull request as ready for review May 20, 2026 17:42
@mkmkme mkmkme changed the title [WIP] Antalya-26.3 Added support for TTL EXPORT Antalya-26.3 Added support for TTL EXPORT May 20, 2026
instead of context's `current_database`

This should fix the stateless tests.
@mkmkme
Copy link
Copy Markdown
Collaborator Author

mkmkme commented May 21, 2026

test_database_iceberg/test.py::test_namespace_filter is failing on the current antalya-26.3 as well, so seems unrelated to me

@mkmkme
Copy link
Copy Markdown
Collaborator Author

mkmkme commented May 21, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1ebd61019e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/Storages/MergeTree/TTLExportScheduler.cpp Outdated
Comment thread src/Storages/TTLDescription.cpp
Comment thread src/Parsers/ASTTTLElement.cpp Outdated
mkmkme and others added 2 commits May 21, 2026 12:34
Partition IDs are arbitrary strings derived from the partition expression;
lex comparison can permanently strand newer partitions when widths differ
(`PARTITION BY year` with values 9, 10, 11 lex-orders as "10", "11", "9",
so after exporting "9" the scheduler would skip "10" and "11" forever).

`pickPartition` now forward-walks by expiration max, with the floor's
expiration max recomputed from the source's current parts. The floor's
exact `partition_id` is still excluded by string equality so we don't
re-pick the partition we just completed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous representation flattened `dst_database` and `dst_table` into a
single `destination_name` string joined by a dot. That made the boundary
ambiguous for quoted table names that legitimately contain a dot — e.g.
``TTL ... EXPORT TO TABLE `a.b` `` round-tripped through the formatter as
`` `a`.`b` `` (database `a`, table `b`), changing semantics on metadata
reattach and sending exports to the wrong destination via
`QualifiedTableName::parseFromString` in the scheduler and the DDL-time
schema-compat check.

Mirror the convention already used by `ASTAlterQuery` for the matching
`ALTER ... EXPORT PARTITION ... TO TABLE` syntax: store `destination_database`
and `destination_name` as two separate strings on `ASTTTLElement` and
`TTLDescription`. The parser fills both via `parseDatabaseAndTableName`; the
formatter emits `backQuoteIfNeed(db) + "." + backQuoteIfNeed(table)` when
qualified, else just the table. Both `parseFromString` call sites use the
fields directly.

Per-part `export_ttl` map is keyed by `TTLDescription::getExportKey`, which
uses `backQuoteIfNeed` for both parts so qualified `(a, b)` and unqualified
single-table `a.b` produce distinct keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mkmkme
Copy link
Copy Markdown
Collaborator Author

mkmkme commented May 21, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 48097f2db9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/Storages/StorageReplicatedMergeTree.cpp Outdated
Comment thread src/Storages/TTLDescription.cpp
mkmkme and others added 3 commits May 21, 2026 13:10
`exportPartitionToTable` rejected a `mark_as_ttl=1` submission when the new
`partition_id` was lex-less than the existing ttl marker's `partition_id`.
With variable-width partition IDs (e.g. `PARTITION BY year` yielding "9", "10"),
the scheduler's natural forward step picks "10" after completing "9", the
guard throws BAD_ARGUMENTS because `"10" < "9"` lex, the scheduler classifies
that as `Failure`, bumps backoff, and retries with the same partition —
stalling TTL exports for that destination.

Mirror the `pickPartition` fix: locate the EXPORT TTL targeting this
destination, compute the expiration max for both the new partition and the
existing marker's partition from the source's current parts, and throw only
when the new expiration max is strictly less than the old. Skip the guard
when no matching TTL is found (orphaned marker after the TTL was dropped) or
when either partition's parts are gone (DELETE TTL already cleaned up).
`export_merge_tree_partition_force_export` continues to bypass the whole
guard, including legitimate backward moves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`TTLExportScheduler` is only constructed and activated by
`StorageReplicatedMergeTree`; the base `MergeTreeData::exportPartitionToTable`
throws `NOT_IMPLEMENTED`. Without a DDL-time guard, `CREATE TABLE ... ENGINE
= MergeTree ... TTL ... EXPORT TO TABLE` and the matching `ALTER` succeeded
but no background exporter ever submitted manifests — silent no-op.

Throw `NOT_IMPLEMENTED` at parse time:
- CREATE: in `registerStorageMergeTree.cpp` right after `getTTLForTableFromAST`,
  using the local `replicated` bool.
- ALTER: in `MergeTreeData::checkAlterIsPossible`, using virtual
  `supportsReplication`. The check can't live in `checkTTLExpressions`
  because that method is also called from `MergeTreeData`'s constructor,
  where virtual dispatch returns the base-class value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the full `std::map<String, DataPartsVector>` build + two lookups
with a single pass that fills two `DataPartsVector`s for the new and the
existing-marker partitions directly. `getPartitionExportTTLMax` returns
`nullopt` for an empty vector, so the `find`/`end` guards collapse into the
function call. No semantic change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mkmkme
Copy link
Copy Markdown
Collaborator Author

mkmkme commented May 21, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dd3a2ac2ef

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/Storages/StorageReplicatedMergeTree.cpp Outdated
Comment thread src/Storages/TTLDescription.cpp
mkmkme and others added 2 commits May 21, 2026 15:00
`StorageInMemoryMetadata::getExportTTLs` returns `TTLDescriptions` by value.
The range-based for at the matching-TTL search bound the loop to that
temporary, and the temporary's lifetime ends at the closing brace of the
range-for statement. The `matching_ttl` pointer set inside the loop then
dangled by the time `getPartitionExportTTLMax` dereferenced it below —
use-after-free on every `mark_as_ttl=1` submission that exercised the
backward-marker guard.

Hoist the returned container into a local so it outlives the pointer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`ALTER ... EXPORT PARTITION ... TO TABLE` resolves `to_database` at submit
time, and `CREATE MATERIALIZED VIEW ... TO db.t` resolves the TO target at
parse time via `setCurrentDatabase` in `InterpreterCreateQuery`. Two
different conventions exist:

  - CREATE MV target: filled with the user's session `current_database`.
  - ALTER expressions: filled with the source table's database via
    `AddDefaultDatabaseVisitor` in `InterpreterAlterQuery`.

TTL EXPORT had no analogous step, so the AST stored whatever the user
wrote, and downstream code had to re-resolve at every use. A previous
attempt to resolve only inside `TTLDescription::getTTLFromAST` produced
inconsistent map keys for the per-part `export_ttl` info between
CREATE-time and ATTACH-time `TTLDescription` instances, which broke the
`test_replica_restart_mid_export` integration test.

Follow the existing patterns: teach `AddDefaultDatabaseVisitor` about
`ASTTTLElement` so ALTER fills `destination_database` with the source
table's database, and add an analogous walk to
`InterpreterCreateQuery::createTable` so CREATE fills it with the user's
`current_database`. From this point on, every code path sees a
fully-qualified destination in the AST and downstream `TTLDescription`,
and the per-part map key (`TTLDescription::getExportKey`) is consistent
across CREATE / ATTACH / restart.

The 04206 syntax test extracts the formatted TTL clause from
`system.tables.create_table_query`, which now contains the qualified
destination; strip the `currentDatabase()` prefix so the reference output
stays stable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mkmkme
Copy link
Copy Markdown
Collaborator Author

mkmkme commented May 21, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 153b4a1677

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/Storages/MergeTree/TTLExportScheduler.cpp Outdated
The scheduler previously recovered the floor partition's expiration max by
looking up the marker's `partition_id` in the source's currently active parts.
With the canonical paired-TTL recipe — `EXPORT after 7 days, DELETE after 30
days` — the marker's partition is dropped by retention as a matter of course.
At that point the lookup misses, `floor_max` stays unset, the ordering filter
in `TTLExportScheduler::pickPartition` is silently skipped, and any older
expired partition with surviving parts becomes eligible. The result is a
silent replay and a backwards step for the `(source, destination)` stream.

Add `export_ttl_max` to `ExportReplicatedMergeTreePartitionManifest` and
populate it at submit time from the matching `EXPORT TTL` clause and the
exact parts being exported. The contract is: a manifest with `export_origin
= ttl` always carries a non-zero `export_ttl_max`, and together with
`status = COMPLETED` it guarantees that the partition has been successfully
exported up to that expiration max. `pickPartition` now reads the marker's
stored watermark directly — no source-parts lookup, so the ordering survives
the marker's partition being dropped.

Enforcing the invariant at the write site lets the submit-time
backward-marker guard fold its work too: the new partition's max comes from
the same `getPartitionExportTTLMax` call that populates the manifest, and the
existing marker's max comes from its stored `export_ttl_max` (read out of the
in-memory `export_merge_tree_partition_task_entries` cache). The duplicate
matching-TTL lookup and the separate `old_parts/new_parts` bucketing scan are
gone.

Two new throws at the write site surface bad inputs explicitly:
`mark_as_ttl=1` without a matching `EXPORT TTL` clause, and a partition whose
parts pre-date the TTL (the user is told to run `ALTER TABLE ... MATERIALIZE
TTL`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mkmkme
Copy link
Copy Markdown
Collaborator Author

mkmkme commented May 21, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f6cbe79cb2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/Storages/StorageReplicatedMergeTree.cpp
mkmkme and others added 2 commits May 21, 2026 19:54
The watermark-based forward walk used a strict-less comparison on
`expiration_max` alone. When two partitions share the same partition-wide
expiration max — which happens whenever the EXPORT TTL expression is not
monotonic by partition key, e.g. `PARTITION BY user_id` or hash
partitioning — the scheduler would ping-pong indefinitely:

- Marker at `(A, T)`; `pickPartition` sees `B` with max `T`, `T < T` is
  false → `B` is eligible → submit `B`.
- Marker becomes `(B, T)`; same logic flips → submit `A` again.
- The submit-time backward-marker guard used the same one-key check, so
  it did not stop the cycle either, and the stale-marker cleanup removed
  each prior manifest at the end of the new submission.

Each cycle replays data to the destination.

Tie-break the walk by `partition_id` lex order as a secondary key:

- `TTLExportScheduler::pickPartition` rejects partitions with
  `(*max_ttl, pid) <= (floor->expiration_max, floor->partition_id)` and
  picks the smallest tuple as the best candidate.
- `StorageReplicatedMergeTree::exportPartitionToTable` mirrors the
  comparison in the backward-marker guard.

Partition IDs are arbitrary strings (lex `"10" < "9"`), so they cannot
serve as the primary key for forward progress — but as a deterministic
tie-breaker for equal expiration_max they're fine, since the ordering
among tied partitions has no semantic meaning. Any stable order will do;
lex is the cheapest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ation

The local `alter` path already schedules `ttl_export_task` when a `MODIFY
TTL` adds an EXPORT TTL, but replicas that did not initiate the ALTER apply
the metadata change through `setTableStructure` / `checkAndFindDiff` and
never enter that branch. `TTLExportScheduler::run` returns without
rescheduling when there is no EXPORT TTL, so a replica that started up
without one and later receives one via replication stays dormant
indefinitely. In a failover where the initiating replica goes down, TTL
exports stop cluster-wide.

Hook `setTableStructure` to call `ttl_export_task->schedule()` exactly on
the transition from "no EXPORT TTL" to "has EXPORT TTL" — the canonical
landing point for replicated metadata changes, and the precise check
avoids spurious wakes on unrelated replicated ALTERs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mkmkme
Copy link
Copy Markdown
Collaborator Author

mkmkme commented May 21, 2026

@codex review

@mkmkme
Copy link
Copy Markdown
Collaborator Author

mkmkme commented May 21, 2026

Had to argue with the AI a bit about what is considered an issue (and spent $10 just on this audit alone 🥲), but eventually got a clean report


AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1810 (TTL EXPORT TO, branch vs antalya-26.3):

No confirmed defects in reviewed scope.

Coverage summary:

  • Scope reviewed: parser/AST and DDL integration for TTL ... EXPORT TO TABLE, replicated scheduler flow, manifest lifecycle, timeout/exception bookkeeping, and replicated export system-table surfaces.
  • Categories failed: none confirmed.
  • Categories passed: parser/DDL correctness, scheduler forward-walk logic, backward-marker guard coherence, replicated wake-up on MODIFY TTL, and no confirmed new race/deadlock/UAF/lifetime regressions in reviewed paths.
  • Assumptions/limits: static audit only (no runtime fault-injection in this pass); dangerous override combinations with export_merge_tree_partition_force_export=1 treated as explicit operator responsibility by policy.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0469250927

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

|| entry.manifest.destination_table != dest_table)
continue;

if (!latest || entry.manifest.create_time > latest->create_time)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Select TTL marker by durable order, not wall-clock time

This picks the current TTL marker using manifest.create_time, which comes from each replica’s local time(nullptr) and is not globally ordered. When stale ttl-origin manifests coexist (which this change explicitly allows as a transient state after best-effort cleanup), clock skew can make an older marker look newer, so the scheduler may follow the wrong status/partition (e.g., idling on a stale KILLED marker or replaying older partitions) and lose forward progress guarantees. The marker should be chosen by a replica-independent durable order (for example the persisted (export_ttl_max, partition_id) watermark or ZooKeeper ordering metadata), not per-host wall clock.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator

@arthurpassos arthurpassos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll continue it tomorrow morning

Comment thread src/Core/SettingsChangesHistory.cpp Outdated
addSettingsChanges(merge_tree_settings_changes_history, "26.3",
{
{"export_merge_tree_partition_ttl_poll_interval_seconds", 5, 5, "New setting for the TTL EXPORT scheduler poll interval."},
{"export_merge_tree_partition_ttl_min_backoff_seconds", 1, 1, "New setting for the TTL EXPORT scheduler minimum backoff."},
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you plan to merge it with the backoff policy?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was planning to remove the backoff policy from the TTL scheduler itself and rely strictly on export partition backoff. Since it's not yet a thing, I thought about keeping it for a while. But considering the fact it actually introduces settings and changes in settings history, I think I'll drop it now

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped: c7ed5f1


namespace ExportPartitionUtils
{
void verifyExportDestinationCompatibility(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't checked the rest of the code yet, but we must also check for partition expression compatbility for data lakes

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah true, thanks. I haven't moved that part of the code yet. Will do!

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done it in 3767581

std::optional<time_t> getPartitionExportTTLMax(
const TTLDescription & desc,
const DataPartsVector & parts_in_partition,
std::vector<String> * missing_parts_out = nullptr);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I searched for occurences of getPartitionExportTTLMax and none seem to actually use the last argument. Can't we just remove it?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good catch!

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::optional<TTLExportScheduler::TtlMarker> TTLExportScheduler::findTtlMarker(
const String & dest_database, const String & dest_table)
{
std::lock_guard lock(storage.export_merge_tree_partition_mutex);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh boy, everywhere we r grabbing this lock. I am so freakin afraid of a deadlock or contention problems

mkmkme and others added 5 commits May 22, 2026 10:47
The schema-validation extraction in 77df5e6 moved the column-diff and the
non-data-lake partition-key compare to DDL time, but left iceberg partition
compatibility verified only at submission time. A user who declares an
`EXPORT TO TABLE` TTL against an Iceberg destination with an incompatible
partition spec gets the rejection only when the scheduler eventually fires,
which can be days or weeks after the ALTER.

Add `verifyIcebergPartitionCompatibilityAtDestination` to `ExportPartitionUtils`
(fetch the destination iceberg metadata + delegate to the existing
`verifyIcebergPartitionCompatibility`) and call it from
`TTLTableDescription::getTTLForTableFromAST` for data-lake destinations,
gated on `!is_attach` like the sibling column-diff check.

The submission-time call at `StorageReplicatedMergeTree.cpp:8638` is left
in place as the moving-target defense: Iceberg supports schema and partition
spec evolution (`ALTER TABLE ... ADD/DROP PARTITION FIELD` from Spark/Trino),
and TTL EXPORT runs on a horizon where such evolution is realistic. The
post-DDL drift case would otherwise slip past — Iceberg's optimistic
concurrency only catches divergence between submission and commit, not
between DDL and submission, so a partition-spec change before the scheduler
fires could land data under the wrong partition layout with no error.

The `EXPORT PART` path in `MergeTreeData.cpp:6744` is unchanged: it has no
DDL-time hook (manual single-part export), and submission is the only entry
point so there is no DDL→submit gap to bridge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. Stateless test that checks that altering source table to an
   incompatible schema fails on DDL-time check
2. Integration test that checks that altering _destination_ table to an
   incompatible schema leads to export partition-level error and doesn't
   crash the server
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

antalya antalya-26.3 antalya-26.3.10.20001 port-antalya PRs to be ported to all new Antalya releases

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants