Antalya 26.3: support external paths in Iceberg tables#1812
Open
zvonand wants to merge 3 commits into
Open
Conversation
zvonand
added a commit
to Altinity/RelEasy
that referenced
this pull request
May 19, 2026
`commit_cherry_pick_conflict_as_is` and `commit_conflict_markers` were doing `git add --all` before committing the with-conflict- markers checkpoint. That sweeps everything in the working tree that isn't gitignored — and real C++ repos accumulate plenty outside .gitignore: ClickHouse leaves server runtime data under `tmp/server_data*/store/<uuid>/<part>/...cmrk2`, build pipelines spit out generated headers, autosaves, etc. bug seen 2026-05-19: Altinity/ClickHouse#1812 ended up with **696 429 additions across 19 683 files** because tmp/server_data* was tracked-modified at the time of cherry-pick and got swept in. new helper `_stage_unmerged_paths` uses `git diff --name-only --diff-filter=U` to stage exactly the conflict-marked files. The clean parts of the cherry-pick are already staged by git automatically — only the unmerged paths (whose textual content is the markers themselves) need explicit staging. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5636ee3 to
90519de
Compare
zvonand
commented
May 20, 2026
Adapted PR ClickHouse#90740 (Read iceberg from various paths) to antalya-26.3 without applying the prerequisite upstream PR ClickHouse#100420 (IcebergPath / path_resolver refactor). The refactor is dropped; raw `String` paths are used instead. Adaptations from PR 90740 to antalya-26.3: - `IcebergPathFromMetadata` references → plain `String` (no `.serialize()`, no `IcebergPathFromMetadata::deserialize` wrapping). - `IcebergPathResolver & path_resolver` parameters → `const String & table_location`. Calls like `path_resolver.resolve(x)` become `x`. - `SecondaryStorages` infrastructure kept: thread-safe map of secondary object storages plus a `resolveObjectStorageForPath` helper that maps a metadata path to a (storage, key) pair. The IcebergPath-aware overload of `resolveObjectStorageForPath` was removed. - New protocol version `DBMS_CLUSTER_PROCESSING_PROTOCOL_VERSION_WITH_ICEBERG_ABSOLUTE_PATH = 7` used in `IcebergObjectSerializableInfo::{serializeForClusterFunctionProtocol, deserializeForClusterFunctionProtocol}` to gate the new `data_object_file_metadata_path` field and `requires_external_storage` check; `_path` for delete files goes through `SchemeAuthorityKey` on older protocols. Dropped (depend on upstream commits not on antalya-26.3): - `ExpireSnapshotsExecute.{cpp,h}`, `RemoveOrphanFilesExecute.{cpp,h}`, `SnapshotFilesTraversal.{cpp,h}` — extracted EXECUTE handlers from upstream PR introducing per-command refactor. PR 90740 only threads `secondary_storages` into these; the underlying refactor is a separate dependency. The antalya-26.3 `Iceberg::expireSnapshots` path is kept unchanged in `IcebergMetadata::executeCommand`. - `executeExpireSnapshots` / `executeRemoveOrphanFiles` dispatch in `IcebergMetadata::executeCommand` — depends on the dropped files. References: - Upstream PR: ClickHouse#90740 - antalya-26.1 backport (used as a structural reference for the no-IcebergPath adaptation): 0520e2e ("Allow to read iceberg table data from any location") Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- src/Storages/ObjectStorage/Utils.cpp: drop leftover `S3UriStyle::AUTO` arguments from two `S3::URI` ctor calls (S3UriStyle does not exist on antalya-26.3; the S3UriStyle parameter was already dropped from the ctor signature in the cherry-pick). - src/Storages/ObjectStorage/DataLakes/Iceberg/Mutations.cpp: `collectRetainedFiles` and `collectExpiredFiles` now pass a local empty `SecondaryStorages` to `getManifestList` / `getManifestFileEntriesHandle` so the new mandatory parameter compiles without threading external storages through the `Iceberg::expireSnapshots` dispatcher. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b7609fa to
4cd644b
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Cherry-pick of upstream PR ClickHouse#90740 ("Read iceberg from various paths") to antalya-26.3.
The upstream PR is rebased on top of master that includes ClickHouse#100420 (
IcebergPath/path_resolver/ Spark-Azure interop refactor). That refactor brings ~1500 LoC unrelated to the feature itself, so it is not ported here — instead, PR 90740 is adapted to use rawStringpaths.The same approach was used for the antalya-26.1 backport ( #1461 , single commit
0520e2ee3b9"Allow to read iceberg table data from any location"), which is the structural reference for this port.What's kept from 90740:
SecondaryStoragesinfrastructure (thread-safemap<string, ObjectStoragePtr>) andresolveObjectStorageForPathhelper — the actual feature.DBMS_CLUSTER_PROCESSING_PROTOCOL_VERSION_WITH_ICEBERG_ABSOLUTE_PATH = 7and the newdata_object_file_metadata_path/requires_external_storagefields onIcebergObjectSerializableInfo.enable_url_encodingparameter onS3::URIctor.tests/integration/test_storage_iceberg_multistorage/.What's adapted:
IcebergPathFromMetadata→String; no.serialize(), noIcebergPathFromMetadata::deserialize(...)wrapping.const IcebergPathResolver & path_resolverparameters →const String & table_location;path_resolver.resolve(x)becomesx.S3UriStyle uri_stylector parameter dropped onS3::URI(type does not exist on antalya-26.3).What's dropped (depends on upstream commits not on antalya-26.3):
ExpireSnapshotsExecute.{cpp,h},RemoveOrphanFilesExecute.{cpp,h},SnapshotFilesTraversal.{cpp,h}— extracted per-command EXECUTE handlers from an unrelated refactor; PR 90740 only threadssecondary_storagesinto them. The antalya-26.3Iceberg::expireSnapshots+expireSnapshotsResultToPipepath is kept unchanged inIcebergMetadata::executeCommand.executeExpireSnapshots/executeRemoveOrphanFilesswitch branches.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Support Iceberg tables that have data files outside the table location or on a different object storage. Cherry-picked from ClickHouse#90740 (by @zvonand).
Documentation entry for user-facing changes
CI/CD Options
Exclude tests:
Regression jobs to run: