Skip to content

Antalya 26.1 Backport of #94335 - Fix reading columns with dot-separated names from Iceberg#1448

Merged
zvonand merged 1 commit into
antalya-26.1from
backports/antalya-26.1/94335
Mar 2, 2026
Merged

Antalya 26.1 Backport of #94335 - Fix reading columns with dot-separated names from Iceberg#1448
zvonand merged 1 commit into
antalya-26.1from
backports/antalya-26.1/94335

Conversation

@mkmkme
Copy link
Copy Markdown
Collaborator

@mkmkme mkmkme commented Feb 25, 2026

Fix reading columns with dot-separated names from Iceberg

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fixes an issue when Iceberg columns with dot in names returned NULL as values (ClickHouse#94335 by @mkmkme)

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

Fix reading columns with dot-separated names from Iceberg
@github-actions
Copy link
Copy Markdown

Workflow [PR], commit [de63cf0]

Copy link
Copy Markdown
Collaborator

@ilejn ilejn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@svb-alt svb-alt added port-antalya PRs to be ported to all new Antalya releases port-forward labels Feb 25, 2026
@zvonand zvonand merged commit 6c921d6 into antalya-26.1 Mar 2, 2026
664 of 677 checks passed
@alsugiliazova
Copy link
Copy Markdown
Member

Merged without clickhouse-regression run.

@Selfeer
Copy link
Copy Markdown
Collaborator

Selfeer commented Mar 5, 2026

PR #1448 Verification Report

ClickHouse CI: https://github.com/Altinity/ClickHouse/actions/runs/22397698371
Regression Run: https://github.com/Altinity/clickhouse-regression/actions/runs/22625222481


Regression Results: 66 pass / 4 fail / 1 skip (out of 71)

All 4 failures are pre-existing — the identical set appears in independent runs for PR #1430 and PR #1395, neither of which touch the same code.

Failed Suites (all pre-existing)

1. aggregate_functions_3 — SnapshotError

Failing test /aggregate functions/part 3/state/rankCorrState/with group by
Error SnapshotError — output snapshot mismatch for rankCorrState
This PR Job · Report · Artifacts
Same in PR #1430 Job
Same in PR #1395 Job

2. ice — SYNTAX_ERROR

Failing tests All 20 data types under /ice/feature/export parts/
Error SYNTAX_ERRORALTER TABLE ... EXPORT PARTS not recognized in this build
This PR Job · Report · Artifacts
Same in PR #1430 Job
Same in PR #1395 Job

3. settings — SnapshotNotFoundError

Failing tests All tests under /settings/default values/ (hundreds of settings)
Error SnapshotNotFoundError — snapshots not baselined for version 26.1.2.10001.altinitytest
This PR Job · Report · Artifacts
Same in PR #1430 Job
Same in PR #1395 Job

4. version — SnapshotError

Failing test /version/altinity/embedded logos
Error SnapshotError — expected for .altinitytest builds without production branding
This PR Job · Report · Artifacts
Same in PR #1430 Job

Cross-Reference Matrix

Suite PR #1448 PR #1430 PR #1395
aggregate_functions_3 FAIL FAIL FAIL
ice FAIL FAIL FAIL
settings FAIL FAIL FAIL
version FAIL FAIL
iceberg_1 pass pass FAIL
iceberg_2 pass pass FAIL
parquet pass pass FAIL

All 4 failures are present in every comparison run. PR #1448 introduces zero new failures.


PR-Relevant Suites — All Pass

Since this PR fixes Iceberg column handling and modifies the Parquet SchemaConverter, these suites are the most important:

Suite Result Job Report
iceberg_1 PASS Job Report
iceberg_2 PASS Job Report
parquet PASS Job
parquet_minio PASS Job
parquet_aws_s3 PASS Job

Verdict: PASS

No regressions introduced by PR #1448. All 4 failures are pre-existing (proven by identical failures in unrelated PRs #1430 and #1395). Both Iceberg suites and all 3 Parquet suites pass, directly validating the fix.

@Selfeer
Copy link
Copy Markdown
Collaborator

Selfeer commented Mar 5, 2026

PR #1448 Audit Review

1) Scope and partitions

PR is small (3 files), so no functional partitioning was required.

In-scope changes:

  • src/Processors/Formats/Impl/Parquet/SchemaConverter.cpp
  • src/Processors/Formats/Impl/Parquet/SchemaConverter.h
  • tests/integration/test_storage_iceberg_with_spark/test_column_names_with_dots.py (new in PR diff artifact)

Audit focus:

  • Iceberg field-id based column-name mapping in Parquet schema traversal.
  • Correct handling of dot-separated names at top-level and nested levels.
  • Name matching behavior in tuple element resolution.
  • Concurrency and C++ safety classes required by the audit standard.

2) Call graph

Primary runtime flow (read path):

  • Object storage read path provides ColumnMapper (Iceberg schema field-id mapping).
  • SchemaConverter::prepareForReading() / SchemaConverter::inferSchema() starts DFS over Parquet schema tree.
  • SchemaConverter::processSubtree() appends name components and dispatches to primitive/map/array/tuple handlers.
  • SchemaConverter::processSubtreeTuple() iterates child schema elements, resolves tuple element names, and recurses.
  • SchemaConverter::useColumnMapperIfNeeded(element, current_path) maps field_id -> clickhouse_name and derives child component name.

Upstream mapping source:

  • IcebergSchemaProcessor::traverseSchema() + traverseComplexType() build clickhouse-name -> field_id map.
  • ColumnMapper::setStorageColumnEncoding() builds reverse field_id -> clickhouse_name lookup.

Integration boundaries:

  • Parquet metadata (parq::SchemaElement) and field ids.
  • Iceberg metadata-derived ColumnMapper.
  • Query-side sample block lookup by mapped names.

Error propagation:

  • Missing field-id mapping: throws ICEBERG_SPECIFICATION_VIOLATION.
  • Type/name mismatches: throws TYPE_MISMATCH / THERE_IS_NO_COLUMN / DUPLICATE_COLUMN.

3) Transition matrix

ID Transition Invariant checks
T1 Parquet node entry -> mapped component derivation (useColumnMapperIfNeeded) I1: field-id mapping deterministic; I2: top-level mapped name preserved fully; I3: nested child name derived without truncating dotted child names
T2 processSubtree appends mapped component to logical path I4: node.name remains stable as ClickHouse-visible path for subsequent lookups
T3 Tuple child iteration (processSubtreeTuple) -> child-name matching I5: lookup-by-name receives child component, not unrelated suffix; I6: no false duplicate from truncation
T4 Recursion + output column materialization I7: output columns correspond to requested logical names; missing handling unchanged
T5 Integration tests (table function + engine, s3/azure/local) I8: same logical result across read entrypoints and storages

Key invariants:

  • I1: Mapping is by field-id, not by textual path parsing.
  • I2: Top-level mapped names may include dots and must not be split.
  • I3: Nested mapped child names may include dots and must keep all segments after parent prefix.
  • I4: Path composition remains consistent between traversal and tuple lookup.
  • I5: Error contracts for missing/duplicate fields remain consistent with prior behavior.

4) Logical code-path testing summary

Reviewed reachable branches in changed logic:

  • column_mapper == nullptr -> unchanged passthrough (element.name).
  • element without field_id -> unchanged passthrough (element.name).
  • field_id missing in map -> specification violation exception.
  • current_path.empty() (top-level) -> returns full mapped name (new behavior).
  • nested path with exact current_path + "." prefix -> strips only parent prefix, preserving dotted child name.
  • nested path without expected prefix -> returns full mapped string (fallback branch).

Expected outcomes:

  • Success path: top-level dotted names are preserved and selected correctly.
  • Success path: nested dotted names are preserved after parent-prefix stripping.
  • Failure path: invalid field-id mapping still fail-closed with exception.
  • Contract consistency: tuple lookup-by-name uses mapped child names, avoiding prior last-dot truncation issue.

Malformed input / integration failure / timing branches:

  • Malformed mapping (missing field id): covered statically, fail-closed.
  • External metadata inconsistency: covered statically through exception paths.
  • Concurrency/timing: changed code is local, stateless string-view logic; no shared mutable state introduced.

5) Fault categories and category-by-category injection results

Fault category Status Outcome Defects found Notes
Mapping correctness (field-id -> logical name) Executed Pass 0 Top-level and nested behavior aligned with intended Iceberg mapping semantics
Path-prefix stripping and dotted-name preservation Executed Pass 0 New prefix-based stripping fixes last-segment truncation behavior
Error-contract consistency (missing field-id/mapping) Executed Pass 0 Existing fail-closed exceptions preserved
Integration parity (table function vs table engine) Executed Pass 0 New tests validate both access paths
Concurrency/race/deadlock Executed Pass 0 No locks/shared mutable state touched by this change
C++ lifetime / iterator invalidation / ownership Executed Pass 0 No new ownership or container-lifetime hazards in changed lines
Integer overflow/signedness Not Applicable N/A 0 No arithmetic or signedness-sensitive changes
Rollback/partial-update safety Not Applicable N/A 0 No mutation sequence/state transaction introduced
Performance/resource failure Executed Pass 0 Constant-time string prefix checks; no added allocations on hot path beyond existing String emplace in tuple list

Fault-category completion matrix status: complete (all in-scope categories executed or explicitly N/A with rationale).

6) Confirmed defects (High/Medium/Low)

No confirmed defects found in this PR from static analysis.

7) Coverage accounting + stop-condition status

Call-graph coverage:

  • Covered nodes: prepareForReading, inferSchema, processSubtree, processSubtreeTuple, useColumnMapperIfNeeded, ColumnMapper mapping source interactions.
  • Uncovered nodes: unrelated primitive decoding branches not impacted by this PR.

Transition coverage:

  • Reviewed transitions: T1-T5 (all in-scope transitions for changed behavior).
  • Skipped transitions: none in changed logic.

Fault-category coverage:

  • Executed: 7
  • Not Applicable: 2
  • Deferred: 0

Coverage stop-condition:

  • Met. Every in-scope call-graph node, transition, and fault category is reviewed or explicitly marked N/A with justification.

8) Assumptions & Limits

  • Static reasoning only; no runtime execution in this audit.
  • New integration test file was reviewed from PR diff artifact (not executed locally in this pass).
  • Confidence is limited by absence of live Spark/Iceberg environment and sanitizer runtime evidence for this specific branch.
  • Cross-version/backport drift outside changed lines was not re-audited end-to-end.

9) Confidence rating and confidence-raising evidence

Overall confidence: Medium-High.

Why not full High:

  • No dynamic run of the newly added integration tests in this pass.
  • No direct replay against varied real-world Iceberg schemas beyond logical reasoning.

Evidence that would raise confidence to High:

  • Execute tests/integration/test_storage_iceberg_with_spark/test_column_names_with_dots.py across s3, azure, and local.
  • Run targeted Parquet/Iceberg regression suites on this backport branch.
  • Confirm sanitizer-backed CI jobs for this PR branch (ASAN/TSAN/MSAN/UBSAN) pass.

10) Residual risks and untested paths

  • Residual risk: fallback branch in useColumnMapperIfNeeded (return mapped when prefix mismatch) depends on metadata/path consistency; behavior is reasonable but only dynamically provable with adversarial schemas.
  • Untested path: schemas with unusual naming collisions that are logically distinct in Iceberg metadata but textually similar when dot-joined.
  • Untested path: large/high-cardinality nested schemas for performance profiling of repeated mapping checks.
  • Cross-partition interaction risk: low, because change surface is localized to schema-name mapping and test coverage directly targets regression scenario.

@Selfeer Selfeer added the verified Approved for release label Mar 5, 2026
@mkmkme mkmkme added antalya-obsolete PR exists in higher Altinity release (doesn’t need forward porting) and removed port-antalya PRs to be ported to all new Antalya releases labels Apr 21, 2026
@mkmkme
Copy link
Copy Markdown
Collaborator Author

mkmkme commented Apr 21, 2026

Marking as antalya-obsolete since it was included in v26.2.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

antalya antalya-26.1 antalya-26.1.6.20001 antalya-obsolete PR exists in higher Altinity release (doesn’t need forward porting) backport Backport verified Approved for release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants