Skip to content

Parquet: omit min/max for interval columns when writing stats#5147

Merged
tustvold merged 2 commits into
apache:masterfrom
Jefffrey:omit_interval_min_max_stats
Nov 30, 2023
Merged

Parquet: omit min/max for interval columns when writing stats#5147
tustvold merged 2 commits into
apache:masterfrom
Jefffrey:omit_interval_min_max_stats

Conversation

@Jefffrey

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #5145

Rationale for this change

What changes are included in this PR?

Add extra checks before calculating min/max for chunks/pages, to ignore Interval columns

Are there any user-facing changes?

@github-actions github-actions Bot added the parquet Changes to the parquet crate label Nov 29, 2023

@tustvold tustvold left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me

@tustvold

Copy link
Copy Markdown
Contributor

What ColumnOrder are we currently writing for these columns?

@Jefffrey

Copy link
Copy Markdown
Contributor Author

What ColumnOrder are we currently writing for these columns?

I'm not sure, actually. I tried running this test in arrow_writer/mod.rs on master branch:

    #[test]
    fn test_123() {
        let a = Int32Array::from(vec![1, 2, 3, 4, 5]);
        let b = IntervalDayTimeArray::from(vec![0; 5]);
        let batch = RecordBatch::try_from_iter(vec![
            ("a", Arc::new(a) as ArrayRef),
            ("b", Arc::new(b) as ArrayRef),
        ])
        .unwrap();

        let mut buf = Vec::with_capacity(1024);
        let mut writer = ArrowWriter::try_new(&mut buf, batch.schema(), None).unwrap();
        writer.write(&batch).unwrap();
        writer.close().unwrap();

        let bytes = Bytes::from(buf);
        let options = ReadOptionsBuilder::new().with_page_index().build();
        let reader = SerializedFileReader::new_with_options(bytes, options).unwrap();
        dbg!(reader.metadata().file_metadata().column_orders());
    }

Running:

arrow-rs$ cargo test -p parquet --lib arrow::arrow_writer::tests::test_123 -- --nocapture --exact
    Blocking waiting for file lock on build directory
   Compiling parquet v49.0.0 (/home/jeffrey/Code/arrow-rs/parquet)
    Finished test [unoptimized + debuginfo] target(s) in 11.49s
     Running unittests src/lib.rs (/media/jeffrey/1tb_860evo_ssd/.cargo_target_cache/debug/deps/parquet-a4f7a499e85a325c)

running 1 test
[parquet/src/arrow/arrow_writer/mod.rs:2760] reader.metadata().file_metadata().column_orders() = None
test arrow::arrow_writer::tests::test_123 ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 667 filtered out; finished in 0.00s

Even when I change it to only write the Int32Array, it is still none.

Not sure if I'm doing something wrong here?

@Jefffrey

Jefffrey commented Nov 30, 2023

Copy link
Copy Markdown
Contributor Author

I noticed this:

let file_metadata = parquet::FileMetaData {
num_rows,
row_groups,
key_value_metadata,
version: self.props.writer_version().as_num(),
schema: types::to_thrift(self.schema.as_ref())?,
created_by: Some(self.props.created_by().to_owned()),
column_orders: None,
encryption_algorithm: None,
footer_signing_key_metadata: None,
};

  • Unsure if there are other places to consider

Looks like might be a separate issue, to implement writing ColumnOrder

@tustvold tustvold merged commit f621d28 into apache:master Nov 30, 2023
@Jefffrey Jefffrey deleted the omit_interval_min_max_stats branch November 30, 2023 19:10
@Jefffrey

Copy link
Copy Markdown
Contributor Author

Raised #5152 for the column order issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parquet: Interval columns shouldn't write min/max stats

2 participants