fix(parquet): support nested column reading in page-level filtered row groups (fallback to RowGroup reading) #363
Open
zhf999 wants to merge 10 commits into
Contributor
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR improves page-level filtering for Parquet reads when nested Arrow types (struct/list/map) are present by introducing a shared schema-flattening utility, building a leaf→field mapping, and adding a nested-column fallback read path.
Changes:
- Moved `FlattenSchema` into `parquet_schema_util` and updated call sites accordingly.
- Extended page-filtered row group reads to support nested columns via Arrow `FileReader` + `Take` filtering.
- Added new unit tests covering page filtering with nested struct/list/map columns and adjacent nested fields.
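The leaf→field mapping described in the overview can be illustrated with a small standalone sketch. This is not the PR's `FlattenSchema` implementation: the `Field` struct is a hypothetical stand-in for an Arrow field, and `FlattenField` and `BuildLeafToFieldIdx` are illustrative names. It assumes leaves are numbered depth-first, and that flat top-level fields map to -1 while leaves of nested fields map to the index of the owning top-level field.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an Arrow field: a field without children is a leaf.
struct Field {
    std::string name;
    std::vector<Field> children;
};

// Appends the leaf column indices covered by `field` to `out`,
// numbering leaves depth-first starting at `*next_leaf`.
void FlattenField(const Field& field, int* next_leaf, std::vector<int>* out) {
    if (field.children.empty()) {
        out->push_back((*next_leaf)++);
        return;
    }
    for (const Field& child : field.children) {
        FlattenField(child, next_leaf, out);
    }
}

// For every leaf column, returns the index of the owning top-level field,
// or -1 when the top-level field is flat (non-nested).
std::vector<int> BuildLeafToFieldIdx(const std::vector<Field>& schema) {
    std::vector<int> leaf_to_field;
    int next_leaf = 0;
    for (std::size_t i = 0; i < schema.size(); ++i) {
        std::vector<int> leaves;
        FlattenField(schema[i], &next_leaf, &leaves);
        assert(!leaves.empty());  // every field covers at least one leaf
        const bool nested = !schema[i].children.empty();
        for (std::size_t k = 0; k < leaves.size(); ++k) {
            leaf_to_field.push_back(nested ? static_cast<int>(i) : -1);
        }
    }
    return leaf_to_field;
}
```

For a schema with a flat `id` column, a two-member struct, and a list, the sketch yields one -1 entry followed by the owning field index for each nested leaf, so a reader can tell from a leaf index alone whether the page-level path or the fallback path applies.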
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/paimon/format/parquet/parquet_schema_util.h | Declares shared FlattenSchema helper for leaf index flattening. |
| src/paimon/format/parquet/parquet_schema_util.cpp | Implements FlattenSchema in a common utility. |
| src/paimon/format/parquet/parquet_file_batch_reader.h | Removes in-class FlattenSchema implementation. |
| src/paimon/format/parquet/parquet_file_batch_reader.cpp | Includes schema util header for shared flattening logic. |
| src/paimon/format/parquet/page_filtered_row_group_reader.h | Adds nested-column fallback APIs and adjusts ReadFilteredRowGroup signature. |
| src/paimon/format/parquet/page_filtered_row_group_reader.cpp | Implements nested-column read + Take filtering and column assembly logic. |
| src/paimon/format/parquet/file_reader_wrapper.h | Stores num_cols_ and leaf_to_field_idx_ for nested schema mapping. |
| src/paimon/format/parquet/file_reader_wrapper.cpp | Builds leaf→field mapping, adjusts prebuffering for nested columns, and passes mapping to row group reader. |
| src/paimon/format/parquet/page_filtered_row_group_reader_test.cpp | Adds regression + coverage tests for nested page filtering behavior. |
Comment on lines +233 to +241

    Result<std::shared_ptr<arrow::Array>> PageFilteredRowGroupReader::BuildTakeIndices(
        const RowRanges& row_ranges, int64_t expected_rows, std::shared_ptr<::arrow::MemoryPool> pool) {
        arrow::Int64Builder builder(pool.get());
        PAIMON_RETURN_NOT_OK_FROM_ARROW(builder.Reserve(expected_rows));
        for (const auto& range : row_ranges.GetRanges()) {
            for (int64_t row = range.from; row <= range.to; ++row) {
                builder.UnsafeAppend(row);
            }
        }
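To make the behavior of the snippet under review easier to reason about, here is a minimal standalone sketch of the same idea without Arrow. `RowRange`, `BuildTakeIndicesSketch`, and `TakeRows` are hypothetical names, not code from the PR. It assumes both range ends are inclusive, as the `row <= range.to` loop above suggests, and it computes the reserve size from the ranges themselves rather than trusting a caller-supplied row count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one entry of RowRanges; both ends are inclusive.
struct RowRange {
    int64_t from;
    int64_t to;
};

// Expands inclusive row ranges into an explicit list of row indices.
std::vector<int64_t> BuildTakeIndicesSketch(const std::vector<RowRange>& ranges) {
    int64_t total = 0;
    for (const RowRange& r : ranges) {
        assert(r.from <= r.to);  // reject empty or inverted ranges
        total += r.to - r.from + 1;
    }
    std::vector<int64_t> indices;
    indices.reserve(static_cast<std::size_t>(total));
    for (const RowRange& r : ranges) {
        for (int64_t row = r.from; row <= r.to; ++row) {
            indices.push_back(row);
        }
    }
    return indices;
}

// Selects the rows at `indices` from a fully materialized column, which is
// the role a Take-style filter plays after a whole row group has been read.
template <typename T>
std::vector<T> TakeRows(const std::vector<T>& column, const std::vector<int64_t>& indices) {
    std::vector<T> out;
    out.reserve(indices.size());
    for (int64_t i : indices) {
        out.push_back(column.at(static_cast<std::size_t>(i)));  // bounds-checked
    }
    return out;
}
```

Deriving the capacity from the ranges keeps the reserve and the append loop in agreement by construction; in the Arrow version, an unchecked append is only safe when the reserved count covers every appended row.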
Purpose
When page-level filtering (via ColumnIndex/OffsetIndex) is enabled, reading nested columns (struct, list, map) fails because the page-level skip/read pattern in `RecordReader` does not correctly handle repetition/definition levels of nested Parquet columns.
This PR fixes the issue by falling back to RowGroup-level reading for nested columns, then filtering rows with `arrow::compute::Take`. Non-nested (flat) columns continue to use page-level skip/read as before.
Key changes:
- `PageFilteredRowGroupReader`: Added `ReadNestedColumns()`, which reads the entire row group for nested columns via `arrow::FileReader::ReadRowGroup()` and applies `arrow::compute::Take` with row indices built from `RowRanges`. Added `AssembleFilteredColumns()` to dispatch each field to either page-filtered reading (flat) or the pre-read nested columns map. Added `BuildTakeIndices()` to convert `RowRanges` into an Int64 index array for Take. `ReadFilteredRowGroup` now accepts `arrow::FileReader*` and `leaf_to_field_idx` to support the nested column fallback path.
- `FileReaderWrapper`: Added `leaf_to_field_idx_` (mapping from Parquet leaf column index to owning Arrow field index, -1 for non-nested) and passes it through to `ReadFilteredRowGroup`. Updated `BuildPageFilteredSchema()` to use `FlattenSchema()` for correct nested-to-leaf mapping instead of assuming 1:1 leaf-to-field correspondence. Updated `CollectPreBufferRanges()` to pre-buffer entire column chunks for nested columns (since they use full-RG read) while still using page-level ranges for flat columns.
- `ParquetFileBatchReader`: Removed stale member variables (`read_ranges_`, `read_row_groups_`, `read_column_indices_`) that are no longer needed after the refactoring in `zhf-refractor`.
- `parquet_schema_util`: Added `FlattenSchema()` to recursively flatten nested Arrow types into constituent Parquet leaf column indices.
Tests
- `NestedStructColumnPageFilter`: struct column with page filtering, reading both flat and nested fields.
- `NestedStructColumnOnlyReadNestedField`: reading only the nested struct column (predicate column excluded from read schema).
- `NestedListColumnPageFilter`: list column with page filtering.
- `NestedMapColumnPageFilter`: map column with page filtering.
- `MultipleAdjacentNestedColumns`: two adjacent nested columns (struct + list) in the same schema.
API and Format
No. Internal implementation change only. No public API or storage format changes.
Documentation
No.
Generative AI tooling
Claude opus 4.7