fix(parquet): support nested column reading in page-level filtered row groups (fallback to RowGroup reading) #363

Open
zhf999 wants to merge 10 commits into alibaba:main from zhf999:nested-col-fix

@zhf999 zhf999 commented Jun 12, 2026

Contributor

Purpose

When page-level filtering (via ColumnIndex/OffsetIndex) is enabled, reading nested columns (struct, list, map) fails because the page-level skip/read pattern in RecordReader does not correctly handle repetition/definition levels of nested Parquet columns.

This PR fixes the issue by falling back to RowGroup-level reading for nested columns, then filtering rows with arrow::compute::Take. Non-nested (flat) columns continue to use page-level skip/read as before.
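The fallback described above (decode the whole row group, then keep only the rows that survived page-level filtering) can be illustrated without Arrow. The following is a minimal sketch using plain `std::vector` in place of Arrow arrays; `RowRange`, `BuildIndices`, and `TakeRows` are hypothetical names chosen for illustration, not the PR's API, and the ranges are assumed to be inclusive on both ends.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of RowRanges; both bounds are inclusive.
struct RowRange {
    int64_t from;
    int64_t to;
};

// Expand inclusive ranges into an explicit, ordered list of row indices.
std::vector<int64_t> BuildIndices(const std::vector<RowRange>& ranges) {
    std::vector<int64_t> indices;
    for (const RowRange& r : ranges) {
        for (int64_t row = r.from; row <= r.to; ++row) {
            indices.push_back(row);
        }
    }
    return indices;
}

// Gather the selected rows from a fully decoded column (the "Take" step).
// at() throws std::out_of_range if an index exceeds the column length.
template <typename T>
std::vector<T> TakeRows(const std::vector<T>& column,
                        const std::vector<int64_t>& indices) {
    std::vector<T> out;
    out.reserve(indices.size());
    for (int64_t i : indices) {
        out.push_back(column.at(static_cast<std::size_t>(i)));
    }
    return out;
}
```

For a six-row column and the ranges [0, 1] and [4, 5], `TakeRows` keeps rows 0, 1, 4, and 5. The trade-off is that the nested column is decoded in full even when only a few rows are selected, which is the cost of avoiding level-aware skipping.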

Key changes:

  • PageFilteredRowGroupReader: Added ReadNestedColumns() which reads the entire row group for nested columns via arrow::FileReader::ReadRowGroup() and applies arrow::compute::Take with row indices built from RowRanges. Added AssembleFilteredColumns() to dispatch each field to either page-filtered reading (flat) or the pre-read nested columns map. Added BuildTakeIndices() to convert RowRanges into an Int64 index array for Take. ReadFilteredRowGroup now accepts arrow::FileReader* and leaf_to_field_idx to support the nested column fallback path.
  • FileReaderWrapper: Added leaf_to_field_idx_ (mapping from Parquet leaf column index to owning Arrow field index, -1 for non-nested) and passes it through to ReadFilteredRowGroup. Updated BuildPageFilteredSchema() to use FlattenSchema() for correct nested-to-leaf mapping instead of assuming 1:1 leaf-to-field correspondence. Updated CollectPreBufferRanges() to pre-buffer entire column chunks for nested columns (since they use full-RG read) while still using page-level ranges for flat columns.
  • ParquetFileBatchReader: Removed stale member variables (read_ranges_, read_row_groups_, read_column_indices_) that are no longer needed after the refactoring in zhf-refractor.
  • parquet_schema_util: Added FlattenSchema() to recursively flatten nested Arrow types into constituent Parquet leaf column indices.
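The leaf-to-field mapping in the list above can be sketched with a toy schema model. This is an illustration only, under the assumption that leaves are numbered depth-first in schema order: `Field`, `FlattenToLeafIndices`, and `BuildLeafToFieldIdx` are hypothetical names, not the signatures of `FlattenSchema()` or `leaf_to_field_idx_` in the PR. The `-1` convention for flat columns follows the description.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified field: a field without children is a leaf column.
struct Field {
    std::string name;
    std::vector<Field> children;
};

// Depth-first walk that assigns consecutive leaf indices.
void CollectLeaves(const Field& f, int* next_leaf, std::vector<int>* leaves) {
    if (f.children.empty()) {
        leaves->push_back((*next_leaf)++);
        return;
    }
    for (const Field& c : f.children) {
        CollectLeaves(c, next_leaf, leaves);
    }
}

// For each top-level field, the leaf column indices it owns.
std::vector<std::vector<int>> FlattenToLeafIndices(const std::vector<Field>& fields) {
    std::vector<std::vector<int>> result;
    int next_leaf = 0;
    for (const Field& f : fields) {
        std::vector<int> leaves;
        CollectLeaves(f, &next_leaf, &leaves);
        result.push_back(leaves);
    }
    return result;
}

// Leaf index -> owning top-level field index, or -1 when the field is flat.
std::vector<int> BuildLeafToFieldIdx(const std::vector<Field>& fields) {
    std::vector<int> mapping;
    const std::vector<std::vector<int>> flat = FlattenToLeafIndices(fields);
    for (std::size_t i = 0; i < fields.size(); ++i) {
        const bool nested = !fields[i].children.empty();
        for (std::size_t k = 0; k < flat[i].size(); ++k) {
            mapping.push_back(nested ? static_cast<int>(i) : -1);
        }
    }
    return mapping;
}
```

With the fields `id` (flat), `s` (struct of `a`, `b`), `l` (list of `element`), and `v` (flat), the leaves are numbered 0 through 4 and the mapping is `{-1, 1, 1, 2, -1}`. A reader could then route any leaf with a non-negative entry to the full row group path.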

Tests

  • NestedStructColumnPageFilter: struct column with page filtering, reading both flat and nested fields.
  • NestedStructColumnOnlyReadNestedField: reading only the nested struct column (predicate column excluded from read schema).
  • NestedListColumnPageFilter: list column with page filtering.
  • NestedMapColumnPageFilter: map column with page filtering.
  • MultipleAdjacentNestedColumns: two adjacent nested columns (struct + list) in the same schema.

API and Format

No. Internal implementation change only. No public API or storage format changes.

Documentation

No.

Generative AI tooling

Claude Opus 4.7

Copilot AI review requested due to automatic review settings June 12, 2026 03:31

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR improves page-level filtering for Parquet reads when nested Arrow types (struct/list/map) are present by introducing a shared schema-flattening utility, building a leaf→field mapping, and adding a nested-column fallback read path.

Changes:

  • Moved FlattenSchema into parquet_schema_util and updated call sites accordingly.
  • Extended page-filtered row group reads to support nested columns via Arrow FileReader + Take filtering.
  • Added new unit tests covering page filtering with nested struct/list/map columns and adjacent nested fields.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/paimon/format/parquet/parquet_schema_util.h | Declares shared FlattenSchema helper for leaf index flattening. |
| src/paimon/format/parquet/parquet_schema_util.cpp | Implements FlattenSchema in a common utility. |
| src/paimon/format/parquet/parquet_file_batch_reader.h | Removes in-class FlattenSchema implementation. |
| src/paimon/format/parquet/parquet_file_batch_reader.cpp | Includes schema util header for shared flattening logic. |
| src/paimon/format/parquet/page_filtered_row_group_reader.h | Adds nested-column fallback APIs and adjusts ReadFilteredRowGroup signature. |
| src/paimon/format/parquet/page_filtered_row_group_reader.cpp | Implements nested-column read + Take filtering and column assembly logic. |
| src/paimon/format/parquet/file_reader_wrapper.h | Stores num_cols_ and leaf_to_field_idx_ for nested schema mapping. |
| src/paimon/format/parquet/file_reader_wrapper.cpp | Builds leaf→field mapping, adjusts prebuffering for nested columns, and passes mapping to row group reader. |
| src/paimon/format/parquet/page_filtered_row_group_reader_test.cpp | Adds regression + coverage tests for nested page filtering behavior. |

Comment thread src/paimon/format/parquet/file_reader_wrapper.cpp Outdated
Comment thread src/paimon/format/parquet/file_reader_wrapper.cpp Outdated
Comment thread src/paimon/format/parquet/page_filtered_row_group_reader.cpp Outdated
Comment on lines +233 to +241
    Result<std::shared_ptr<arrow::Array>> PageFilteredRowGroupReader::BuildTakeIndices(
        const RowRanges& row_ranges, int64_t expected_rows, std::shared_ptr<::arrow::MemoryPool> pool) {
        arrow::Int64Builder builder(pool.get());
        PAIMON_RETURN_NOT_OK_FROM_ARROW(builder.Reserve(expected_rows));
        for (const auto& range : row_ranges.GetRanges()) {
            for (int64_t row = range.from; row <= range.to; ++row) {
                builder.UnsafeAppend(row);
            }
        }
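
The excerpt above reserves `expected_rows` slots and then appends without capacity checks, so it is only safe when the ranges cover at most `expected_rows` rows. One possible guard, sketched here with `std::vector` instead of an Arrow builder, is to count the rows in the ranges first and reject a mismatch. `RowRange`, `CountRows`, and `BuildIndicesChecked` are hypothetical names for illustration, and this is not necessarily what the review comment proposed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one entry of RowRanges; both bounds are inclusive.
struct RowRange {
    int64_t from;
    int64_t to;
};

// Total number of rows covered by the inclusive ranges (empty ranges count as 0).
int64_t CountRows(const std::vector<RowRange>& ranges) {
    int64_t total = 0;
    for (const RowRange& r : ranges) {
        if (r.to >= r.from) {
            total += r.to - r.from + 1;
        }
    }
    return total;
}

// Fills *out with the row indices only if the ranges cover exactly
// expected_rows rows; returns false (leaving *out untouched) otherwise.
bool BuildIndicesChecked(const std::vector<RowRange>& ranges,
                         int64_t expected_rows,
                         std::vector<int64_t>* out) {
    if (CountRows(ranges) != expected_rows) {
        return false;
    }
    out->clear();
    out->reserve(static_cast<std::size_t>(expected_rows));
    for (const RowRange& r : ranges) {
        for (int64_t row = r.from; row <= r.to; ++row) {
            out->push_back(row);
        }
    }
    return true;
}
```

In an Arrow-based version, the same check would run before `Reserve`, or the reservation could use the counted total directly instead of a caller-supplied value.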
Comment thread src/paimon/format/parquet/parquet_schema_util.h