fix(parquet): support nested column reading in page-level filtered row groups (fallback to RowGroup reading) by zhf999 · Pull Request #363 · alibaba/paimon-cpp

zhf999 · 2026-06-12T03:31:09Z

Purpose

When page-level filtering (via ColumnIndex/OffsetIndex) is enabled, reading nested columns (struct, list, map) fails because the page-level skip/read pattern in RecordReader does not correctly handle repetition/definition levels of nested Parquet columns.

This PR fixes the issue by falling back to RowGroup-level reading for nested columns, then filtering rows with arrow::compute::Take. Non-nested (flat) columns continue to use page-level skip/read as before.
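
The fallback described above can be illustrated with a small stdlib-only model. It is a sketch, not the PR's code: `ListRow`, `RowRange` and `TakeRows` are hypothetical names, the nested column is modeled as a vector of rows that has already been decoded for the whole row group, and ranges are assumed to be inclusive and relative to the row group.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one row of a list<int32> column.
using ListRow = std::vector<int32_t>;

// Hypothetical inclusive row range [from, to], relative to the row group.
struct RowRange {
    int64_t from;
    int64_t to;
};

// Fallback path for a nested column: `full_column` holds every row of the
// row group; keep only the rows selected by `ranges`, in range order.
std::vector<ListRow> TakeRows(const std::vector<ListRow>& full_column,
                              const std::vector<RowRange>& ranges) {
    std::vector<ListRow> out;
    for (const RowRange& r : ranges) {
        assert(r.from >= 0 && r.from <= r.to);
        assert(r.to < static_cast<int64_t>(full_column.size()));
        for (int64_t row = r.from; row <= r.to; ++row) {
            out.push_back(full_column[static_cast<std::size_t>(row)]);
        }
    }
    return out;
}
```

Because selection happens on whole rows after decoding, the repetition/definition levels never have to be split at page boundaries, at the cost of decoding rows that are later discarded.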

Key changes:

PageFilteredRowGroupReader: Added ReadNestedColumns() which reads the entire row group for nested columns via arrow::FileReader::ReadRowGroup() and applies arrow::compute::Take with row indices built from RowRanges. Added AssembleFilteredColumns() to dispatch each field to either page-filtered reading (flat) or the pre-read nested columns map. Added BuildTakeIndices() to convert RowRanges into an Int64 index array for Take. ReadFilteredRowGroup now accepts arrow::FileReader* and leaf_to_field_idx to support the nested column fallback path.
FileReaderWrapper: Added leaf_to_field_idx_ (mapping from Parquet leaf column index to owning Arrow field index, -1 for non-nested) and passes it through to ReadFilteredRowGroup. Updated BuildPageFilteredSchema() to use FlattenSchema() for correct nested-to-leaf mapping instead of assuming 1:1 leaf-to-field correspondence. Updated CollectPreBufferRanges() to pre-buffer entire column chunks for nested columns (since they use full-RG read) while still using page-level ranges for flat columns.
ParquetFileBatchReader: Removed stale member variables (read_ranges_, read_row_groups_, read_column_indices_) that are no longer needed after the refactoring in zhf-refractor.
parquet_schema_util: Added FlattenSchema() to recursively flatten nested Arrow types into constituent Parquet leaf column indices.
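
To make the leaf-to-field mapping concrete, here is a toy model of the flattening step. Everything in it is an assumption for illustration: `ToyField`, `Leaf`, `Nested`, `FlattenSchemaSketch` and `BuildLeafToFieldIdx` are hypothetical names, a field with no children is treated as a primitive leaf, and leaves are numbered depth-first. The real `FlattenSchema()` works on Arrow types and may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy schema node: no children means a primitive (flat) leaf column.
struct ToyField {
    std::string name;
    std::vector<ToyField> children;
};

ToyField Leaf(std::string name) { return ToyField{std::move(name), {}}; }
ToyField Nested(std::string name, std::vector<ToyField> children) {
    return ToyField{std::move(name), std::move(children)};
}

// Depth-first walk that assigns consecutive leaf indices.
void CollectLeaves(const ToyField& field, int* next_leaf, std::vector<int>* out) {
    assert(next_leaf != nullptr && out != nullptr);
    if (field.children.empty()) {
        out->push_back((*next_leaf)++);
        return;
    }
    for (const ToyField& child : field.children) {
        CollectLeaves(child, next_leaf, out);
    }
}

// For each top-level field, the leaf column indices it owns.
std::vector<std::vector<int>> FlattenSchemaSketch(const std::vector<ToyField>& schema) {
    std::vector<std::vector<int>> leaves_per_field(schema.size());
    int next_leaf = 0;
    for (std::size_t i = 0; i < schema.size(); ++i) {
        CollectLeaves(schema[i], &next_leaf, &leaves_per_field[i]);
    }
    return leaves_per_field;
}

// For each leaf: owning top-level field index if nested, -1 if flat.
std::vector<int> BuildLeafToFieldIdx(const std::vector<ToyField>& schema) {
    std::vector<int> leaf_to_field;
    const std::vector<std::vector<int>> leaves = FlattenSchemaSketch(schema);
    for (std::size_t i = 0; i < schema.size(); ++i) {
        const bool nested = !schema[i].children.empty();
        for (std::size_t k = 0; k < leaves[i].size(); ++k) {
            leaf_to_field.push_back(nested ? static_cast<int>(i) : -1);
        }
    }
    return leaf_to_field;
}
```

For a schema of `id`, a two-member struct, a list and a map, this yields six leaves, which is why a 1:1 leaf-to-field assumption breaks as soon as one nested field is present.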

Tests

NestedStructColumnPageFilter: struct column with page filtering, reading both flat and nested fields.
NestedStructColumnOnlyReadNestedField: reading only the nested struct column (predicate column excluded from read schema).
NestedListColumnPageFilter: list column with page filtering.
NestedMapColumnPageFilter: map column with page filtering.
MultipleAdjacentNestedColumns: two adjacent nested columns (struct + list) in the same schema.

API and Format

No. Internal implementation change only. No public API or storage format changes.

Documentation

No.

Generative AI tooling

Claude opus 4.7

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR improves page-level filtering for Parquet reads when nested Arrow types (struct/list/map) are present by introducing a shared schema-flattening utility, building a leaf→field mapping, and adding a nested-column fallback read path.

Changes:

Moved FlattenSchema into parquet_schema_util and updated call sites accordingly.
Extended page-filtered row group reads to support nested columns via Arrow FileReader + Take filtering.
Added new unit tests covering page filtering with nested struct/list/map columns and adjacent nested fields.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.


| File | Description |
| --- | --- |
| src/paimon/format/parquet/parquet_schema_util.h | Declares shared `FlattenSchema` helper for leaf index flattening. |
| src/paimon/format/parquet/parquet_schema_util.cpp | Implements `FlattenSchema` in a common utility. |
| src/paimon/format/parquet/parquet_file_batch_reader.h | Removes in-class `FlattenSchema` implementation. |
| src/paimon/format/parquet/parquet_file_batch_reader.cpp | Includes schema util header for shared flattening logic. |
| src/paimon/format/parquet/page_filtered_row_group_reader.h | Adds nested-column fallback APIs and adjusts `ReadFilteredRowGroup` signature. |
| src/paimon/format/parquet/page_filtered_row_group_reader.cpp | Implements nested-column read + `Take` filtering and column assembly logic. |
| src/paimon/format/parquet/file_reader_wrapper.h | Stores `num_cols_` and `leaf_to_field_idx_` for nested schema mapping. |
| src/paimon/format/parquet/file_reader_wrapper.cpp | Builds leaf→field mapping, adjusts prebuffering for nested columns, and passes mapping to row group reader. |
| src/paimon/format/parquet/page_filtered_row_group_reader_test.cpp | Adds regression + coverage tests for nested page filtering behavior. |
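
The prebuffering adjustment in `file_reader_wrapper.cpp` can be pictured roughly as follows. This is a sketch under assumptions: `ByteRange`, `ToyColumnChunk` and `CollectPreBufferRangesSketch` are hypothetical names, dictionary pages are ignored, and a leaf counts as nested when its `leaf_to_field_idx` entry is non-negative, as the PR description states.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical byte range inside the Parquet file.
struct ByteRange {
    int64_t offset;
    int64_t length;
};

// Hypothetical view of one leaf column chunk in a row group.
struct ToyColumnChunk {
    ByteRange whole_chunk;                  // full column chunk
    std::vector<ByteRange> selected_pages;  // pages surviving the page filter
};

// Nested leaves are read via the full-row-group path, so the whole chunk is
// pre-buffered; flat leaves only need the pages the filter kept.
std::vector<ByteRange> CollectPreBufferRangesSketch(
    const std::vector<ToyColumnChunk>& leaves,
    const std::vector<int>& leaf_to_field_idx) {
    assert(leaves.size() == leaf_to_field_idx.size());
    std::vector<ByteRange> out;
    for (std::size_t i = 0; i < leaves.size(); ++i) {
        if (leaf_to_field_idx[i] >= 0) {
            out.push_back(leaves[i].whole_chunk);
        } else {
            out.insert(out.end(), leaves[i].selected_pages.begin(),
                       leaves[i].selected_pages.end());
        }
    }
    return out;
}
```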


+Result<std::shared_ptr<arrow::Array>> PageFilteredRowGroupReader::BuildTakeIndices(
+    const RowRanges& row_ranges, int64_t expected_rows, std::shared_ptr<::arrow::MemoryPool> pool) {
+    arrow::Int64Builder builder(pool.get());
+    PAIMON_RETURN_NOT_OK_FROM_ARROW(builder.Reserve(expected_rows));
+    for (const auto& range : row_ranges.GetRanges()) {
+        for (int64_t row = range.from; row <= range.to; ++row) {
+            builder.UnsafeAppend(row);
+        }
+    }
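
The hunk above reserves `expected_rows` slots and then calls `UnsafeAppend`, which in Arrow skips the capacity check, so the loop presumably relies on `expected_rows` matching the number of rows covered by `row_ranges`. The stdlib-only sketch below models the same index construction and makes that assumption explicit. `RowRange` and `BuildTakeIndicesSketch` are hypothetical names, and the count check is an illustration rather than something the PR is known to do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical inclusive row range [from, to], mirroring range.from/range.to.
struct RowRange {
    int64_t from;
    int64_t to;
};

// Expands inclusive ranges into explicit row indices for a Take-style gather.
// Returns false (leaving *indices empty) if a range is malformed or the
// ranges do not cover exactly `expected_rows` rows.
bool BuildTakeIndicesSketch(const std::vector<RowRange>& ranges,
                            int64_t expected_rows,
                            std::vector<int64_t>* indices) {
    assert(indices != nullptr);
    indices->clear();
    int64_t total = 0;
    for (const RowRange& r : ranges) {
        if (r.from < 0 || r.to < r.from) return false;
        total += r.to - r.from + 1;
    }
    if (total != expected_rows) return false;
    indices->reserve(static_cast<std::size_t>(total));
    for (const RowRange& r : ranges) {
        for (int64_t row = r.from; row <= r.to; ++row) {
            indices->push_back(row);
        }
    }
    return true;
}
```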


zhf999 added 4 commits June 12, 2026 11:23

rebase: resolve conflics

dd25c8a

refractor: split ReadFilteredRowGroup into smaller functions

a6d64a1

fix: add the missing brace in page_filtered_row_group_reader_test.cpp

36343e2

fix: remove duplicated variable declaring

60b7cd2

Copilot AI review requested due to automatic review settings June 12, 2026 03:31

Copilot AI reviewed Jun 12, 2026


zhf999 and others added 6 commits June 12, 2026 14:22

fix: use field index instead of field name to map nested columns

b8fbd09

refractor: removed redundant parameteres

e66d7f6

fix: the order of read schema is no more sorted automatically

b2141e9

refractor: change the loop base.

9282168

Merge branch 'main' into nested-col-fix

7152092

style: remove unused variables

58d2f8b

zhf999 wants to merge 10 commits into alibaba:main from zhf999:nested-col-fix

2 participants
