feat(parquet): add experimental VECTOR repetition for Arrow FixedSizeList #854
rok wants to merge 4 commits into
Review notes
Rebased locally onto current
Correctness: looks right
Worth addressing before graduating from experimental
Minor
Format / spec
The Thrift choices (
Follow-up: format/spec conformance
I checked the hand-applied Thrift against the in-flight parquet-format proposal.
The enum value matches, but the
Secondary, lower confidence: the Parquet C++ Option B prototype (
Thanks for the very thorough review, @zeroshade, much appreciated. Pushed
Format/spec: field-id alignment
Worth addressing before graduating
Minor
Left the statistics semantics and the group-vs-leaf representation as open proposal-convergence questions rather than changing them here. Thanks again!
Re-reviewed at
Remaining before this graduates from experimental (none blocking the proposal):
Everything else (the DataPageV1 no-offset-index double scan, element-level statistics, and leaf-vs-group representation) is fine to leave as documented proposal decisions.
…List

Add an experimental Parquet VECTOR FieldRepetitionType and map Arrow FixedSizeList<T, N> onto it, opt-in via pqarrow.WithVectorEncoding(). VECTOR stores fixed-shape-list data (e.g. embeddings) without per-element rep/def levels, dropping the 3-level LIST overhead for dense vectors.

This implements leaf-only: a VECTOR column is a single primitive leaf carrying vector_length (vector <element-type> <name> [N]), not a nested group. Only dense, non-nullable, top-level FixedSizeList with a fixed-width primitive element is encoded as VECTOR; everything else falls back to LIST. A VECTOR leaf adds no def/rep level, so the writer counts rows as values/vector_length, keeps each vector on a single page, and the reader rebuilds the FixedSizeList from the schema.

Format additions: FieldRepetitionType.VECTOR = 3 and SchemaElement.vector_length (field id 11), hand-applied to the generated parquet.go (Thrift 0.21.0 style) with parquet_vector.thrift as the IDL source of truth.

Not yet in apache/parquet-format: files written with VECTOR are unreadable by readers that don't understand it.
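The row accounting described in this commit message (rows derived as values/vector_length, with no vector split across pages) can be illustrated with a small standalone sketch. The helper names `vectorRows` and `valuesPerPage` are hypothetical and are not taken from the PR; the rule that a page budget smaller than one vector still holds one full vector is an assumption made here for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// vectorRows returns the number of parent rows represented by numValues leaf
// values in a column whose vector_length is vectorLength.
// Hypothetical helper: it only illustrates rows = values / vector_length.
func vectorRows(numValues, vectorLength int64) (int64, error) {
	if vectorLength <= 0 {
		return 0, errors.New("vector_length must be positive")
	}
	if numValues < 0 || numValues%vectorLength != 0 {
		return 0, fmt.Errorf("num_values %d is not a whole multiple of vector_length %d",
			numValues, vectorLength)
	}
	return numValues / vectorLength, nil
}

// valuesPerPage rounds a page's value budget down to a whole number of
// vectors so that no vector straddles a page boundary.
// Assumption: a budget smaller than one vector still yields one full vector.
func valuesPerPage(budget, vectorLength int64) int64 {
	if vectorLength <= 0 {
		return 0
	}
	n := (budget / vectorLength) * vectorLength
	if n < vectorLength {
		n = vectorLength
	}
	return n
}

func main() {
	rows, err := vectorRows(3072, 768)
	fmt.Println(rows, err) // 4 <nil>

	_, err = vectorRows(1000, 768)
	fmt.Println(err != nil) // true: 1000 is not a multiple of 768

	fmt.Println(valuesPerPage(2000, 768)) // 1536, i.e. two whole vectors
}
```

Rejecting a value count that is not a whole multiple of the vector length is the same check the reader is described as applying to malformed chunks.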
…n coverage

Address review feedback: add write->read tests asserting that VECTOR-ineligible FixedSizeList columns (nullable value type, nullable element, nested FixedSizeList element, and variable-width/string element) transparently fall back to the standard LIST encoding and round-trip losslessly; that multiple VECTOR columns and a VECTOR column mixed with a plain primitive column encode and read back correctly; and an explicit per-row-group parent-row-count check (rows, not leaf slots) across more than one row group.
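The fallback cases listed in this commit message amount to an eligibility predicate. The sketch below models it with hypothetical, simplified types (`elemKind`, `fixedSizeListField`, `encodingFor`); it does not use the Arrow or pqarrow APIs and is not the PR's implementation.

```go
package main

import "fmt"

// elemKind is a hypothetical, simplified stand-in for an Arrow element type.
type elemKind int

const (
	kindFixedWidthPrimitive elemKind = iota // e.g. float32, int64
	kindVariableWidth                       // e.g. string, binary
	kindNestedFixedSizeList                 // element is itself a FixedSizeList
)

// fixedSizeListField is a hypothetical description of a FixedSizeList column.
type fixedSizeListField struct {
	TopLevel     bool     // column is not nested inside another type
	Nullable     bool     // the list value itself may be null
	ElemNullable bool     // individual elements may be null
	Elem         elemKind // kind of the element type
}

// encodingFor returns "VECTOR" only for dense, non-nullable, top-level lists
// with a fixed-width primitive element, and only when the opt-in flag is set.
// Everything else falls back to "LIST".
func encodingFor(f fixedSizeListField, vectorEncoding bool) string {
	eligible := f.TopLevel &&
		!f.Nullable &&
		!f.ElemNullable &&
		f.Elem == kindFixedWidthPrimitive
	if vectorEncoding && eligible {
		return "VECTOR"
	}
	return "LIST"
}

func main() {
	dense := fixedSizeListField{TopLevel: true, Elem: kindFixedWidthPrimitive}
	strs := fixedSizeListField{TopLevel: true, Elem: kindVariableWidth}

	fmt.Println(encodingFor(dense, true))  // VECTOR
	fmt.Println(encodingFor(dense, false)) // LIST: the flag is off
	fmt.Println(encodingFor(strs, true))   // LIST: variable-width element
}
```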
Re-reviewed at
Format/spec conformance: now matches the draft exactly
I diffed the hand-applied Thrift against the current parquet-format draft (GH-430,
The hand-applied
Schema / level handling: solid
Prior items: status
Remaining before graduating from experimental (none blocking the proposal)
Everything I flagged previously is either resolved or consciously deferred with a doc note. For an experimental, opt-in proposal this is in good shape, and the move to the annotated 3-level group materially improves cross-implementation interop.
DO NOT MERGE. At this point this is a proposal meant to support discussion of a change to the Parquet format.
Rationale for this change
Arrow `FixedSizeList<T, N>` (embeddings, multidimensional-array scientific data, etc.) round-trips through Parquet today as a standard 3-level `LIST`, paying per-element repetition/definition levels for a shape that is fixed and known from the schema. In C++ we showed that a ~2-10x read performance improvement is possible, which motivates a denser encoding.

This adds an experimental Parquet `VECTOR` repetition type, the "Option B" design from the Fixed-size list type for Parquet proposal, that stores fixed-shape data without those inner levels.

Closes #855.
What changes are included in this PR?
- `FieldRepetitionType.VECTOR = 3` and `SchemaElement.vector_length` (field id 11), hand-applied to the generated `parquet/internal/gen-go/parquet/parquet.go` in the existing Thrift 0.21.0 generator style; `parquet/parquet_vector.thrift` vendors the IDL fragment as the source of truth.
- `VECTOR` leaf node (`NewPrimitiveNodeLogicalVector`), `vector_length` plumbing, level computation (`VECTOR` adds no def/rep level), and `NewSchemaChecked`, which returns an error instead of panicking on a malformed `VECTOR` schema.
- The writer counts rows as `values / vector_length` and keeps every data page on a whole-vector boundary across all write paths (`WriteBatch`, `WriteBatchSpaced`, `WriteBitmapBatchSpaced`, dictionary indices, FLBA); the reader supports row-ordinal seeking by value stride and rejects malformed `VECTOR` chunks (`num_values` not a whole multiple of, or inconsistent with, the row count).
- Opt-in via `pqarrow.WithVectorEncoding()`; eligible top-level `FixedSizeList` columns are written as `VECTOR` and reconstructed on read without a stored Arrow schema, and ineligible ones fall back to `LIST`. Works alongside `WithStoreSchema` (element timezone / field metadata are restored).

A `VECTOR` column is a single primitive leaf (`vector <element-type> <name> [N]`), not a nested group; the leaf carries `vector_length` and adds no definition/repetition level, so a dense vector has no inner levels.

Scope: dense, non-nullable, top-level `FixedSizeList` with a fixed-width primitive element. Every other `FixedSizeList` transparently falls back to `LIST`; nothing that writes today changes unless the flag is set.

Are these changes tested?
Yes. New tests cover:
- `LIST` fallback for every ineligible case;
- `VECTOR` data and of malformed `VECTOR` files;
- `WithStoreSchema` round-trip; and
- the schema/Thrift compact-protocol round-trip.

All new tests pass; the only failing tests in the suite are the pre-existing ones that need the `parquet-testing` data submodule / `PARQUET_TEST_DATA`.

Are there any user-facing changes?
Yes:
- `pqarrow.WithVectorEncoding()` (default off) and new schema helpers `schema.NewPrimitiveNodeLogicalVector` / `schema.NewSchemaChecked` / `schema.NewColumnChecked`.
- `file.ErrVectorBatchMisaligned`, returned by the typed column writers when a `VECTOR` column is given a partial-vector batch.
- `parquet.Repetitions.Vector` value; `parquet.Repetitions.Undefined` shifts from `3` to `4`.
- Files written with `VECTOR` are not readable by Parquet implementations that don't understand the `VECTOR` repetition type. This is the defining trade-off of the `VECTOR` repetition type and the reason it is strictly opt-in and documented as experimental until `VECTOR` is standardized in apache/parquet-format.

Potential follow-up: nullable vectors (spaced leaf materialization + def-level→validity collapse), struct elements, and nested vectors.