Support an experimental Parquet VECTOR repetition type for Arrow FixedSizeList #855

Description

@rok

Problem

Arrow FixedSizeList<T, N> is the natural type for fixed-shape data (embeddings, images, multidimensional scientific arrays) where every value has exactly N elements and the shape is known from the schema. Today pqarrow round-trips it through Parquet as a standard 3-level LIST, writing per-element repetition and definition levels for a length that never varies. For wide dense vectors that is pure overhead; on C++ we showed that ~2-10x faster reads are possible, which motivates a denser encoding.

Proposal

Add an experimental Parquet VECTOR FieldRepetitionType that stores a fixed number of element values per row directly, without per-element rep/def levels, and map Arrow FixedSizeList onto it. This is the "Option B" design from the Fixed-size list type for Parquet proposal (and the arrow-cpp prototype, rok/arrow#51).

The initial proposal is leaf-only, but we leave the door open to non-leaf cases later:

  • A VECTOR column is a single primitive leaf carrying vector_length (vector <element-type> <name> [N]), not a nested group.
  • Only dense, non-nullable, top-level FixedSizeList columns with a fixed-width primitive element are encoded as VECTOR. Everything else (nullable value or element, zero-length, variable-width/dictionary/extension/struct/nested-list element, or a nested FixedSizeList) transparently falls back to the standard LIST encoding. Nullable, struct, and nested vectors are follow-ups.
  • Opt-in on the writer via pqarrow.WithVectorEncoding(); reading is automatic.
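The eligibility rule in the list above can be sketched as a single predicate. This is an illustration only: `fslColumn`, `elemKind`, and `canEncodeAsVector` are hypothetical names that do not exist in pqarrow, and the real check would inspect Arrow field and type metadata rather than a hand-built struct.

```go
package main

import "fmt"

// elemKind is a hypothetical, simplified classification of the
// FixedSizeList element type.
type elemKind int

const (
	fixedWidthPrimitive elemKind = iota
	variableWidth
	dictionary
	extension
	structElem
	nestedList
	nestedFixedSizeList
)

// fslColumn is a hypothetical summary of a FixedSizeList column.
type fslColumn struct {
	TopLevel     bool     // not nested inside another type
	Nullable     bool     // the list value itself may be null
	ElemNullable bool     // individual elements may be null
	ListSize     int32    // N in FixedSizeList<T, N>
	Elem         elemKind // kind of the element type T
}

// canEncodeAsVector reports whether the column qualifies for VECTOR.
// Anything that fails falls back to the standard LIST encoding.
func canEncodeAsVector(c fslColumn) bool {
	return c.TopLevel &&
		!c.Nullable &&
		!c.ElemNullable &&
		c.ListSize > 0 &&
		c.Elem == fixedWidthPrimitive
}

func main() {
	embedding := fslColumn{TopLevel: true, ListSize: 768, Elem: fixedWidthPrimitive}
	nullable := embedding
	nullable.Nullable = true

	fmt.Println(canEncodeAsVector(embedding)) // true
	fmt.Println(canEncodeAsVector(nullable))  // false
}
```

Keeping the rule in one predicate means the fallback stays transparent: a writer that opts in can still be handed any schema, and only qualifying columns change encoding.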

Format additions (not yet in apache/parquet-format): FieldRepetitionType.VECTOR = 3 and SchemaElement.vector_length (field id 11).
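As a sketch of how these two additions fit together, the snippet below models them with plain Go types. The existing values REQUIRED = 0, OPTIONAL = 1, and REPEATED = 2 come from parquet-format; everything else here (the `schemaElement` struct, `validate`, and the rule that vector_length must be set and positive exactly when the repetition type is VECTOR) is an assumption for illustration, not generated Thrift code or a settled part of the proposal.

```go
package main

import (
	"errors"
	"fmt"
)

// FieldRepetitionType mirrors the parquet-format enum, plus the
// experimental VECTOR value proposed in this issue.
type FieldRepetitionType int32

const (
	Required FieldRepetitionType = 0
	Optional FieldRepetitionType = 1
	Repeated FieldRepetitionType = 2
	Vector   FieldRepetitionType = 3 // experimental, not in apache/parquet-format
)

// schemaElement is a hypothetical, trimmed-down stand-in for SchemaElement.
type schemaElement struct {
	Name           string
	RepetitionType FieldRepetitionType
	VectorLength   *int32 // proposed field id 11; nil when unset
}

// validate applies an assumed consistency rule: vector_length is present
// and positive if and only if the repetition type is VECTOR.
func validate(e schemaElement) error {
	isVector := e.RepetitionType == Vector
	switch {
	case isVector && e.VectorLength == nil:
		return errors.New("VECTOR column is missing vector_length")
	case isVector && *e.VectorLength <= 0:
		return errors.New("vector_length must be positive")
	case !isVector && e.VectorLength != nil:
		return errors.New("vector_length set on a non-VECTOR column")
	}
	return nil
}

func main() {
	n := int32(768)
	good := schemaElement{Name: "embedding", RepetitionType: Vector, VectorLength: &n}
	bad := schemaElement{Name: "embedding", RepetitionType: Vector}

	fmt.Println(validate(good) == nil) // true
	fmt.Println(validate(bad) == nil)  // false
}
```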

Caveat

VECTOR is not part of apache/parquet-format yet, so this is strictly opt-in and non-portable: files written with VECTOR are rejected by readers that don't understand the repetition type.
