Skip to content

Add TimestampWithOffset canonical extension type#558

Open
serramatutu wants to merge 39 commits into
apache:mainfrom
serramatutu:serramatutu/TimestampWithOffset/go
Open

Add TimestampWithOffset canonical extension type#558
serramatutu wants to merge 39 commits into
apache:mainfrom
serramatutu:serramatutu/TimestampWithOffset/go

Conversation

@serramatutu

@serramatutu serramatutu commented Oct 30, 2025

Copy link
Copy Markdown
Contributor

NOTE: This is a WIP that is rebased onto 2 separate changes I have not yet upstreamed.

Which issue does this PR close?

This PR implements the new arrow.timestamp_with_offset canonical extension type for arrow-go.

Rationale for this change

Be compatible with Arrow spec.

What changes are included in this PR?

This commit adds a new TimestampWithOffset extension type. This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. The offset in minutes can be primitive encoded, dictionary encoded or run-end encoded.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes, this is a new canonical extension type.

@felipecrv felipecrv self-requested a review November 1, 2025 01:55
felipecrv added a commit to apache/arrow that referenced this pull request Dec 5, 2025
…48002)

### Rationale for this change

Closes #44248

Arrow has no built-in canonical way of representing the `TIMESTAMP WITH
TIME ZONE` SQL type, which is present across multiple different database
systems. Not having a native way to represent this forces users to
either convert to UTC and drop the time zone, which may have correctness
implications, or use bespoke workarounds. A new
`arrow.timestamp_with_offset` extension type would introduce a standard
canonical way of representing that information.

Rust implementation: apache/arrow-rs#8743
Go implementation: apache/arrow-go#558

[DISCUSS] [thread in the mailing
list](https://lists.apache.org/thread/yhbr3rj9l59yoxv92o2s6dqlop16sfnk).

### What changes are included in this PR?

Proposal and documentation for `arrow.timestamp_with_offset` canonical
extension type.

### Are these changes tested?

N/A

### Are there any user-facing changes?

Yes, this is an extension to the arrow format.

* GitHub Issue: #44248

---------

Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
@serramatutu serramatutu force-pushed the serramatutu/TimestampWithOffset/go branch 3 times, most recently from 95230ad to b9b8bf2 Compare December 22, 2025 14:33
@serramatutu serramatutu changed the title [DRAFT] Add TimestampWithOffset extension type Add TimestampWithOffset extension type Dec 22, 2025
@serramatutu serramatutu marked this pull request as ready for review December 22, 2025 14:42
@serramatutu serramatutu changed the title Add TimestampWithOffset extension type Add TimestampWithOffset canonical extension type Dec 22, 2025
@serramatutu serramatutu force-pushed the serramatutu/TimestampWithOffset/go branch from b9b8bf2 to ccdd288 Compare December 22, 2025 15:17
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment on lines +506 to +507
timestamps.UnsafeAppendBoolToBitmap(false)
offsets.UnsafeAppendBoolToBitmap(false)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, these are non-nullable according to the spec.

@serramatutu serramatutu Apr 15, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did change this but now the IPC roundtrip tests are failing because the internal count of nulls is different between the thing before IPC and after IPC.

To make them pass we'll need to relax the arrow.RecordEqual() to ignore NullN() when the field is not nullable. Alternatively we can change it so that an array builder sets nulls = 0 if the array is not nullable, regardless of what has been set before in calls to UnsafeAppendBoolToBitmap() etc

Related: https://lists.apache.org/thread/7gbqjwykh1ob3xbvwph3ljsdl5c7kxpd

Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset_test.go Outdated
@serramatutu serramatutu force-pushed the serramatutu/TimestampWithOffset/go branch from ccdd288 to cee17a3 Compare January 30, 2026 12:21
@serramatutu serramatutu requested a review from zeroshade February 2, 2026 13:29
Comment thread arrow/array/encoded.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment on lines +151 to +152
*arrow.Int8Type | *arrow.Int16Type | *arrow.Int32Type | *arrow.Int64Type |
*arrow.Uint8Type | *arrow.Uint16Type | *arrow.Uint32Type | *arrow.Uint64Type

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this and the DictIndexType are both identical, instead of duplicating this we should either create a single type constraint or embed one in the other:

type TimestampWithOffsetRunEndsType interface {
     DictIndexType
}

My personal preference would be to have a single type though, but I'm not averse to having the two separate ones if necessary.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: 231cb1e

@serramatutu serramatutu Apr 15, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I had to revert. Spec says those types are different:

Dict index is either signed or unsigned int of 8-64 bits

The index type of a Dictionary type can only be an integer type, preferably signed, with width 8 to 64 bits.

Run ends is a signed int of 16-64 bits

The run end type of a Run-End Encoded type can only be a signed integer type with width 16 to 64 bits.

https://arrow.apache.org/docs/format/Columnar.html#data-types

Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset_test.go Outdated
@serramatutu serramatutu force-pushed the serramatutu/TimestampWithOffset/go branch 3 times, most recently from bc577e9 to 8a7d6d7 Compare April 15, 2026 11:05
@serramatutu serramatutu force-pushed the serramatutu/TimestampWithOffset/go branch 2 times, most recently from f060911 to 9f931fb Compare April 30, 2026 11:33
Mottl pushed a commit to Mottl/arrow that referenced this pull request May 26, 2026
…type (apache#48002)

### Rationale for this change

Closes apache#44248

Arrow has no built-in canonical way of representing the `TIMESTAMP WITH
TIME ZONE` SQL type, which is present across multiple different database
systems. Not having a native way to represent this forces users to
either convert to UTC and drop the time zone, which may have correctness
implications, or use bespoke workarounds. A new
`arrow.timestamp_with_offset` extension type would introduce a standard
canonical way of representing that information.

Rust implementation: apache/arrow-rs#8743
Go implementation: apache/arrow-go#558

[DISCUSS] [thread in the mailing
list](https://lists.apache.org/thread/yhbr3rj9l59yoxv92o2s6dqlop16sfnk).

### What changes are included in this PR?

Proposal and documentation for `arrow.timestamp_with_offset` canonical
extension type.

### Are these changes tested?

N/A

### Are there any user-facing changes?

Yes, this is an extension to the arrow format.

* GitHub Issue: apache#44248

---------

Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
This commit separates the actual implementation from the public `*Equal`
functions. Now, all the public API does is convert the `opts
...EqualOption` into an `opt equalOption` struct and pass it into the
implementation.

This is useful to avoid having to reconstruct back and forth between the
two in nested call stacks. All the implementations care about is
`equalOption`, and `EqualOption` remains as a convenient thing only for
the public API.
This allows for 2 things:
1. Users can now explicitly pass into `EqualOption` if they want the
   comparison functions to compare the nullable buffer or not.
2. Struct and record comparison can change the value of the `nullable`
   option depending on `innerField.Nullable`, making the comparison
   semantically accurate.
Modified `StructBuilder` and `RecordBuilder` to append the empty default
value (usually zero) to the inner field if it's marked as nullable and
the consumed value is null.
The array implementations need a way of knowing whether to ignore the
valids buffer or not. By default, it shouldn't ignore it if the array is
being serialized by itself, like with `MarshalJSON()`. However, if the
array is a part of a `Field` in a `RecordBatch` or `Struct`, then the
value of the valids buffer might need to be ignored. This will be
implemented in the next commit.
This commit fixes some tests that were implicitly setting `valid=false`
on non-nullable fields, which now causes legitimate test failures when
roundtripping to JSON.
This commit adds a new `TimestampWithOffset` extension type that can be
used to represent timestamps with per-row timezone information. It
stores information in a `struct` with 2 fields, `timestamp=[T, "UTC"]`,
where `T` can be any `arrow.TimeUnit` and `offset_minutes=int16`, which
represents the offset in minutes from the UTC timestamp.
@zeroshade zeroshade force-pushed the serramatutu/TimestampWithOffset/go branch from 9f931fb to 49da897 Compare July 2, 2026 17:58
record.go/struct.go: call UseNumber() on the per-field sub-decoder in the nullable decode path so large int64 keeps full precision (regression against apache#816, surfaced by the rebase). json_reader_test.go: pass the nullable flag to the two-arg GetOneForMarshal. internal/json/json_stdlib.go: add IsNullMessage to the tinygo/stdlib json variant (it was only added to the goccy build), fixing the TinyGo 'undefined: json.IsNullMessage' example build.
@zeroshade zeroshade force-pushed the serramatutu/TimestampWithOffset/go branch from 49da897 to b6f73a4 Compare July 2, 2026 18:11
zeroshade added 3 commits July 2, 2026 15:34
A non-nullable list/union/struct can legitimately hold nullable children, so equality and JSON serialization must honor each child/element field's own nullability rather than propagating the parent's. Previously the parent nullability was pushed down, so null child slots were compared and serialized by their arbitrary underlying bytes -- which round-trips inconsistently and breaks cross-language integration (union, nested_large_offsets) even when values are logically equal.

compare.go: unions and run-end-encoded arrays have no top-level validity bitmap, so skip their top-level null-count/validity checks (and the all-null shortcut); recurse into list/struct/map/union children with the child field's nullability. union.go: SparseUnion/DenseUnion GetOneForMarshal use the selected child field's nullability and emit [typeID, null] for a null child so it round-trips as null.
Follow-up to the field-local nullability change (addresses review of 79cf180): the exact Equal path and the chunked/table helpers still used parent/top-level nullability. arrayEqualList/LargeList/ListView/LargeListView and arrayEqualFixedSizeList now recurse into elements with the element field's nullability (arrayEqualStruct was already field-local; arrayEqualMap delegates to arrayEqualList). chunkedEqual and chunkedApproxEqual now skip the top-level NullN check for unions and run-end-encoded arrays via hasTopLevelValidityBitmap, so Table comparisons don't reject logically-equal union/REE columns.
Addresses review of 6f92ce2: chunkedEqual called exported SliceEqual, which rebuilds default options and drops the caller's nullable setting, so TableEqual/ChunkedEqual with WithNullable(false) re-compared each chunk slice as nullable. Call the private sliceEqual(..., opt) instead, matching chunkedApproxEqual.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces the canonical extension type arrow.timestamp_with_offset to match the Arrow spec, and updates JSON marshaling/unmarshaling plus array/record equality behavior to better respect schema nullability (especially for nested types and IPC-vs-JSON roundtrips).

Changes:

  • Add TimestampWithOffsetType, TimestampWithOffsetArray, and a convenience builder supporting primitive/dictionary/REE encodings for per-row timezone offsets.
  • Update JSON marshaling (GetOneForMarshal) to take schema nullability into account, and adjust struct/record JSON unmarshaling to handle explicit null for required fields.
  • Update array/record/table equality logic to optionally ignore validity bitmaps when fields are considered non-nullable, and refresh test fixtures accordingly.

Reviewed changes

Copilot reviewed 43 out of 43 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
internal/json/json.go Add IsNullMessage helper for null detection.
internal/json/json_stdlib.go Stdlib variant of IsNullMessage.
arrow/ipc/cmd/arrow-ls/main_test.go Update expected struct field nullability formatting in output.
arrow/internal/arrjson/arrjson_test.go Update expected JSON schema nullability for struct children.
arrow/internal/arrdata/arrdata.go Update struct test data to mark inner fields nullable and adjust validity masks.
arrow/extensions/variant.go Update GetOneForMarshal signature to include nullability.
arrow/extensions/uuid.go Update GetOneForMarshal signature + call sites.
arrow/extensions/uuid_test.go Mark UUID field nullable in record-builder test.
arrow/extensions/timestamp_with_offset.go New extension type implementation + builder/array logic.
arrow/extensions/timestamp_with_offset_test.go New tests covering primitive/dict/REE encodings, JSON roundtrip, IPC roundtrip.
arrow/extensions/json.go Make JSON extension marshaling honor caller-provided nullability.
arrow/extensions/extensions.go Register TimestampWithOffsetType as a canonical extension type.
arrow/extensions/bool8.go Update GetOneForMarshal signature to include nullability.
arrow/compute/vector_sort_test.go Mark fields nullable in C++ parity tests (to align with new nullability-aware behavior).
arrow/array/util.go Pass schema field nullability into GetOneForMarshal during Record-to-JSON.
arrow/array/util_test.go Exercise record JSON roundtrip under nullable vs non-nullable schemas.
arrow/array/union.go Make union JSON marshaling and approx-equality respect child-field nullability.
arrow/array/timestamp.go Update GetOneForMarshal and equality to respect nullable option.
arrow/array/struct.go Make struct JSON marshaling/unmarshaling nullability-aware; adjust struct equality accordingly.
arrow/array/struct_test.go Add coverage for explicit null in required struct fields.
arrow/array/string.go Update string array marshaling and equality to respect nullable option.
arrow/array/record.go Make record JSON unmarshaling treat explicit null in required fields as empty value.
arrow/array/record_test.go Extend record-builder tests for new null-handling + JSON roundtrip.
arrow/array/numeric_generic.go Update numeric marshaling and equality to respect nullable option.
arrow/array/null.go Update GetOneForMarshal signature.
arrow/array/map.go Thread nullable option through map equality.
arrow/array/list.go Update list marshaling and equality to respect element-field nullability.
arrow/array/json_reader_test.go Update JSON reader test to pass nullable flag when marshaling one value.
arrow/array/interval.go Update interval marshaling and equality to respect nullable option.
arrow/array/float16.go Update GetOneForMarshal signature to include nullability.
arrow/array/fixedsize_binary.go Update fixed-size-binary marshaling and equality to respect nullable option.
arrow/array/fixed_size_list.go Update fixed-size-list marshaling and equality to respect element-field nullability.
arrow/array/extension.go Thread nullable option through extension array equality and marshaling.
arrow/array/encoded.go Update run-end-encoded marshaling and equality to thread nullable option.
arrow/array/dictionary.go Update dictionary marshaling and equality to thread nullable option.
arrow/array/decimal256_test.go Update test call sites for new GetOneForMarshal signature.
arrow/array/decimal128_test.go Update test call sites for new GetOneForMarshal signature.
arrow/array/decimal.go Update decimal marshaling and equality to respect nullable option.
arrow/array/compare.go Add nullable-aware equality plumbing and new APIs; thread schema nullability into comparisons.
arrow/array/compare_test.go Add tests for WithNullable(false) behavior.
arrow/array/boolean.go Update boolean marshaling and equality to respect nullable option.
arrow/array/binary.go Update binary marshaling and equality to respect nullable option.
arrow/array.go Change GetOneForMarshal interface to accept nullability flag.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread arrow/array/compare.go Outdated
Comment thread arrow/array/compare.go Outdated
Comment thread arrow/array/compare.go Outdated
Comment thread arrow/array/compare.go Outdated
Comment thread arrow/array/record_test.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
zeroshade added 7 commits July 2, 2026 18:57
- array/compare.go: read the right-hand field from right.Schema() (not
  left.Schema()) in recordEqual, recordApproxEqual, tableEqual, and
  tableApproxEqual, so field/schema differences (name, nullability,
  metadata) between the two sides are actually detected instead of a
  field being compared against itself.
- extensions/timestamp_with_offset.go: fix the run-end-encoded offset
  builder in AppendValues. Start a run when none exists yet (e.g. leading
  null rows) instead of continuing a nonexistent run, and only update
  lastOffset when a new run starts. This stops a leading null from
  corrupting the run-ends/values children and stops a null between equal
  offsets from splitting one contiguous run into two. Add a noLastOffset
  sentinel and handle nil valids.
- array/record_test.go: fix a malformed JSON map-entry literal (missing
  comma) in TestRecordBuilder.
- ipc/metadata_test.go: TestUnrecognizedExtensionType now compares against
  the real preserved extension metadata (name arrow.uuid, empty serialized
  value). The previous placeholder only passed because record comparison
  ignored field metadata before the compare.go fix.
- extensions: add regression tests for leading-null AppendValues across all
  offset encodings and for a null between equal offsets keeping a single
  run-end-encoded run.
Address roborev review findings on the run-end-encoded offset builder:

- AppendValues now treats a nil valids slice as all-valid so the parent
  struct and its child builders stay the same length. Previously the
  struct received zero slots while the children received len(values),
  producing an inconsistent extension array.
- Override NewArray/NewExtensionArray to reset lastOffset to noLastOffset
  on finalization, so a reused builder starts a fresh run instead of
  continuing a run that belonged to the array just finalized (which could
  emit run-ends without a matching value on a run-end-encoded offset).
- Add regression tests for nil validity and builder reuse across all
  offset encodings.
Follow-up to the nil-validity fix: an empty (len 0) valids slice is also
all-valid, matching the convention of the other Arrow builders.
AppendValues now normalizes both nil and empty validity, and panics on a
genuine length mismatch (a non-empty valids whose length differs from
values).

Extend the validity regression test to cover the empty slice and add a
test asserting the length-mismatch panic.
Follow-up to the run-end-encoding fixes: AppendNull (and therefore the
UnmarshalOne(null) and AppendValueFromString(null) paths that call it)
appends a null offset run via the embedded struct builder but left
lastOffset stale. A subsequent value repeating the pre-null offset would
then continue the null run instead of starting its own, encoding that
value with the null offset run.

Override AppendNull/AppendNulls to reset lastOffset to noLastOffset, and
add a regression test for Append(value), AppendNull(), Append(same-offset
value) with run-end-encoded offsets.
The right-schema comparison fix routed RecordEqual/RecordApproxEqual/
TableEqual/TableApproxEqual through Field.Equal, which also compares field
and type metadata. That rejected logically-equal records differing only in
metadata -- for example a PARQUET:field_id attached during a parquet
round-trip -- breaking parquet/pqarrow tests such as TestForceLargeTypes.

Introduce fieldEqualIgnoringMetadata (name + nullability + metadata-
insensitive TypeEqual) and use it in the four comparison functions. It still
detects the nullability differences the schema check was added for, while
restoring the metadata-insensitive behavior these APIs had before. This makes
the earlier metadata_test.go expected-metadata tweak unnecessary, so revert it.
- compare.go: drop the arrow.TypeEqual call from the schema-field check.
  TypeEqual is not metadata-insensitive for large-list / list-view element
  types (they fall through to reflect.DeepEqual or Field.Equal), so relying
  on it could still reject records that differ only in child-field metadata.
  Field type and values are already validated per column by baseArrayEqual,
  so the schema-field check now compares only name and nullability (renamed
  schemaFieldEqual).
- ipc/metadata_test.go: use the real preserved extension metadata
  (arrow.uuid, empty serialized value) and assert on the read-back field
  metadata explicitly, since record comparison no longer inspects field
  metadata.
GetOneForMarshal previously grew a nullable bool parameter on the public
arrow.Array interface, breaking external callers and custom Array
implementations. Restore the single-argument GetOneForMarshal(i int) and
move the field-local nullability behavior to a new optional exported
interface, arrow.NullableMarshaler, with GetOneForMarshalNullable(i int,
nullable bool).

Containers (struct, sparse/dense union, dictionary, run-end-encoded,
extension) and RecordToJSON propagate each field's Nullable flag through
an unexported getOneForMarshalNullable helper that type-asserts to
NullableMarshaler and falls back to GetOneForMarshal for arrays that do
not implement it, preserving the previous validity-bitmap behavior.

ExtensionArrayBase deliberately does not implement NullableMarshaler so
that the optional interface is never promoted onto embedding extension
arrays and cannot bypass a concrete type's GetOneForMarshal override. The
helper instead handles the one field-local case that matters for plain
extension arrays: a null slot in a non-nullable field serializes the
storage value rather than JSON null, while still honoring any custom
GetOneForMarshal that returns a non-null logical value.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants