More work supporting objects larger than 4GB on Windows#2137
Conversation
2491da9 to
f3aeae9
Compare
|
/submit |
|
Submitted as pull.2137.git.1780570272.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
| @@ -3232,7 +3232,7 @@ static int apply_binary_fragment(struct apply_state *state, | |||
| struct patch *patch) | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Thu, Jun 04, 2026 at 10:51:07AM +0000, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> `patch_delta()` takes the source and delta sizes by value and writes
> back the reconstructed target size through an `unsigned long *`. That
> datatype cannot represent a value that exceeds 4 GiB on systems where
> `unsigned long` is 32-bit (notably 64-bit Windows builds), though, even
> though the delta encoding itself, the on-disk layout, and the in-memory
> buffers happily carry such sizes. A `size_t` companion to
> `get_delta_hdr_size()`, `get_delta_hdr_size_sz()`, was introduced in
> 17fa077596 (delta, packfile: use size_t for delta header sizes,
> 2026-05-08) precisely so that `patch_delta()` could be widened without
> changing the on-the-wire decoding helper's signature.
>
> Widen `patch_delta()`'s three size parameters to `size_t` and switch
> its internal use of `get_delta_hdr_size()` to the `_sz` variant.
> Then propagate the wider type through the callers.
Does `get_delta_hdr_size()` have any remaining callers after this patch
series? I currently only spot two such callers, and you convert both of
them in this patch.
And can we reasonably add a test case that exercises this change?
> diff --git a/packfile.c b/packfile.c
> index 89366abfe3..e202f48837 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -1964,10 +1964,8 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
> (uintmax_t)curpos, p->pack_name);
> data = NULL;
> } else {
> - unsigned long sz;
> data = patch_delta(base, base_size, delta_data,
> - delta_size, &sz);
> - size = sz;
> + delta_size, &size);
Nice that we get rid of this awkward construct.
PatrickThere was a problem hiding this comment.
Johannes Schindelin wrote on the Git mailing list (how to reply to this email):
Hi Patrick,
On Mon, 15 Jun 2026, Patrick Steinhardt wrote:
> On Thu, Jun 04, 2026 at 10:51:07AM +0000, Johannes Schindelin via GitGitGadget wrote:
> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> >
> > `patch_delta()` takes the source and delta sizes by value and writes
> > back the reconstructed target size through an `unsigned long *`. That
> > datatype cannot represent a value that exceeds 4 GiB on systems where
> > `unsigned long` is 32-bit (notably 64-bit Windows builds), though, even
> > though the delta encoding itself, the on-disk layout, and the in-memory
> > buffers happily carry such sizes. A `size_t` companion to
> > `get_delta_hdr_size()`, `get_delta_hdr_size_sz()`, was introduced in
> > 17fa077596 (delta, packfile: use size_t for delta header sizes,
> > 2026-05-08) precisely so that `patch_delta()` could be widened without
> > changing the on-the-wire decoding helper's signature.
> >
> > Widen `patch_delta()`'s three size parameters to `size_t` and switch
> > its internal use of `get_delta_hdr_size()` to the `_sz` variant.
> > Then propagate the wider type through the callers.
>
> Does `get_delta_hdr_size()` have any remaining callers after this patch
> series? I currently only spot two such callers, and you convert both of
> them in this patch.
As you noticed later on in the review: No, there are no such callers left,
and the `_sz` variant gets renamed, concluding the incremental migration
of that function from `unsigned long` to `size_t`.
> And can we reasonably add a test case that exercises this change?
Not reasonably, no. This would require constructing another artificial
_large_ object, this time with an unpacked Git object with a size >=4GB
that needs to be transmogrified into a different object.
Better leave the verification of this patch to static analysis (GCC or
Clang have become quite good at spotting things like this; Coverity would
be, too, if it ever comes back up from its "upgrades to the Scan servers",
https://web.archive.org/web/20260516152422/https://scan.coverity.com/
seems to be the start date of this update).
>
> > diff --git a/packfile.c b/packfile.c
> > index 89366abfe3..e202f48837 100644
> > --- a/packfile.c
> > +++ b/packfile.c
> > @@ -1964,10 +1964,8 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
> > (uintmax_t)curpos, p->pack_name);
> > data = NULL;
> > } else {
> > - unsigned long sz;
> > data = patch_delta(base, base_size, delta_data,
> > - delta_size, &sz);
> > - size = sz;
> > + delta_size, &size);
>
> Nice that we get rid of this awkward construct.
Awkward, but necessary to allow for an incremental, reviewable conversion
;-)
Ciao,
Johannes|
User |
| @@ -453,7 +453,7 @@ static int check_pack_inflate(struct packed_git *p, | |||
| struct pack_window **w_curs, | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Thu, Jun 04, 2026 at 10:51:08AM +0000, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> `write_reuse_object()` learned to track its packed-object size as
> `size_t` in 606c192380 (odb, packfile: use size_t for streaming
> object sizes, 2026-05-08), but the comparison sink it feeds,
> `check_pack_inflate()`, still takes the expected decompressed size
> as `unsigned long`. The call site bridges the mismatch with
> `cast_size_t_to_ulong()`, which on Windows turns a >4 GiB object
> into an immediate die().
>
> That function only uses `expect` once: as the right-hand side of a
> `stream.total_out == expect` equality test against zlib's counter.
> zlib's own `total_out` counter is `uLong` and is therefore still
> 32-bit-bound on Windows. Widening `expect` to `size_t` cannot fix that,
> but it is a strict improvement nonetheless: instead of dying outright,
> an oversized object now simply makes the equality fail and lets
> `write_reuse_object()` fall back to `write_no_reuse_object()`, which
> decompresses and re-deflates the content (and which the larger
> pack-objects widening series targets separately).
Hm. I wonder whether it's possible to reset `stream.total_out` on every
iteration and instead have a local `size_t` variable that we use to
track the total number of inflated bytes?
PatrickThere was a problem hiding this comment.
Johannes Schindelin wrote on the Git mailing list (how to reply to this email):
Hi Patrick,
On Mon, 15 Jun 2026, Patrick Steinhardt wrote:
> On Thu, Jun 04, 2026 at 10:51:08AM +0000, Johannes Schindelin via GitGitGadget wrote:
> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> >
> > `write_reuse_object()` learned to track its packed-object size as
> > `size_t` in 606c192380 (odb, packfile: use size_t for streaming
> > object sizes, 2026-05-08), but the comparison sink it feeds,
> > `check_pack_inflate()`, still takes the expected decompressed size
> > as `unsigned long`. The call site bridges the mismatch with
> > `cast_size_t_to_ulong()`, which on Windows turns a >4 GiB object
> > into an immediate die().
> >
> > That function only uses `expect` once: as the right-hand side of a
> > `stream.total_out == expect` equality test against zlib's counter.
> > zlib's own `total_out` counter is `uLong` and is therefore still
> > 32-bit-bound on Windows. Widening `expect` to `size_t` cannot fix that,
> > but it is a strict improvement nonetheless: instead of dying outright,
> > an oversized object now simply makes the equality fail and lets
> > `write_reuse_object()` fall back to `write_no_reuse_object()`, which
> > decompresses and re-deflates the content (and which the larger
> > pack-objects widening series targets separately).
>
> Hm. I wonder whether it's possible to reset `stream.total_out` on every
> iteration and instead have a local `size_t` variable that we use to
> track the total number of inflated bytes?
Possible? Yes. Appropriate? Unlikely. We would now pretend to have
inflated less bytes, _just_ to appease a data type limitation that we
already worked around in d05d666977 (git-zlib: handle data streams larger
than 4GB, 2026-05-08).
Ciao,
Johannes| @@ -1239,6 +1239,8 @@ static void *gfi_unpack_entry( | |||
| unsigned long *sizep) | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Thu, Jun 04, 2026 at 10:51:09AM +0000, Johannes Schindelin via GitGitGadget wrote:
> diff --git a/builtin/fast-import.c b/builtin/fast-import.c
> index 82bc6dcc00..3dff898c43 100644
> --- a/builtin/fast-import.c
> +++ b/builtin/fast-import.c
> @@ -1239,6 +1239,8 @@ static void *gfi_unpack_entry(
> unsigned long *sizep)
> {
> enum object_type type;
> + size_t size_st = 0;
> + void *data;
> struct packed_git *p = all_packs[oe->pack_id];
> if (p == pack_data && p->pack_size < (pack_size + the_hash_algo->rawsz)) {
> /* The object is stored in the packfile we are writing to
> @@ -1260,7 +1262,10 @@ static void *gfi_unpack_entry(
> */
> p->pack_size = pack_size + the_hash_algo->rawsz;
> }
> - return unpack_entry(the_repository, p, oe->idx.offset, &type, sizep);
> + data = unpack_entry(the_repository, p, oe->idx.offset, &type, &size_st);
> + if (sizep)
> + *sizep = cast_size_t_to_ulong(size_st);
> + return data;
> }
Nit, please feel free to ignore: do we want to add a NEEDSWORK comment
here?
PatrickThere was a problem hiding this comment.
Johannes Schindelin wrote on the Git mailing list (how to reply to this email):
Hi Patrick,
On Mon, 15 Jun 2026, Patrick Steinhardt wrote:
> On Thu, Jun 04, 2026 at 10:51:09AM +0000, Johannes Schindelin via GitGitGadget wrote:
> > diff --git a/builtin/fast-import.c b/builtin/fast-import.c
> > index 82bc6dcc00..3dff898c43 100644
> > --- a/builtin/fast-import.c
> > +++ b/builtin/fast-import.c
> > @@ -1239,6 +1239,8 @@ static void *gfi_unpack_entry(
> > unsigned long *sizep)
> > {
> > enum object_type type;
> > + size_t size_st = 0;
> > + void *data;
> > struct packed_git *p = all_packs[oe->pack_id];
> > if (p == pack_data && p->pack_size < (pack_size + the_hash_algo->rawsz)) {
> > /* The object is stored in the packfile we are writing to
> > @@ -1260,7 +1262,10 @@ static void *gfi_unpack_entry(
> > */
> > p->pack_size = pack_size + the_hash_algo->rawsz;
> > }
> > - return unpack_entry(the_repository, p, oe->idx.offset, &type, sizep);
> > + data = unpack_entry(the_repository, p, oe->idx.offset, &type, &size_st);
> > + if (sizep)
> > + *sizep = cast_size_t_to_ulong(size_st);
> > + return data;
> > }
>
> Nit, please feel free to ignore: do we want to add a NEEDSWORK comment
> here?
Hehe... My mind translates the `cast_size_t_to_ulong()` function to
"NEEDSWORK!" already ;-)
Ciao,
Johannes| @@ -86,11 +86,8 @@ void *patch_delta(const void *src_buf, unsigned long src_size, | |||
| * This must be called twice on the delta data buffer, first to get the | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Thu, Jun 04, 2026 at 10:51:11AM +0000, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> When I started the transition from `unsigned long` to `size_t`, in the
> interest of keeping the patches reviewable, I introduced these calls to
> prevent data type narrowing from silently failing to handle large object
> sizes. I also introduced `*_sz()` variants that would allow most of the
> callers to keep using that `unsigned long` that the 90s kindly asked to
> be returned.
>
> After the preceding commits, the only places that called the narrow
> wrappers either no longer exist or already use the `_sz` form
> internally, so the wrappers just narrow values back through
> `cast_size_t_to_ulong()` for no reason.
>
> Drop them and rename the `_sz` variants back to the natural names.
Aha, so you already address my comment I had on one of the preceding
patches :)
Patrick| @@ -3321,7 +3321,7 @@ static int apply_binary(struct apply_state *state, | |||
| if (odb_has_object(the_repository->objects, &oid, 0)) { | |||
There was a problem hiding this comment.
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):
On Thu, Jun 04, 2026 at 10:51:12AM +0000, Johannes Schindelin via GitGitGadget wrote:
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index fa45f774d7..fa6e396ddc 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -120,7 +120,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
> struct object_id oid;
> enum object_type type;
> char *buf;
> - unsigned long size;
> + size_t size;
> struct object_context obj_context = {0};
> struct object_info oi = OBJECT_INFO_INIT;
> unsigned flags = OBJECT_INFO_LOOKUP_REPLACE;
> @@ -166,7 +166,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
> if (use_mailmap && (type == OBJ_COMMIT || type == OBJ_TAG)) {
> size_t s = size;
> buf = replace_idents_using_mailmap(buf, &s);
> - size = cast_size_t_to_ulong(s);
> + size = s;
> }
>
> printf("%"PRIuMAX"\n", (uintmax_t)size);
Can't we drop this local variable completely and instead supply `&size`
directly?
> @@ -219,7 +225,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
> if (use_mailmap) {
> size_t s = size;
> buf = replace_idents_using_mailmap(buf, &s);
> - size = cast_size_t_to_ulong(s);
> + size = s;
> }
>
> /* otherwise just spit out the data */
> @@ -266,7 +272,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
> if (use_mailmap) {
> size_t s = size;
> buf = replace_idents_using_mailmap(buf, &s);
> - size = cast_size_t_to_ulong(s);
> + size = s;
> }
> break;
> }
> @@ -446,7 +455,7 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
> if (use_mailmap) {
> size_t s = size;
> contents = replace_idents_using_mailmap(contents, &s);
> - size = cast_size_t_to_ulong(s);
> + size = s;
> }
>
> if (type != data->type)
Likewise for these three instances.
> @@ -555,7 +564,7 @@ static void batch_object_write(const char *obj_name,
> if (!buf)
> die(_("unable to read %s"), oid_to_hex(&data->oid));
> buf = replace_idents_using_mailmap(buf, &s);
> - data->size = cast_size_t_to_ulong(s);
> + data->size = s;
>
> free(buf);
> }
And I think this site here can be adapted, as well.
> diff --git a/diff.c b/diff.c
> index 5a584fa1d5..816b89dc6c 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -4594,8 +4594,9 @@ int diff_populate_filespec(struct repository *r,
> }
> }
> else {
> + size_t size_st = 0;
> struct object_info info = {
> - .sizep = &s->size
> + .sizep = &size_st
> };
>
> if (!(size_only || check_binary))
> @@ -4617,6 +4618,7 @@ int diff_populate_filespec(struct repository *r,
> die("unable to read %s", oid_to_hex(&s->oid));
>
> object_read:
> + s->size = cast_size_t_to_ulong(size_st);
> if (size_only || check_binary) {
> if (size_only)
> return 0;
> @@ -4631,6 +4633,7 @@ object_read:
> if (odb_read_object_info_extended(r->objects, &s->oid, &info,
> OBJECT_INFO_LOOKUP_REPLACE))
> die("unable to read %s", oid_to_hex(&s->oid));
> + s->size = cast_size_t_to_ulong(size_st);
> }
> s->should_free = 1;
> }
The flow in this function is quite weird if you ask me, but that's a
preexisting issue. This does look correct to me, even if it's awkward.
PatrickThere was a problem hiding this comment.
Johannes Schindelin wrote on the Git mailing list (how to reply to this email):
Hi Patrick,
On Mon, 15 Jun 2026, Patrick Steinhardt wrote:
> On Thu, Jun 04, 2026 at 10:51:12AM +0000, Johannes Schindelin via GitGitGadget wrote:
> > diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> > index fa45f774d7..fa6e396ddc 100644
> > --- a/builtin/cat-file.c
> > +++ b/builtin/cat-file.c
> > @@ -120,7 +120,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
> > struct object_id oid;
> > enum object_type type;
> > char *buf;
> > - unsigned long size;
> > + size_t size;
> > struct object_context obj_context = {0};
> > struct object_info oi = OBJECT_INFO_INIT;
> > unsigned flags = OBJECT_INFO_LOOKUP_REPLACE;
> > @@ -166,7 +166,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
> > if (use_mailmap && (type == OBJ_COMMIT || type == OBJ_TAG)) {
> > size_t s = size;
> > buf = replace_idents_using_mailmap(buf, &s);
> > - size = cast_size_t_to_ulong(s);
> > + size = s;
> > }
> >
> > printf("%"PRIuMAX"\n", (uintmax_t)size);
>
> Can't we drop this local variable completely and instead supply `&size`
> directly?
Well spotted!
> > @@ -219,7 +225,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
> > if (use_mailmap) {
> > size_t s = size;
> > buf = replace_idents_using_mailmap(buf, &s);
> > - size = cast_size_t_to_ulong(s);
> > + size = s;
> > }
> >
> > /* otherwise just spit out the data */
> > @@ -266,7 +272,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name)
> > if (use_mailmap) {
> > size_t s = size;
> > buf = replace_idents_using_mailmap(buf, &s);
> > - size = cast_size_t_to_ulong(s);
> > + size = s;
> > }
> > break;
> > }
> > @@ -446,7 +455,7 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
> > if (use_mailmap) {
> > size_t s = size;
> > contents = replace_idents_using_mailmap(contents, &s);
> > - size = cast_size_t_to_ulong(s);
> > + size = s;
> > }
> >
> > if (type != data->type)
>
> Likewise for these three instances.
I totally agree.
> > @@ -555,7 +564,7 @@ static void batch_object_write(const char *obj_name,
> > if (!buf)
> > die(_("unable to read %s"), oid_to_hex(&data->oid));
> > buf = replace_idents_using_mailmap(buf, &s);
> > - data->size = cast_size_t_to_ulong(s);
> > + data->size = s;
> >
> > free(buf);
> > }
>
> And I think this site here can be adapted, as well.
Indeed!
> > diff --git a/diff.c b/diff.c
> > index 5a584fa1d5..816b89dc6c 100644
> > --- a/diff.c
> > +++ b/diff.c
> > @@ -4594,8 +4594,9 @@ int diff_populate_filespec(struct repository *r,
> > }
> > }
> > else {
> > + size_t size_st = 0;
> > struct object_info info = {
> > - .sizep = &s->size
> > + .sizep = &size_st
> > };
> >
> > if (!(size_only || check_binary))
> > @@ -4617,6 +4618,7 @@ int diff_populate_filespec(struct repository *r,
> > die("unable to read %s", oid_to_hex(&s->oid));
> >
> > object_read:
> > + s->size = cast_size_t_to_ulong(size_st);
> > if (size_only || check_binary) {
> > if (size_only)
> > return 0;
> > @@ -4631,6 +4633,7 @@ object_read:
> > if (odb_read_object_info_extended(r->objects, &s->oid, &info,
> > OBJECT_INFO_LOOKUP_REPLACE))
> > die("unable to read %s", oid_to_hex(&s->oid));
> > + s->size = cast_size_t_to_ulong(size_st);
> > }
> > s->should_free = 1;
> > }
>
> The flow in this function is quite weird if you ask me, but that's a
> preexisting issue. This does look correct to me, even if it's awkward.
Yes, on all four accounts.
Ciao,
JohannesOn Windows, `unsigned long` and `long` are 32 bits even on 64-bit builds. The MSVC compatibility header has shimmed `ftruncate()` with #define ftruncate _chsize ever since `compat/msvc-posix.h` was introduced. `_chsize()` takes a 32-bit `long` for the new length, which silently truncates files (and the requested size) to 2 GiB. That is enough to make t7508 test 126 "git add fails gracefully with 4 GiB and 8 GiB files" fail under MSVC: `test-tool truncate` creates a sparse 4 GiB or 8 GiB file via the shimmed `ftruncate()`, and the test never gets off the ground. `_chsize_s()` is the modern replacement, accepts a 64-bit `__int64` length, and is the only sensible target on Windows. The catch is that it does not follow the POSIX `-1` + `errno` convention: it returns `0` on success and an errno value (a small positive integer) on failure. A plain `#define ftruncate _chsize_s` would therefore silently break callers that test the return value as `< 0` or against `-1`, of which there are several: `http.c`, `parallel-checkout.c`, and `t/helper/test-truncate.c` among them. Introduce a `static inline` wrapper that calls `_chsize_s()`, copies its errno return into `errno`, and translates the result to the familiar `-1` / `0` convention, then point `ftruncate` at the wrapper. Place the wrapper after `#include "mingw-posix.h"` so the `off_t` parameter resolves to the already-widened `off64_t` rather than the 32-bit `_off_t` from `compat/vcbuild/include/unistd.h`. MinGW is unaffected: its `ftruncate()` already takes `off_t` and routes through `ftruncate64()` when `_FILE_OFFSET_BITS=64`, which is the default in our build. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
`patch_delta()` takes the source and delta sizes by value and writes back the reconstructed target size through an `unsigned long *`. That datatype cannot represent a value that exceeds 4 GiB on systems where `unsigned long` is 32-bit (notably 64-bit Windows builds), though, even though the delta encoding itself, the on-disk layout, and the in-memory buffers happily carry such sizes. A `size_t` companion to `get_delta_hdr_size()`, `get_delta_hdr_size_sz()`, was introduced in 17fa077 (delta, packfile: use size_t for delta header sizes, 2026-05-08) precisely so that `patch_delta()` could be widened without changing the on-the-wire decoding helper's signature. Widen `patch_delta()`'s three size parameters to `size_t` and switch its internal use of `get_delta_hdr_size()` to the `_sz` variant. Then propagate the wider type through the callers. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
`write_reuse_object()` learned to track its packed-object size as `size_t` in 606c192 (odb, packfile: use size_t for streaming object sizes, 2026-05-08), but the comparison sink it feeds, `check_pack_inflate()`, still takes the expected decompressed size as `unsigned long`. The call site bridges the mismatch with `cast_size_t_to_ulong()`, which on Windows turns a >4 GiB object into an immediate die(). That function only uses `expect` once: as the right-hand side of a `stream.total_out == expect` equality test against zlib's counter. zlib's own `total_out` counter is `uLong` and is therefore still 32-bit-bound on Windows. Widening `expect` to `size_t` cannot fix that, but it is a strict improvement nonetheless: instead of dying outright, an oversized object now simply makes the equality fail and lets `write_reuse_object()` fall back to `write_no_reuse_object()`, which decompresses and re-deflates the content (and which the larger pack-objects widening series targets separately). Drop the `cast_size_t_to_ulong()` shim at the call site now that the receiving parameter speaks the same type as `entry_size`. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The topic `js/objects-larger-than-4gb-on-windows` widened the streaming, index-pack and unpack-objects paths to `size_t` but deliberately stopped at the in-memory `unpack_entry()` cascade, which still hands back the unpacked size through `unsigned long *`. On Windows that boundary truncates above 4 GiB because that data type is only 32 bits wide on that platform. Widen the code path. Except `packed_object_info_with_index_pos()`: It cannot yet pass `oi->sizep` directly because the field is still `unsigned long *`; bridge it with a `size_t` temporary that narrows back, and let a later commit drop the bridge once the field is wide too. `gfi_unpack_entry()` keeps its narrow signature because fast-import tracks sizes through `unsigned long` everywhere it crosses subsystem boundaries, keeping its signature allows the scope of this commit to be somewhat reasonable, still. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
`pack-objects` stores per-entry object sizes in either the 31-bit `size_` member of the `struct object_entry` or, when the value does not fit, the `pack->delta_size[]` spill array. The accessors (`oe_size`, `oe_delta_size`, `oe_get_size_slow`, `oe_size_*_than`) and the setters (`oe_set_size`, `oe_set_delta_size`) used `unsigned long` for the spill type, which on Windows means the spill silently caps at 4 GiB per entry. That is what made `upload-pack` die with "object too large to read on this platform" when serving the >4 GiB blob in `t5608` tests 5 and 6 when run with `GIT_TEST_CLONE_2GB`. Widen them all to `size_t` (including `pack->delta_size`) and drop the three `cast_size_t_to_ulong()` calls in `check_object()` that guarded `in_pack_size`. The two `SET_SIZE(entry, canonical_size)` calls in the same function stay cast-free as before, since `canonical_size` is still `unsigned long` until a later commit widens `object_info::sizep`. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
When I started the transition from `unsigned long` to `size_t`, in the interest of keeping the patches reviewable, I introduced these calls to prevent data type narrowing from silently failing to handle large object sizes. I also introduced `*_sz()` variants that would allow most of the callers to keep using that `unsigned long` that the 90s kindly asked to be returned. After the preceding commits, the only places that called the narrow wrappers either no longer exist or already use the `_sz` form internally, so the wrappers just narrow values back through `cast_size_t_to_ulong()` for no reason. Drop them and rename the `_sz` variants back to the natural names. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
When `js/objects-larger-than-4gb-on-windows` widened the streaming, index-pack and unpack-objects code paths, in the interest of keeping the patches somewhat reasonably-sized, it left the public ODB API still typed in `unsigned long`. In particular `struct object_info::sizep` and the four wrappers built on top of it (`odb_read_object`, `odb_read_object_peeled`, `odb_read_object_info`, `odb_pretend_object`) still return the unpacked size through `unsigned long *`, so on Windows `cat-file -s` and the `git add` / `git status` paths for a >4 GiB blob silently cap at 4 GiB. Widen the field and the four wrappers. The previous commits already widened the `unpack_entry()` cascade and pack-objects' in-core size accessors, so most of the cascade arrives here with no further work: the temporary shims in `packed_object_info_with_index_pos()` and in `unpack_entry()`'s delta-base recovery path go away, the two `SET_SIZE(entry, cast_size_t_to_ulong(canonical_size))` calls in `check_object()` and the matching one in `drop_reused_delta()` collapse to plain `SET_SIZE`, and `oe_get_size_slow()`'s tail `cast_size_t_to_ulong()` is gone too. What remains narrow are the boundaries this series does not intend to touch: the diff, blame, textconv and fast-import machinery. Even so, this patch is unfortunately quite large. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
f3aeae9 to
37d030d
Compare
|
/submit |
|
Submitted as pull.2137.v2.git.1781524349.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
This patch series tries to address the problems pointed out by the expensive tests that now run in CI: t5608 and t7508 verify various aspects about objects larger than 4GB, which Git does not currently handle correctly when run on a platform where
size_tis 64-bit andunsigned longis 32-bit.Changes vs v1:
master, which mergedps/odb-source-loose(with which these patches previously conflicted rather badly).size_t svariables (thanks, Patrick!).Cc: Signed-off-by: Kristofer Karlsson krka@spotify.com
cc: Patrick Steinhardt ps@pks.im