Commit Graph

751 Commits

Author SHA1 Message Date
Johannes Schindelin
624ede0fb4 pack-objects: drop the last size shim in write_no_reuse_object()
Continue the size_t evacuation that this series and the merged
js/objects-larger-than-4gb-on-windows topic are advancing for
>4 GiB objects on Windows: with the odb readers and the zlib
helpers reached from do_compress() now widened end-to-end, the
last cast_size_t_to_ulong() shim in this function can be removed,
and do_compress() itself can carry the new size type through.

Two cast_size_t_to_ulong() shims remain in this file; they feed
the tree-walk API, which is still narrow and is a separate
widening topic.

write_no_reuse_object()'s return type and the hashfile API are
still narrow but unchanged in observable behaviour: on 64-bit
Linux ulong coincides with size_t, and on Windows these were the
narrow fenceposts the prior topics deliberately left in place.
Their widening is left to follow-ups touching the hashfile API
and the write_object() caller chain.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:55 +00:00
Johannes Schindelin
4a47d0787c pack-objects: drop cast_size_t_to_ulong shims in try_delta()
Companion to the prior get_delta() cleanup, and the last try_delta()
piece of the >4 GiB delta-path topic. Every consumer that the
function's locals fed has now been widened: SIZE() / DELTA_SIZE() to
size_t (prior topic), the mem_usage out-parameter and delta_cacheable()
earlier in this series, and create_delta() / create_delta_index() in
the immediately preceding commits.

Widen the declaration of trg_size, src_size, sizediff, max_size and
sz to size_t (delta_size joins them on the same line, removing the
size_t delta_size line that the create_delta() widening commit added
as a stop-gap), and drop the two sz_st bridge variables together with
the surrounding cast_size_t_to_ulong() calls. The result is just
"odb_read_object(&sz)" on both reads.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:55 +00:00
Johannes Schindelin
da17ddcf28 pack-objects: drop cast_size_t_to_ulong shims in get_delta()
The two shims that 606c192380 (odb, packfile: use size_t for
streaming object sizes, 2026-05-08) and the subsequent
odb_read_object() widening introduced as scaffolding around
get_delta()'s reads can now disappear: the previous commit widened
diff_delta() to size_t, which was the last narrow consumer in this
function.

Widen size and base_size to size_t outright, drop the size_st /
base_size_st bridging temporaries, and drop the two
cast_size_t_to_ulong() calls. Net change is 4 lines smaller and one
read-then-cast indirection gone from each odb read.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:55 +00:00
Johannes Schindelin
9a9bc5c853 Merge branch 'size-t/tree' 2026-06-26 08:57:54 +00:00
Johannes Schindelin
824c02f12f pack-objects: drop the two tree-walk casts in the preferred-base path
With init_tree_desc() widened in the prior commit, the
size_t-returning odb_read_object_peeled() call in
add_preferred_base() and odb_read_object() call in pbase_tree_get()
can both flow straight through to init_tree_desc() and into the
pbase_tree_cache. Widen pbase_tree_cache.tree_size and the two
local size variables to size_t, drop the size_st bridges, and drop
the two cast_size_t_to_ulong() shims.

This was the last pair of cast_size_t_to_ulong() call sites in
builtin/pack-objects.c, completing the >4 GiB-objects work in that
file that this branch and its predecessors have been pursuing.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Johannes Schindelin
1d0e6f9746 packfile, git-zlib: widen use_pack() and zstream avail fields to size_t
Bundling the two widenings: four call sites pass &stream.avail_in
directly to use_pack(), and widening either type fencepost alone
would force a bridge variable at each. Doing both together is the
simpler end state and is the prerequisite for the do_compress()
widening in the next commit, which is what lets
write_no_reuse_object() lose its last cast_size_t_to_ulong() shim.

The unsigned-long locals widened at the other use_pack() callers
(avail / remaining / left) hold pack-window sizes bounded by
core.packedGitWindowSize, so the change is type consistency rather
than a new >4GB capability. git_zstream.avail_in / avail_out
likewise reach zlib's uInt fields only after zlib_buf_cap()'s 1 GiB
cap, so the wrapper already accepted size_t-shaped inputs in
practice.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Johannes Schindelin
2f8a320527 delta: widen create_delta() and diff_delta() to size_t
Last stop in the delta-encoding API widening for >4 GiB blobs on
Windows: with create_delta_index() done in the prior commit and
create_delta()/diff_delta() finished here, every byte count that
crosses delta.h is now size_t. The struct fields they store into
have been size_t since the diff-delta struct widening.

The API change must move with all callers in the same commit (the
build only passes when every &delta_size matches the new size_t*).
Caller updates are kept minimal:

  * builtin/pack-objects.c get_delta() and try_delta(): widen only
    the local delta_size variable; the surrounding unsigned-long
    locals and their cast_size_t_to_ulong() shims are out of scope
    here and will be cleaned up in their own commits.

  * builtin/fast-import.c, diff.c, t/helper/test-pack-deltas.c:
    keep the local unsigned-long delta size (each feeds a still-
    unsigned-long downstream consumer: zlib's avail_in,
    deflate_it(), the test helper's own do_compress()), and bridge
    via a temporary size_t plus cast_size_t_to_ulong(). The new
    casts are paid back in later topics that widen those consumers.

  * t/helper/test-delta.c: widen the local outright (no downstream
    consumer beyond the test's own out_size, which is already
    size_t).

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Johannes Schindelin
f9a6df6aba pack-objects: widen mem_usage and try_delta out-param to size_t
The pair must move together because find_deltas() passes &mem_usage
to try_delta(): widening either alone breaks the type match.

mem_usage accumulates per-object byte counts already computed in
size_t (SIZE() and sizeof_delta_index() reach here through
free_unpacked(), now size_t), and was the last 32-bit-on-Windows
narrowing point in the delta-window memory accounting chain. With
this commit, that chain is internally size_t end-to-end except for
sizeof_delta_index()'s still-narrow return, whose value is bounded
by create_delta_index()'s entries cap.

window_memory_limit (config-driven via git_config_ulong()) stays
unsigned long: it is only compared against mem_usage and promotes.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Johannes Schindelin
819d8a1cfd pack-objects: widen free_unpacked() return to size_t
free_unpacked() sums two byte counts: sizeof_delta_index() and
SIZE(n->entry). The latter has been size_t since the prior topic
"More work supporting objects larger than 4GB on Windows" widened
SIZE() / oe_size() to size_t, so accumulating it into an unsigned
long return was a silent Windows-only truncation on a packing run
with many large objects.

The sole caller (find_deltas()) holds its own mem_usage in an
unsigned long for now and subtracts the return into it, so the new
narrowing happens at that subtraction. find_deltas() and the
matching try_delta() out-parameter are widened next.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Johannes Schindelin
848acf183c pack-objects: widen delta-cache accounting to size_t
These three are a single accounting tuple (the globals tracking
cumulative cached-delta bytes, plus the helper that compares them
against an incoming delta size) and are latently 32-bit on Windows
where unsigned long != size_t: a pack with many large cached deltas
could wrap silently.

The widening is internally consistent on its own: the additions and
subtractions against delta_cache_size already come from size_t
sources (DELTA_SIZE() returns size_t), and delta_cacheable()'s sole
caller in try_delta() still passes unsigned long, which promotes.

Prerequisite for dropping try_delta()'s cast_size_t_to_ulong()
shims, which becomes possible once create_delta() and diff_delta()
are widened in a later commit.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Junio C Hamano
2fb57b8177 Merge branch 'ps/odb-drop-whence' into jch
The whence field in struct object_info has been removed,
refactoring backend-specific object information retrieval into an
opt-in struct object_info_source structure.

* ps/odb-drop-whence:
  odb: document object info fields
  odb: drop `whence` field from object info
  treewide: convert users of `whence` to the new source field
  odb: add `source` field to struct object_info_source
  odb: make backend-specific fields optional
  packfile: thread odb_source_packed through packed_object_info()
2026-06-25 19:49:27 -07:00
Junio C Hamano
8811ac8af4 Merge branch 'tb/pack-path-walk-bitmap-delta-islands' into jch
The pack-objects command now supports using reachability bitmaps and
delta-islands concurrently with the `--path-walk` option, allowing
faster packaging by falling back to path-walk when bitmaps cannot
fully satisfy the request.

* tb/pack-path-walk-bitmap-delta-islands:
  pack-objects: support `--delta-islands` with `--path-walk`
  pack-objects: extract `record_tree_depth()` helper
  pack-objects: support reachability bitmaps with `--path-walk`
  t/perf: drop p5311's lookup-table permutation
2026-06-25 19:49:19 -07:00
Junio C Hamano
2a8c778710 Merge branch 'ps/odb-source-packed' into jch
The packed object source has been refactored into a proper struct
odb_source.

* ps/odb-source-packed:
  odb/source-packed: drop pointer to "files" parent source
  midx: refactor interfaces to work on "packed" source
  odb/source-packed: stub out remaining functions
  odb/source-packed: wire up `freshen_object()` callback
  odb/source-packed: wire up `find_abbrev_len()` callback
  odb/source-packed: wire up `count_objects()` callback
  odb/source-packed: wire up `for_each_object()` callback
  odb/source-packed: wire up `read_object_stream()` callback
  odb/source-packed: wire up `read_object_info()` callback
  packfile: use higher-level interface to implement `has_object_pack()`
  odb/source-packed: wire up `reprepare()` callback
  odb/source-packed: wire up `close()` callback
  odb/source-packed: start converting to a proper `struct odb_source`
  odb/source-packed: store pointer to "files" instead of generic source
  packfile: move packed source into "odb/" subsystem
  packfile: split out packfile list logic
  packfile: rename `struct packfile_store` to `odb_source_packed`
2026-06-25 19:49:17 -07:00
Patrick Steinhardt
4f48d2a241 treewide: convert users of whence to the new source field
The `whence` field has become redundant now that callers can learn about
the exact source an object has been looked up from via the `struct
object_info_source::source` field.

Adapt callers to use the new field. Note that all callsites already set
up the `info.sourcep` request pointer, so the conversion is rather
straight-forward.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-24 10:12:35 -07:00
Patrick Steinhardt
695797490e odb: make backend-specific fields optional
The `struct object_info` carries two pieces of information
about how an object was looked up:

  - The `whence` enum identifying the backend.

  - The backend-tagged union `u` exposing backend-specific details
    (currently only the packed-source case, which records the owning
    pack, offset and packed object type).

The union is populated unconditionally, even though most callers don't
care about provenance at all.

Split the backend-specific union out into a new public type, `struct
object_info_source`, and make the object info structure carry it via
just another opt-in request pointer. As with all the other requestable
information, callers that need source info allocate a `struct
object_info_source` on the stack and point `sourcep` at it; callers that
don't care about it simply leave the field as a `NULL` pointer. Adapt
callers accordingly.

Note that the `whence` enum is strictly-speaking also backend-specific
information, so it would be another good candidate to be moved into the
`struct object_info_source`. For now though it is left alone, as it will
be replaced by a `struct odb_source` pointer in a subsequent commit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-24 10:12:35 -07:00
Patrick Steinhardt
1b9f137b43 packfile: thread odb_source_packed through packed_object_info()
Add an optional `struct odb_source_packed *source` parameter to
`packed_object_info()` and `packed_object_info_with_index_pos()`. This
parameter is unused at this point in time, but it will be used in a
follow-up commit so that we can record the source of a specific object.

Note that callers in "odb/source-packed.c" pass the already-available
source, but all other callers pass `NULL` instead. This is fine though,
as we only care about populating this info when called via the packed
store.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-24 10:12:35 -07:00
Junio C Hamano
2ed34f72cf Merge branch 'ps/odb-source-packed' into ps/odb-drop-whence
* ps/odb-source-packed:
  odb/source-packed: drop pointer to "files" parent source
  midx: refactor interfaces to work on "packed" source
  odb/source-packed: stub out remaining functions
  odb/source-packed: wire up `freshen_object()` callback
  odb/source-packed: wire up `find_abbrev_len()` callback
  odb/source-packed: wire up `count_objects()` callback
  odb/source-packed: wire up `for_each_object()` callback
  odb/source-packed: wire up `read_object_stream()` callback
  odb/source-packed: wire up `read_object_info()` callback
  packfile: use higher-level interface to implement `has_object_pack()`
  odb/source-packed: wire up `reprepare()` callback
  odb/source-packed: wire up `close()` callback
  odb/source-packed: start converting to a proper `struct odb_source`
  odb/source-packed: store pointer to "files" instead of generic source
  packfile: move packed source into "odb/" subsystem
  packfile: split out packfile list logic
  packfile: rename `struct packfile_store` to `odb_source_packed`
2026-06-24 10:12:12 -07:00
Junio C Hamano
02bb39c5cb Merge branch 'js/objects-larger-than-4gb-on-windows-more'
* js/objects-larger-than-4gb-on-windows-more:
  odb: use size_t for object_info.sizep and the size APIs
  packfile,delta: drop the `cast_size_t_to_ulong()` wrappers
  pack-objects: use size_t for in-core object sizes
  packfile: widen unpack_entry()'s size out-parameter to size_t
  pack-objects(check_pack_inflate()): use size_t instead of unsigned long
  patch-delta: use size_t for sizes
  compat/msvc: use _chsize_s for ftruncate
2026-06-21 16:41:38 -07:00
Taylor Blau
7e6de2ac62 pack-objects: support --delta-islands with --path-walk
Since the inception of `--path-walk`, this option has had a documented
incompatibility with `--delta-islands`.

When discussing those original patches on the list, a message from
Stolee in [1] noted the following:

    this could be remedied by [...] doing a separate walk to identify
    islands using the normal method

In a related portion of the thread, Peff explains[2]:

    The delta islands code already does its own tree walk to propagate
    the bits down (it does rely on the base walk's show_commit() to
    propagate through the commits).

    Once each object has its island bitmaps, I think however you
    choose to come up with delta candidates [...] you should be able
    to use it. It's fundamentally just answering the question of "am
    I allowed to delta between these two objects".

That is similar to what this patch does, and it turns out the cheaper
option is sufficient: perform the same island side effects from the
path-walk callback rather than doing a second walk.

Recall how delta-islands are computed during a normal repack:

 - `show_commit()` calls `propagate_island_marks()` for each commit,
   which merges the commit's island bitset onto its root tree object and
   onto each of its parent commits.

 - `show_object()` for a tree records the tree's depth derived from the
   slash-separated pathname. Subsequent `resolve_tree_islands()` uses
   that depth to walk trees in increasing-depth order, propagating each
   tree's marks to its children.

 - At delta-search time, `in_same_island()` enforces that a delta
   target's island bitmap is a subset of its base's: every island that
   reaches the target must also reach the base.

Path-walk's enumeration callback is `add_objects_by_path()`. It already
adds objects to `to_pack`, but until now did not perform the
island-related side effects. Two things are needed:

 - For each commit batch, call `propagate_island_marks()` on commits,
   exactly as `show_commit()` does.

   We have to be careful about the order in which we call this function,
   and we must see a commit before its parents in order to have
   island marks to propagate.

   The path-walk batch preserves that order. Path-walk appends commits
   to its `OBJ_COMMIT` batch as they come back from the same
   `get_revision()` loop the regular traversal uses, and
   `add_objects_by_path()` iterates the batch in array order. So every
   commit reaches `propagate_island_marks()` in the same sequence that
   `show_commit()` would have seen it, and the descendant-first chain
   that the algorithm relies on is intact.

   Skip island propagation for excluded commits to match the regular
   traversal, whose `show_commit()` callback is only invoked for
   interesting commits. Boundary commits may still be present in
   path-walk's callback so they can serve as thin-pack bases, but they
   should not contribute island marks.

 - For each tree batch, record the tree's depth from the path. Use the
   `record_tree_depth()` helper from the previous commit so both
   callbacks behave identically, including the max-depth-wins behavior
   when a tree is reached via more than one path. The helper accepts
   both the `show_object()` path shape ("foo", "foo/bar") and the
   path-walk shape with a trailing slash ("foo/", "foo/bar/"), so depths
   recorded from either traversal mode are directly comparable.

   This is implicit in the implementation sketch from Peff above.
   `resolve_tree_islands()` sorts trees by `oe->tree_depth` in
   increasing-depth order before propagating marks down, so that a
   parent tree's marks are finalized before its children inherit them.
   Without recording the depth at path-walk time, every
   path-walk-discovered tree would land at depth 0 in `to_pack`, the
   sort would lose its ordering, and children could inherit marks from
   parents whose own contributions had not yet been merged in.

With those two pieces in place, `resolve_tree_islands()` receives the
same island inputs from path-walk as it would from the regular
traversal, so the existing island checks can be reused unchanged.

Drop the documented incompatibility between `--path-walk` and
`--delta-islands`, and add t5320 coverage for path-walk island repacks
with and without bitmap writing, as well as the same-island case where a
delta remains allowed.

[1]: https://lore.kernel.org/git/9aa2471b-0850-4707-9733-d3b33609f5f2@gmail.com/
[2]: https://lore.kernel.org/git/20240911063203.GA1538586@coredump.intra.peff.net/

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-21 16:26:14 -07:00
Taylor Blau
264efee401 pack-objects: extract record_tree_depth() helper
Prepare for a subsequent change that needs to record tree depths from a
second call site by factoring the delta-islands tree-depth bookkeeping
out of `show_object()` and into a helper, `record_tree_depth()`.

The helper looks up the object in `to_pack`, returns early when the
object was not added there, computes the depth from the slash count in
the supplied name, and preserves the existing max-depth-wins behavior
when a tree is reached by more than one path.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-21 16:26:14 -07:00
Taylor Blau
0a37451106 pack-objects: support reachability bitmaps with --path-walk
When 'pack-objects' is invoked with '--path-walk', it prevents us from
using reachability bitmaps.

This behavior dates back to 70664d2865 (pack-objects: add --path-walk
option, 2025-05-16), which included a comment in the relevant portion of
the command-line arguments handling that read as follows:

    /*
     * We must disable the bitmaps because we are removing
     * the --objects / --objects-edge[-aggressive] options.
     */

In fb2c309b7d (pack-objects: pass --objects with --path-walk,
2026-05-02), path-walk learned to pass '--objects' again, but still
kept bitmap traversal disabled. That leaves two useful cases
unsupported:

 * A path-walk repack that writes bitmaps does not give the bitmap
   selector any commits, because path-walk reveals commits through
   `add_objects_by_path()` rather than through `show_commit()`, where
   `index_commit_for_bitmap()` is normally called.

 * An invocation like "git pack-objects --use-bitmap-index --path-walk"
   never tries an existing bitmap, even when one is available and could
   answer the request.

Fortunately for us, neither restriction is required.

 * On the writing side: teach the path-walk object callback to call
   `index_commit_for_bitmap()` for commits that it adds to the pack.
   That gives the bitmap selector the commit candidates it would have
   seen from the regular traversal.

 * For bitmap reading, keep passing '--objects' to the internal rev_list
   machinery, but stop clearing `use_bitmap_index`. If an existing
   bitmap can answer the request, use it; otherwise fall back to
   path-walk's own enumeration.

As a result, we can see significantly reduced pack generation times from
p5311 (with our `GIT_PERF_REPO` set to a recent clone of the fluentui
repository) before this commit:

    Test                                            HEAD^             HEAD
    ----------------------------------------------------------------------------------------
    5311.40: server (1 days, --path-walk)           1.43(1.39+0.04)   0.01(0.01+0.00) -99.3%
    5311.41: size   (1 days, --path-walk)                    139.6K            139.7K +0.0%
    5311.42: client (1 days, --path-walk)           0.02(0.02+0.00)   0.02(0.02+0.00) +0.0%
    5311.44: server (2 days, --path-walk)           1.43(1.39+0.04)   0.01(0.00+0.00) -99.3%
    5311.45: size   (2 days, --path-walk)                    139.6K            139.7K +0.0%
    5311.46: client (2 days, --path-walk)           0.02(0.02+0.00)   0.02(0.02+0.00) +0.0%
    5311.48: server (4 days, --path-walk)           1.44(1.39+0.04)   0.01(0.01+0.00) -99.3%
    5311.49: size   (4 days, --path-walk)                    238.1K            238.1K +0.0%
    5311.50: client (4 days, --path-walk)           0.03(0.03+0.00)   0.03(0.03+0.00) +0.0%
    5311.52: server (8 days, --path-walk)           1.43(1.39+0.03)   0.01(0.00+0.00) -99.3%
    5311.53: size   (8 days, --path-walk)                    344.9K            344.9K +0.0%
    5311.54: client (8 days, --path-walk)           0.07(0.07+0.00)   0.07(0.08+0.00) +0.0%
    5311.56: server (16 days, --path-walk)          1.47(1.44+0.03)   0.10(0.08+0.01) -93.2%
    5311.57: size   (16 days, --path-walk)                   844.0K            844.0K +0.0%
    5311.58: client (16 days, --path-walk)          0.09(0.09+0.00)   0.09(0.09+0.00) +0.0%
    5311.60: server (32 days, --path-walk)          1.52(1.50+0.05)   0.14(0.15+0.02) -90.8%
    5311.61: size   (32 days, --path-walk)                     4.2M              4.2M +0.1%
    5311.62: client (32 days, --path-walk)          0.34(0.48+0.02)   0.34(0.45+0.05) +0.0%
    5311.64: server (64 days, --path-walk)          1.55(1.52+0.06)   0.15(0.15+0.04) -90.3%
    5311.65: size   (64 days, --path-walk)                     6.4M              6.4M -0.0%
    5311.66: client (64 days, --path-walk)          0.51(0.79+0.05)   0.51(0.80+0.06) +0.0%
    5311.68: server (128 days, --path-walk)         1.59(1.57+0.06)   0.16(0.21+0.01) -89.9%
    5311.69: size   (128 days, --path-walk)                    8.4M              8.4M -0.0%
    5311.70: client (128 days, --path-walk)         0.72(1.44+0.08)   0.71(1.47+0.09) -1.4%

We get the same size of output pack, but this commit allows us to do so
in a significantly shorter amount of time. Intuitively, we're generating
the same pack (hence the unchanged 'test_size' output from run to run),
but varying how we get there. Before this commit, pack-objects prefers
'--path-walk' to '--use-bitmap-index', so we generate the output pack by
performing a normal '--path-walk' traversal. With this commit, we are
operating over a *repacked* state (that itself was done with a
'--path-walk' traversal), but are able to perform pack-reuse on that
repacked state via bitmaps.

When comparing the size of the repacked pack with/without '--path-walk'
on the previous commit versus this one, we see that (a) the repacked size
improves significantly with '--path-walk', and that (b) writing bitmaps
during repacking does not regress this improvement:

    Test                                            HEAD^             HEAD
    ----------------------------------------------------------------------------------------
    5311.3: size of bitmapped pack                           558.4M            558.5M +0.0%
    5311.38: size of bitmapped pack (--path-walk)            164.4M            164.4M +0.0%

(Note that to observe an improvement here, we must repack with '-F' in
order to avoid reusing non-'--path-walk' deltas, which would otherwise
skew our results.)

There is one wrinkle when it comes to '--boundary', which we must not
pass into the bitmap walk in the presence of both '--path-walk' and
'--use-bitmap-index'. Path-walk needs boundary commits when it performs
its own traversal, in order to discover bases for thin packs, but the
bitmap traversal does not expect this. Work around this by setting
`revs->boundary` as late as possible within the '--path-walk' traversal,
after any bitmap attempt has either succeeded or declined to answer the
request.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-21 16:26:14 -07:00
Patrick Steinhardt
7fa8c61afe midx: refactor interfaces to work on "packed" source
Our interfaces used to interact with MIDXs all work on top of the
generic `struct odb_source`. This doesn't make much sense though: a MIDX
is strictly tied to the "packed" source, so passing in a generic source
gives the false sense that it may also work with a different type of
source.

Fix this conceptual weirdness and instead require the caller to pass in
a "packed" source explicitly. This also makes the next commit easier to
implement, where we drop the pointer to the "files" source in the
"packed" source.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-17 05:00:01 -07:00
Patrick Steinhardt
7ed53cde28 odb/source-packed: wire up for_each_object() callback
Move `packfile_store_for_each_object()` and its associated helpers from
"packfile.c" into "odb/source-packed.c" and wire it up as the
`for_each_object()` callback of the "packed" source.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-17 05:00:00 -07:00
Junio C Hamano
6e148f82dc Merge branch 'kk/streaming-walk-pqueue'
Streaming revision walks have been optimized by using a priority queue
for date-sorting commits, speeding up walks repositories with many
merges.

* kk/streaming-walk-pqueue:
  revision: use priority queue for non-limited streaming walks
  revision: introduce rev_walk_mode to clarify get_revision_1()
  pack-objects: call release_revisions() after cruft traversal
2026-06-16 09:01:02 -07:00
Johannes Schindelin
c6a4629e32 odb: use size_t for object_info.sizep and the size APIs
When `js/objects-larger-than-4gb-on-windows` widened the streaming,
index-pack and unpack-objects code paths, in the interest of keeping the
patches somewhat reasonably-sized, it left the public ODB API still
typed in `unsigned long`. In particular `struct object_info::sizep` and
the four wrappers built on top of it (`odb_read_object`,
`odb_read_object_peeled`, `odb_read_object_info`, `odb_pretend_object`)
still return the unpacked size through `unsigned long *`, so on Windows
`cat-file -s` and the `git add` / `git status` paths for a >4 GiB blob
silently cap at 4 GiB.

Widen the field and the four wrappers. The previous commits already
widened the `unpack_entry()` cascade and pack-objects' in-core size
accessors, so most of the cascade arrives here with no further work: the
temporary shims in `packed_object_info_with_index_pos()` and in
`unpack_entry()`'s delta-base recovery path go away, the two
`SET_SIZE(entry, cast_size_t_to_ulong(canonical_size))` calls in
`check_object()` and the matching one in `drop_reused_delta()` collapse
to plain `SET_SIZE`, and `oe_get_size_slow()`'s tail
`cast_size_t_to_ulong()` is gone too.

What remains narrow are the boundaries this series does not
intend to touch: the diff, blame, textconv and fast-import machinery.

Even so, this patch is unfortunately quite large.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-15 07:45:41 -07:00
Johannes Schindelin
188bac14f7 pack-objects: use size_t for in-core object sizes
`pack-objects` stores per-entry object sizes in either the 31-bit
`size_` member of the `struct object_entry` or, when the value does not
fit, the `pack->delta_size[]` spill array.  The accessors (`oe_size`,
`oe_delta_size`, `oe_get_size_slow`, `oe_size_*_than`) and the setters
(`oe_set_size`, `oe_set_delta_size`) used `unsigned long` for the spill
type, which on Windows means the spill silently caps at 4 GiB per entry.
That is what made `upload-pack` die with "object too large to read on
this platform" when serving the >4 GiB blob in `t5608` tests 5 and 6
when run with `GIT_TEST_CLONE_2GB`.

Widen them all to `size_t` (including `pack->delta_size`) and drop the
three `cast_size_t_to_ulong()` calls in `check_object()` that guarded
`in_pack_size`.  The two `SET_SIZE(entry, canonical_size)` calls in the
same function stay cast-free as before, since `canonical_size` is still
`unsigned long` until a later commit widens `object_info::sizep`.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-15 07:45:41 -07:00
Johannes Schindelin
1d43315b31 pack-objects(check_pack_inflate()): use size_t instead of unsigned long
`write_reuse_object()` learned to track its packed-object size as
`size_t` in 606c192380 (odb, packfile: use size_t for streaming
object sizes, 2026-05-08), but the comparison sink it feeds,
`check_pack_inflate()`, still takes the expected decompressed size
as `unsigned long`. The call site bridges the mismatch with
`cast_size_t_to_ulong()`, which on Windows turns a >4 GiB object
into an immediate die().

That function only uses `expect` once: as the right-hand side of a
`stream.total_out == expect` equality test against zlib's counter.
zlib's own `total_out` counter is `uLong` and is therefore still
32-bit-bound on Windows. Widening `expect` to `size_t` cannot fix that,
but it is a strict improvement nonetheless: instead of dying outright,
an oversized object now simply makes the equality fail and lets
`write_reuse_object()` fall back to `write_no_reuse_object()`, which
decompresses and re-deflates the content (and which the larger
pack-objects widening series targets separately).

Drop the `cast_size_t_to_ulong()` shim at the call site now that
the receiving parameter speaks the same type as `entry_size`.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-15 07:45:40 -07:00
Junio C Hamano
ff1784217f Merge branch 'ak/typofixes'
Typofixes.

* ak/typofixes:
  doc: fix typos via codespell
2026-06-15 07:42:00 -07:00
Junio C Hamano
883a47ef64 Merge branch 'ob/more-repo-config-values'
Many core configuration variables have been migrated from global
variables into 'repo_config_values' to tie them to a specific
repository instance, avoiding cross-repository state leakage.

* ob/more-repo-config-values:
  environment: move "warn_on_object_refname_ambiguity" into `struct repo_config_values`
  environment: move "sparse_expect_files_outside_of_patterns" into `struct repo_config_values`
  environment: move "core_sparse_checkout_cone" into `struct repo_config_values`
  environment: move "precomposed_unicode" into `struct repo_config_values`
  environment: move "pack_compression_level" into `struct repo_config_values`
  environment: move `zlib_compression_level` into `struct repo_config_values`
  environment: move "check_stat" into `struct repo_config_values`
  environment: move "trust_ctime" into `struct repo_config_values`
2026-06-15 07:42:00 -07:00
Junio C Hamano
06f63df846 Merge branch 'ps/odb-source-loose'
The loose object source has been refactored into a proper `struct
odb_source`.

* ps/odb-source-loose:
  odb/source-loose: drop pointer to the "files" source
  odb/source-loose: stub out remaining callbacks
  odb/source-loose: wire up `write_object_stream()` callback
  object-file: refactor writing objects to use loose source
  odb/source-loose: wire up `write_object()` callback
  loose: refactor object map to operate on `struct odb_source_loose`
  odb/source-loose: wire up `freshen_object()` callback
  odb/source-loose: drop `odb_source_loose_has_object()`
  odb/source-loose: wire up `count_objects()` callback
  odb/source-loose: wire up `find_abbrev_len()` callback
  odb/source-loose: wire up `for_each_object()` callback
  odb/source-loose: wire up `read_object_stream()` callback
  odb/source-loose: wire up `read_object_info()` callback
  odb/source-loose: wire up `close()` callback
  odb/source-loose: wire up `reprepare()` callback
  odb/source-loose: start converting to a proper `struct odb_source`
  odb/source-loose: store pointer to "files" instead of generic source
  odb/source-loose: move loose source into "odb/" subsystem
2026-06-11 04:31:18 -07:00
Andrew Kreimer
014c454799 doc: fix typos via codespell
There are some typos in the documentation, comments, etc.
Fix them via codespell, and then adjust the "dump" files
used by the subversion tests to match the updated contents.

Signed-off-by: Andrew Kreimer <algonell@gmail.com>
[dscho noticed and fixed the problems in svn test]
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
[jc did final assembling of the three patches]
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-08 00:21:35 +09:00
Junio C Hamano
f985a6ec65 Merge branch 'ps/odb-source-loose' into ps/odb-source-packed
* ps/odb-source-loose:
  odb/source-loose: drop pointer to the "files" source
  odb/source-loose: stub out remaining callbacks
  odb/source-loose: wire up `write_object_stream()` callback
  object-file: refactor writing objects to use loose source
  odb/source-loose: wire up `write_object()` callback
  loose: refactor object map to operate on `struct odb_source_loose`
  odb/source-loose: wire up `freshen_object()` callback
  odb/source-loose: drop `odb_source_loose_has_object()`
  odb/source-loose: wire up `count_objects()` callback
  odb/source-loose: wire up `find_abbrev_len()` callback
  odb/source-loose: wire up `for_each_object()` callback
  odb/source-loose: wire up `read_object_stream()` callback
  odb/source-loose: wire up `read_object_info()` callback
  odb/source-loose: wire up `close()` callback
  odb/source-loose: wire up `reprepare()` callback
  odb/source-loose: start converting to a proper `struct odb_source`
  odb/source-loose: store pointer to "files" instead of generic source
  odb/source-loose: move loose source into "odb/" subsystem
2026-06-05 22:26:06 +09:00
Olamide Caleb Bello
8407abf02a environment: move "warn_on_object_refname_ambiguity" into struct repo_config_values
The `core.warnAmbiguousRefs` configuration was previously stored in a
global `int` variable, making it shared across repository instances
and risking cross‑repository state leakage.

Store it instead in `repo_config_values`, where eagerly‑parsed
repository configuration lives. This option is parsed eagerly because
ambiguity warnings influence how users interpret object references in
many commands; a lazy parse could cause these warnings to behave
inconsistently or to appear for the wrong repository, confusing users
and hindering libification. This preserves the existing behavior while
tying the value to the repository from which it was read, avoiding
cross‑repository state leakage and continuing the effort to reduce
reliance on global configuration state.

Update all references to use `repo_config_values()`.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Usman Akinyemi <usmanakinyemi202@gmail.com>
Signed-off-by: Olamide Caleb Bello <belkid98@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-03 08:36:48 +09:00
Olamide Caleb Bello
8cd7402acc environment: move "pack_compression_level" into struct repo_config_values
The `pack_compression_level` configuration is currently stored in the
global variable `pack_compression_level`, which makes it shared across
repository instances within a single process.

Store it instead in `repo_config_values`, where eagerly‑parsed
repository configuration lives. `pack_compression_level` is parsed
eagerly because it influences packfile compression, a core operation
where a lazy parse could cause inconsistent behavior and hamper
libification. This preserves the existing eager‑parsing behavior while
tying the value to the repository from which it was read, avoiding
cross‑repository state leakage and continuing the effort to reduce
reliance on global configuration state.

Update all references to use `repo_config_values()`.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Usman Akinyemi <usmanakinyemi202@gmail.com>
Signed-off-by: Olamide Caleb Bello <belkid98@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-03 08:36:48 +09:00
Junio C Hamano
ffaa2eddd0 Merge branch 'ds/path-walk-filters'
The "git pack-objects --path-walk" traversal has been integrated
with several object filters, including blobless and sparse filters.

* ds/path-walk-filters:
  path-walk: support `combine` filter
  path-walk: support `object:type` filter
  path-walk: support `tree:0` filter
  t6601: tag otherwise-unreachable trees
  pack-objects: support sparse:oid filter with path-walk
  path-walk: add pl_sparse_trees to control tree pruning
  path-walk: support blob size limit filter
  backfill: die on incompatible filter options
  path-walk: support blobless filter
  path-walk: always emit directly-requested objects
  t/perf: add pack-objects filter and path-walk benchmark
  pack-objects: pass --objects with --path-walk
  t5620: make test work with path-walk var
2026-06-02 16:15:29 +09:00
Patrick Steinhardt
86f7ab5a1f odb/source-loose: drop odb_source_loose_has_object()
The function `odb_source_loose_has_object()` checks whether a specific
object exists as a loose object on disk by using lstat(3p). This
interface is somewhat redundant, as we typically check for object
existence in a generic way via `odb_source_read_object_info()`.

In fact, these two calls are redundant in case the latter is called in a
specific way: when called without an object info request and without the
`OBJECT_INFO_QUICK` flag, then we will end up doing the same call to
lstat(3p) in `read_object_info_from_path()`.

Drop the function and adapt callers to instead use the generic
interface so that its calling conventions align with that of other
sources.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-01 18:47:18 +09:00
Junio C Hamano
59cccb3b0c Merge branch 'ds/path-walk-filters' into tb/pack-path-walk-bitmap-delta-islands
* ds/path-walk-filters:
  path-walk: support `combine` filter
  path-walk: support `object:type` filter
  path-walk: support `tree:0` filter
  t6601: tag otherwise-unreachable trees
  pack-objects: support sparse:oid filter with path-walk
  path-walk: add pl_sparse_trees to control tree pruning
  path-walk: support blob size limit filter
  backfill: die on incompatible filter options
  path-walk: support blobless filter
  path-walk: always emit directly-requested objects
  t/perf: add pack-objects filter and path-walk benchmark
  pack-objects: pass --objects with --path-walk
  t5620: make test work with path-walk var
2026-05-30 10:10:56 +09:00
Kristofer Karlsson
9f4e170dfc pack-objects: call release_revisions() after cruft traversal
enumerate_and_traverse_cruft_objects() initializes a rev_info on the
stack but never calls release_revisions() afterwards.  This is not
visible on master but becomes a leak once the revision walking
machinery uses dynamically allocated structures.

Add the missing release_revisions() call.

Signed-off-by: Kristofer Karlsson <krka@spotify.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-28 06:08:19 +09:00
Derrick Stolee
2dc858e69e pack-objects: support sparse:oid filter with path-walk
The --filter=sparse:<oid> option to 'git pack-objects' allows focusing
an object set to a sparse-checkout definition. This reduces the set of
matching blobs while retaining all reachable trees. No server currently
supports fetching with this filter because it is expensive to compute
and reachability bitmaps do not help without a significant effort to
extend the bitmap feature to store bitmaps for each supported sparse-
checkout definition.

Without focusing on serving fetches and clones with these filters, there
are still benefits that could be realized by making this faster. With
the sparse index, it's more realistic now than ever to be able to
operate a local clone that was bootstrapped by a packfile created with
a sparse filter, because the missing trees are not needed to move a
sparse-checkout from one commit to another or to view the history of any
path in scope. Such clones could perhaps be bootstrapped by partial
bundles.

Previously, constructing these sparse packs has been incredibly
computationally inefficient. The revision walk that explores which
objects are in scope spends a lot of time checking each object to see if
it matches the sparse-checkout patterns, causing quadratic behavior
(number of objects times number of sparse-checkout patterns). This
improves somewhat when using cone-mode sparse-checkout patterns that can
use hashtables and prefix matches to determine containment. However, the
check per object is still too expensive for most cases.

This is where the path-walk feature comes in. We can proceed as normal
by placing objects in bins by path and _then_ check a group of objects
all at once. Since sparse:<oid> only restricts blobs, the path-walk must
include all reachable trees while using the cone-mode patterns to skip
blobs at paths outside the sparse scope. This establishes a baseline for
a potential future "treesparse:<oid>" filter that would also restrict
trees, but introducing such a new filter is deferred to a later change.

The implementation here is focused around loading the sparse-checkout
patterns from the provided object ID and checking that the patterns are
indeed cone-mode patterns. We can then load the correct pattern list
into the path walk context and use the logic that already exists from
bff4555767 (backfill: add --sparse option, 2025-02-03), though that
feature loads sparse-checkout patterns from the worktree's local
settings and also restricts tree objects. We use a combination of errors
and warnings to signal problems during this load. The difference is that
errors are likely fatal for the non-path-walk version while the warnings
are probably just implementation details for the path-walk version and
the 'git pack-objects' command can fall back to the revision walk
version.

Now that the SEEN flag is deferred until after pattern checks (from the
previous commit), handle the case where a tree with a shared OID appears
at both an out-of-cone and in-cone path. When trees are not being pruned
(pl_sparse_trees == 0), the path-walk re-walks the tree at the in-cone
path so that in-cone blobs within it are discovered. The new tests in
t5317 and t6601 demonstrate this behavior and would fail without these
changes.

The performance test p5315 shows the impact of this change when using
sparse filters:

Test                                              HEAD~1     HEAD
----------------------------------------------------------------------
5315.10: repack (sparse:oid)                      77.98    77.47  -0.7%
5315.11: repack size (sparse:oid)                187.5M   187.4M  -0.0%
5315.12: repack (sparse:oid, --path-walk)         77.91    31.41 -59.7%
5315.13: repack size (sparse:oid, --path-walk)   187.5M   161.1M -14.1%

These performance tests were run on the Git repository. The --path-walk
feature shows meaningful space savings (14% smaller for sparse packs)
and dramatic time savings (60% faster) by leveraging the path-walk's
ability to skip blobs outside the sparse scope.

Co-authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Taylor Blaue <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-24 18:41:06 +09:00
Derrick Stolee
6d87f0e8a3 path-walk: support blobless filter
The 'git pack-objects' command can opt-in to using the path-walk API for
scanning the objects. Currently, this option is dynamically disabled if
combined with '--filter=<X>', even when using a simple filter such as
'blob:none' to signal a blobless packfile. This is a common scenario for
repos at scale, so is worth integrating.

Also, users can opt-in to the '--path-walk' option by default through
the pack.usePathWalk=true config option. When using that in a blobless
partial clone, the following warning can appear even though the user did
not specify either option directly:

  warning: cannot use --filter with --path-walk

Teach the path-walk API to handle the 'blob:none' object filter
natively. When revs->filter.choice is LOFC_BLOB_NONE, the path-walk
sets info->blobs to 0 (skipping all blob objects) and clears the
filter from revs so that prepare_revision_walk() does not reject the
configuration.

This check is implemented in the static prepare_filters() method, which
will simultaneously check if the input filters are compatible and will
make the appropriate mutations to the path_walk_info and filters if the
path_walk_info is non-NULL. This allows us to use this logic both in the
API method path_walk_filter_compatible() for use in
builtin/pack-objects.c and as a prep step in walk_objects_by_path().

Update the test helper (test-path-walk) to accept --filter=<spec>
as a test-tool option (before '--'), applying it to revs after
setup_revisions() to avoid the --objects requirement check. We can also
revert recent GIT_TEST_PACK_PATH_WALK overrides in t5620.

Also switch test-path-walk from REV_INFO_INIT with manual repo
assignment to repo_init_revisions(), which properly initializes
the filter_spec strbuf needed for filter parsing.

Add tests for blob:none with --all and with a single branch.

The performance test p5315 shows the impact of this change when using
blobless filters:

Test                                           HEAD~1     HEAD
---------------------------------------------------------------------
5315.6: repack (blob:none)                      13.53   13.87  +2.5%
5315.7: repack size (blob:none)                137.7M  137.8M  +0.1%
5315.8: repack (blob:none, --path-walk)         13.51   23.43 +73.4%
5315.9: repack size (blob:none, --path-walk)   137.7M  115.2M -16.3%

These performance tests were run on the Git repository. The --path-walk
feature shows meaningful space savings (16% smaller for blobless packs)
at the cost of increased computation time due to the two compression
passes. This data demonstrates that the feature is engaged and provides
real compression benefits when --no-reuse-delta forces fresh deltas.

Co-Authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-24 18:41:06 +09:00
Derrick Stolee
35567889ef pack-objects: pass --objects with --path-walk
When 'git pack-objects' has the --path-walk option enabled, it uses a
different set of revision walk parameters than normal. For one,
--objects was previously assumed by the path-walk API and could be
omitted. We also needed --boundary to allow discovering UNINTERESTING
objects to use as delta bases.

We will be updating the path-walk API soon to work with some filter
options. However, the revision machinery will trigger a fatal error:

  fatal: object filtering requires --objects

The fix is easy: add the --objects option as an argument. This has no
effect on the path-walk API but does simplify the revision option
parsing for the objects filter.

We can remove the comment about "removing" the options because they were
never removed and instead not added. We still need to disable using
bitmaps.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-24 18:41:06 +09:00
Johannes Schindelin
606c192380 odb, packfile: use size_t for streaming object sizes
The odb_read_stream structure uses unsigned long for the size field,
which is 32-bit on Windows even in 64-bit builds. When streaming
objects larger than 4GB, the size would be truncated to zero or an
incorrect value, resulting in empty files being written to disk.

Change the size field in odb_read_stream to size_t and introduce
unpack_object_header_sz() to return sizes via size_t pointer. Since
object_info.sizep remains unsigned long for API compatibility, use
temporary variables where the types differ, with comments noting the
truncation limitation for code paths that still use unsigned long.

Widening the producers to size_t in this way introduces a handful of
silent size_t -> unsigned long narrowings on Windows, all in
builtin/pack-objects.c, where the consumers are still typed
unsigned long. Make those narrowings explicit with
cast_size_t_to_ulong() so they assert loudly the moment an object
actually exceeds ULONG_MAX bytes:

  - oe_get_size_slow() returns unsigned long but holds a size_t
    locally; cast at the return.
  - write_reuse_object() passes a size_t into check_pack_inflate(),
    whose expect parameter is unsigned long; cast at the call.
  - check_object() routes a size_t through SET_SIZE() and
    SET_DELTA_SIZE(), both of which take unsigned long via
    oe_set_size() / oe_set_delta_size(); cast at the three call
    sites in the OBJ_OFS_DELTA / OBJ_REF_DELTA branches and in the
    non-delta default arm.

The cast-only treatment is deliberately a stop-gap. Properly
widening oe_set_size, oe_get_size_slow's return type,
check_pack_inflate's expect parameter, object_info.sizep,
patch_delta, and the OE_SIZE_BITS bit-fields cascades into a series
that is too large to be reviewable, so the proper widening is
deferred to a follow-up topic. Until then,
cast_size_t_to_ulong() at least makes the truncation explicit at
the source: it documents the boundary, and on a 64-bit non-Windows
platform it is a no-op.

This was originally authored by LordKiRon <https://github.com/LordKiRon>,
who preferred not to reveal their real name and therefore agreed that I
take over authorship.

Helped-by: Torsten Bögershausen <tboegi@web.de>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-09 11:25:31 +09:00
Junio C Hamano
03311dca7f Merge branch 'tb/stdin-packs-excluded-but-open'
pack-objects's --stdin-packs=follow mode learns to handle
excluded-but-open packs.

* tb/stdin-packs-excluded-but-open:
  repack: mark non-MIDX packs above the split as excluded-open
  pack-objects: support excluded-open packs with --stdin-packs
  t7704: demonstrate failure with once-cruft objects above the geometric split
  pack-objects: refactor `read_packs_list_from_stdin()` to use `strmap`
  pack-objects: plug leak in `read_stdin_packs()`
2026-04-06 15:42:49 -07:00
Junio C Hamano
d75badf83b Merge branch 'ps/odb-generic-object-name-handling'
Object name handling (disambiguation and abbreviation) has been
refactored to be backend-generic, moving logic into the respective
object database backends.

* ps/odb-generic-object-name-handling:
  odb: introduce generic `odb_find_abbrev_len()`
  object-file: move logic to compute packed abbreviation length
  object-name: move logic to compute loose abbreviation length
  object-name: simplify computing common prefixes
  object-name: abbreviate loose object names without `disambiguate_state`
  object-name: merge `update_candidates()` and `match_prefix()`
  object-name: backend-generic `get_short_oid()`
  object-name: backend-generic `repo_collect_ambiguous()`
  object-name: extract function to parse object ID prefixes
  object-name: move logic to iterate through packed prefixed objects
  object-name: move logic to iterate through loose prefixed objects
  odb: introduce `struct odb_for_each_object_options`
  oidtree: extend iteration to allow for arbitrary return codes
  oidtree: modernize the code a bit
  object-file: fix sparse 'plain integer as NULL pointer' error
2026-04-06 15:42:49 -07:00
Taylor Blau
3f7c0e722e pack-objects: support excluded-open packs with --stdin-packs
In cd846bacc7 (pack-objects: introduce '--stdin-packs=follow',
2025-06-23), pack-objects learned to traverse through commits in
included packs when using '--stdin-packs=follow', rescuing reachable
objects from unlisted packs into the output.

When we encounter a commit in an excluded pack during this rescuing
phase we will traverse through its parents. But because we set
`revs.no_kept_objects = 1`, commit simplification will prevent us from
showing it via `get_revision()`. (In practice, `--stdin-packs=follow`
walks commits down to the roots, but only opens up trees for ones that
do not appear in an excluded pack.)

But there are certain cases where we *do* need to see the parents of an
object in an excluded pack. Namely, if an object is rescue-able, but
only reachable from object(s) which appear in excluded packs, then
commit simplification will exclude those commits from the object
traversal, and we will never see a copy of that object, and thus not
rescue it.

This is what causes the failure in the previous commit during repacking.
When performing a geometric repack, packs above the geometric split that
weren't part of the previous MIDX (e.g., packs pushed directly into
`$GIT_DIR/objects/pack`) may not have full object closure.  When those
packs are listed as excluded via the '^' marker, the reachability
traversal encounters the sequence described above, and may miss objects
which we expect to rescue with `--stdin-packs=follow`.

Introduce a new "excluded-open" pack prefix, '!'. Like '^'-prefixed
packs, objects from '!'-prefixed packs are excluded from the resulting
pack. But unlike '^', commits in '!'-prefixed packs *are* used as
starting points for the follow traversal, and the traversal does not
treat them as a closure boundary.

In order to distinguish excluded-closed from excluded-open packs during
the traversal, introduce a new `pack_keep_in_core_open` bit on
`struct packed_git`, along with a corresponding `KEPT_PACK_IN_CORE_OPEN`
flag for the kept-pack cache.

In `add_object_entry_from_pack()`, move the `want_object_in_pack()`
check to *after* `add_pending_oid()`. This is necessary so that commits
from excluded-open packs are added as traversal tips even though their
objects won't appear in the output. As a consequence, the caller
`for_each_object_in_pack()` will always provide a non-NULL 'p', hence we
are able to drop the "if (p)" conditional.

The `include_check` and `include_check_obj` callbacks on `rev_info` are
used to halt the walk at closed-excluded packs, since objects behind a
'^' boundary are guaranteed to have closure and need not be rescued.

The following commit will make use of this new functionality within the
repack layer to resolve the test failure demonstrated in the previous
commit.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-03-27 13:40:40 -07:00
Taylor Blau
d31d1f2e06 pack-objects: refactor read_packs_list_from_stdin() to use strmap
The '--stdin-packs' mode of pack-objects maintains two separate
string_lists: one for included packs, and one for excluded packs. Each
list stores the pack basename as a string and the corresponding
`packed_git` pointer in its `->util` field.

This works, but makes it awkward to extend the set of pack "kinds" that
pack-objects can accept via stdin, since each new kind would need its
own string_list and duplicated handling. A future commit will want to do
just this, so prepare for that change by handling the various "kinds" of
packs specified over stdin in a more generic fashion.

Namely, replace the two `string_list`s with a single `strmap` keyed on
the pack basename, with values pointing to a new `struct
stdin_pack_info`. This struct tracks both the `packed_git` pointer and a
`kind` bitfield indicating whether the pack was specified as included or
excluded.

Extract the logic for sorting packs by mtime and adding their objects
into a separate `stdin_packs_add_pack_entries()` helper.

While we could have used a `string_list`, we must handle the case where
the same pack is specified more than once. With a `string_list` only, we
would have to pay a quadratic cost to either (a) insert elements into
their sorted positions, or (b) a repeated linear search, which is
accidentally quadratic. For that reason, use a strmap instead.

This patch does not include any functional changes.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-03-27 13:40:39 -07:00
Taylor Blau
81e2906437 pack-objects: plug leak in read_stdin_packs()
The `read_stdin_packs()` function added originally via 339bce27f4
(builtin/pack-objects.c: add '--stdin-packs' option, 2021-02-22)
declares a `rev_info` struct but neglects to call `release_revisions()`
on it before returning, creating the potential for a leak.

The related change in 97ec43247c (pack-objects: declare 'rev_info' for
'--stdin-packs' earlier, 2025-06-23) carried forward this oversight and
did not address it.

Ensure that we call `release_revisions()` appropriately to prevent a
potential leak from this function. Note that in practice our `rev_info`
here does not have a present leak, hence t5331 passes cleanly before
this commit, even when built with SANITIZE=leak.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-03-27 13:40:39 -07:00
Junio C Hamano
8023abc632 Merge branch 'ps/upload-pack-buffer-more-writes'
Reduce system overhead "git upload-pack" spends on relaying "git
pack-objects" output to the "git fetch" running on the other end of
the connection.

* ps/upload-pack-buffer-more-writes:
  builtin/pack-objects: reduce lock contention when writing packfile data
  csum-file: drop `hashfd_throughput()`
  csum-file: introduce `hashfd_ext()`
  sideband: use writev(3p) to send pktlines
  wrapper: introduce writev(3p) wrappers
  compat/posix: introduce writev(3p) wrapper
  upload-pack: reduce lock contention when writing packfile data
  upload-pack: prefer flushing data over sending keepalive
  upload-pack: adapt keepalives based on buffering
  upload-pack: fix debug statement when flushing packfile data
2026-03-24 12:31:34 -07:00
Patrick Steinhardt
cfd575f0a9 odb: introduce struct odb_for_each_object_options
The `odb_for_each_object()` function only accepts a bitset of flags. In
a subsequent commit we'll want to change object iteration to also
support iterating over only those objects that have a specific prefix.
While we could of course add the prefix to the function signature, or
alternatively introduce a new function, both of these options don't
really seem to be that sensible.

Instead, introduce a new `struct odb_for_each_object_options` that can
be passed to a new `odb_for_each_object_ext()` function. Splice through
the options structure into the respective object database sources.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-03-20 13:16:41 -07:00
Patrick Steinhardt
835e0aaf6f builtin/pack-objects: reduce lock contention when writing packfile data
When running `git pack-objects --stdout` we feed the data through
`hashfd_ext()` with a progress meter and a smaller-than-usual buffer
length of 8kB so that we can track throughput more granularly. But as
packfiles tend to be on the larger side, this small buffer size may
cause a ton of write(3p) syscalls.

Originally, the buffer we used in `hashfd()` was 8kB for all use cases.
This was changed though in 2ca245f8be (csum-file.h: increase hashfile
buffer size, 2021-05-18) because we noticed that the number of writes
can have an impact on performance. So the buffer size was increased to
128kB, which improved performance a bit for some use cases.

But the commit didn't touch the buffer size for `hashd_throughput()`.
The reasoning here was that callers expect the progress indicator to
update frequently, and a larger buffer size would of course reduce the
update frequency especially on slow networks.

While that is of course true, there was (and still is, even though it's
now a call to `hashfd_ext()`) only a single caller of this function in
git-pack-objects(1). This command is responsible for writing packfiles,
and those packfiles are often on the bigger side. So arguably:

  - The user won't care about increments of 8kB when packfiles tend to
    be megabytes or even gigabytes in size.

  - Reducing the number of syscalls would be even more valuable here
    than it would be for multi-pack indices, which was the benchmark
    done in the mentioned commit, as MIDXs are typically significantly
    smaller than packfiles.

  - Nowadays, many internet connections should be able to transfer data
    at a rate significantly higher than 8kB per second.

Update the buffer to instead have a size of `LARGE_PACKET_DATA_MAX - 1`,
which translates to ~64kB. This limit was chosen because `git
pack-objects --stdout` is most often used when sending packfiles via
git-upload-pack(1), where packfile data is chunked into pktlines when
using the sideband. Furthermore, most internet connections should have a
bandwidth signifcantly higher than 64kB/s, so we'd still be able to
observe progress updates at a rate of at least once per second.

This change significantly reduces the number of write(3p) syscalls from
355,000 to 44,000 when packing the Linux repository. While this results
in a small performance improvement on an otherwise-unused system, this
improvement is mostly negligible. More importantly though, it will
reduce lock contention in the kernel on an extremely busy system where
we have many processes writing data at once.

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-03-13 08:54:15 -07:00