Commit Graph

346 Commits

Author SHA1 Message Date
Johannes Schindelin
ae90a3ee5c Merge 'objects-larger-than-4gb-on-windows-pt2'
This is hidden in v2.55.0-rc0's own CI because of an omission in
5ba82911bc (ci: enable EXPENSIVE for contributor builds, 2026-05-11)
which fails to enable EXPENSIVE tests for tags.

Due to 7d78d5fc1a (ci: skip GitHub workflow runs for already-tested
commits/trees, 2020-10-08), the CI of `master` is now also mistakenly
green because it reuses the tag's CI run to prove that it's solid.

This is an evil merge by necessity because `survey.c` needs to adapt to
the changed function signatures.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-13 03:11:27 +00:00
Philip Oakley
0934e6fec5 hash algorithms: use size_t for section lengths
Continue walking the code path for the >4GB `hash-object --literally`
test to the hash algorithm step for LLP64 systems.

This patch lets the SHA1DC code use `size_t`, making it compatible with
LLP64 data models (as used e.g. by Windows).

The interested reader of this patch will note that we adjust the
signature of the `git_SHA1DCUpdate()` function without updating _any_
call site. This certainly puzzled at least one reviewer already, so here
is an explanation:

This function is never called directly, but always via the macro
`platform_SHA1_Update`, which is usually called via the macro
`git_SHA1_Update`. However, we never call `git_SHA1_Update()` directly
in `struct git_hash_algo`. Instead, we call `git_hash_sha1_update()`,
which is defined thusly:

    static void git_hash_sha1_update(git_hash_ctx *ctx,
                                     const void *data, size_t len)
    {
        git_SHA1_Update(&ctx->sha1, data, len);
    }

i.e. it contains an implicit downcast from `size_t` to `unsigned long`
(before this here patch). With this patch, there is no downcast anymore.

With this patch, finally, the t1007-hash-object.sh "files over 4GB hash
literally" test case is fixed.

Signed-off-by: Philip Oakley <philipoakley@iee.email>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-13 02:58:28 +00:00
Philip Oakley
cf20a42fd8 object-file.c: use size_t for header lengths
Continue walking the code path for the >4GB `hash-object --literally`
test. The `hash_object_file_literally()` function internally uses both
`hash_object_file()` and `write_object_file_prepare()`. Both function
signatures use `unsigned long` rather than `size_t` for the mem buffer
sizes. Use `size_t` instead, for LLP64 compatibility.

While at it, convert those function's object's header buffer length to
`size_t` for consistency. The value is already upcast to `uintmax_t` for
print format compatibility.

Note: The hash-object test still does not pass. A subsequent commit
continues to walk the call tree's lower level hash functions to identify
further fixes.

Signed-off-by: Philip Oakley <philipoakley@iee.email>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-13 02:58:28 +00:00
Junio C Hamano
7c63b245cd Merge branch 'ps/cat-file-remote-object-info' into seen
The `remote-object-info` command has been added to `git cat-file
--batch-command`, allowing clients to request object metadata
(currently size) from a remote server via protocol v2 without
downloading the entire object.

The client dynamically filters format placeholders based on
server-advertised capabilities and safely returns empty strings for
inapplicable or unsupported fields.

* ps/cat-file-remote-object-info:
  cat-file: make remote-object-info allow-list dynamic
  cat-file: validate remote atoms with allow_list
  cat-file: add remote-object-info to batch-command
  transport: add client support for object-info
  serve: advertise object-info feature
  fetch-pack: move fetch initialization
  connect: refactor packet writing
  fetch-pack: move function to connect.c
  t1006: split test utility functions into new "lib-cat-file.sh"
  cat-file: add declaration of variable i inside its for loop
  git-compat-util: add strtoul_ul() with error handling
  transport-helper: fix memory leak of helper on disconnect
2026-06-12 15:58:18 -07:00
Junio C Hamano
a1d02357f1 Merge branch 'ob/more-repo-config-values' into jch
Many core configuration variables have been migrated from global
variables into 'repo_config_values' to tie them to a specific
repository instance, avoiding cross-repository state leakage.

* ob/more-repo-config-values:
  environment: move "warn_on_object_refname_ambiguity" into `struct repo_config_values`
  environment: move "sparse_expect_files_outside_of_patterns" into `struct repo_config_values`
  environment: move "core_sparse_checkout_cone" into `struct repo_config_values`
  environment: move "precomposed_unicode" into `struct repo_config_values`
  environment: move "pack_compression_level" into `struct repo_config_values`
  environment: move `zlib_compression_level` into `struct repo_config_values`
  environment: move "check_stat" into `struct repo_config_values`
  environment: move "trust_ctime" into `struct repo_config_values`
2026-06-12 15:57:14 -07:00
Eric Ju
82b4ab8001 cat-file: add remote-object-info to batch-command
Since the `info` command in `cat-file --batch-command` prints object
info for a given object, it is natural to add another command in
`cat-file --batch-command` to print object info for a given object
from a remote.

Add `remote-object-info` to `cat-file --batch-command`.

While `info` takes object ids one at a time, this creates
overhead when making requests to a server. So `remote-object-info`
instead can take multiple object ids at once.

The `cat-file --batch-command` command is generally implemented in
the following manner:

 - Receive and parse input from user
 - Call respective function attached to command
 - Get object info, print object info

In --buffer mode, this changes to:

 - Receive and parse input from user
 - Store respective function attached to command in a queue
 - After flush, loop through commands in queue
    - Call respective function attached to command
    - Get object info, print object info

Notice how the getting and printing of object info is accomplished one
at a time. As described above, this creates a problem for making
requests to a server. Therefore, `remote-object-info` is implemented in
the following manner:

 - Receive and parse input from user
 If command is `remote-object-info`:
    - Get object info from remote
    - Loop through and print each object info
 Else:
    - Call respective function attached to command
    - Parse input, get object info, print object info

And finally for --buffer mode `remote-object-info`:
 - Receive and parse input from user
 - Store respective function attached to command in a queue
 - After flush, loop through commands in queue:
    If command is `remote-object-info`:
        - Get object info from remote
        - Loop through and print each object info
    Else:
        - Call respective function attached to command
        - Get object info, print object info

To summarize, `remote-object-info` gets object info from the remote and
then loops through the object info passed in, printing the info.

In order for `remote-object-info` to avoid remote communication
overhead in the non-buffer mode, the objects are passed in as such:

remote-object-info <remote> <oid> <oid> ... <oid>

rather than

remote-object-info <remote> <oid>
remote-object-info <remote> <oid>
...
remote-object-info <remote> <oid>

Helped-by: Jonathan Tan <jonathantanmy@google.com>
Helped-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Eric Ju <eric.peijian@gmail.com>
Signed-off-by: Pablo Sabater <pabloosabaterr@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-08 23:30:15 +09:00
Johannes Schindelin
f3aeae983a odb: use size_t for object_info.sizep and the size APIs
When `js/objects-larger-than-4gb-on-windows` widened the streaming,
index-pack and unpack-objects code paths, in the interest of keeping the
patches somewhat reasonably-sized, it left the public ODB API still
typed in `unsigned long`. In particular `struct object_info::sizep` and
the four wrappers built on top of it (`odb_read_object`,
`odb_read_object_peeled`, `odb_read_object_info`, `odb_pretend_object`)
still return the unpacked size through `unsigned long *`, so on Windows
`cat-file -s` and the `git add` / `git status` paths for a >4 GiB blob
silently cap at 4 GiB.

Widen the field and the four wrappers. The previous commits already
widened the `unpack_entry()` cascade and pack-objects' in-core size
accessors, so most of the cascade arrives here with no further work: the
temporary shims in `packed_object_info_with_index_pos()` and in
`unpack_entry()`'s delta-base recovery path go away, the two
`SET_SIZE(entry, cast_size_t_to_ulong(canonical_size))` calls in
`check_object()` and the matching one in `drop_reused_delta()` collapse
to plain `SET_SIZE`, and `oe_get_size_slow()`'s tail
`cast_size_t_to_ulong()` is gone too.

What remains narrow are the boundaries this series does not
intend to touch: the diff, blame, textconv and fast-import machinery.

Even so, this patch is unfortunately quite large.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-04 09:51:22 +02:00
Olamide Caleb Bello
8cd7402acc environment: move "pack_compression_level" into struct repo_config_values
The `pack_compression_level` configuration is currently stored in the
global variable `pack_compression_level`, which makes it shared across
repository instances within a single process.

Store it instead in `repo_config_values`, where eagerly‑parsed
repository configuration lives. `pack_compression_level` is parsed
eagerly because it influences packfile compression, a core operation
where a lazy parse could cause inconsistent behavior and hamper
libification. This preserves the existing eager‑parsing behavior while
tying the value to the repository from which it was read, avoiding
cross‑repository state leakage and continuing the effort to reduce
reliance on global configuration state.

Update all references to use `repo_config_values()`.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Usman Akinyemi <usmanakinyemi202@gmail.com>
Signed-off-by: Olamide Caleb Bello <belkid98@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-03 08:36:48 +09:00
Olamide Caleb Bello
e0f86540ab environment: move zlib_compression_level into struct repo_config_values
The `zlib_compression_level` configuration is currently stored in the
global variable `zlib_compression_level`, which makes it shared across
repository instances within a single process.

Store it instead in `repo_config_values`, where eagerly‑parsed
repository configuration lives. `zlib_compression_level` is parsed
eagerly because it determines compression behaviour for objects and
packs – core operations where a lazy parse could lead to unpredictable
results and hinder libification. This preserves the existing
eager‑parsing behavior while tying the value to the repository it
was read from, avoiding cross‑repository state leakage and continuing
the effort to reduce reliance on global configuration state.

Update all references to use `repo_config_values()`.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Usman Akinyemi <usmanakinyemi202@gmail.com>
Signed-off-by: Olamide Caleb Bello <belkid98@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-03 08:36:48 +09:00
Patrick Steinhardt
b9906a645c object-file: refactor writing objects to use loose source
The "object-file" subsystem still hosts the majority of logic used to
write loose objects. Eventually, we'll want to move this logic into
"odb/source-loose.c", but this isn't yet easily possible because a lot
of the writing logic is still being shared with `force_object_loose()`.

We will eventually detangle this logic so that we can indeed move all of
it into the "loose" source. Meanwhile though, refactor the code so that
it operates on a `struct odb_source_loose` directly to already make the
dependency explicit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-01 18:47:18 +09:00
Patrick Steinhardt
04a6e84cbd odb/source-loose: wire up write_object() callback
Move `odb_source_loose_write_object()` from "object-file.c" into
"odb/source-loose.c" and wire it up as the `write_object()` callback of
the loose source.

As in preceding commits, this requires us to expose a couple of generic
functions from "object-file.c" as they are used in both subsystems now.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-01 18:47:18 +09:00
Patrick Steinhardt
87588db131 loose: refactor object map to operate on struct odb_source_loose
While the loose object map functions in "loose.c" accept a generic
`struct odb_source *`, they always expect this to be the "files"
backend. Furthermore, the subsystem doesn't even care about the "files"
backend, but only uses it as a stepping stone to get to the "loose"
backend.

This assumption is implicit and thus not immediately obvious. Refactor
the interfaces to instead operate on a `struct odb_source_loose`
instead, which eliminates the implicit dependency and unnecessary detour
via the "files" source.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-01 18:47:18 +09:00
Patrick Steinhardt
d8b9e8bb23 odb/source-loose: wire up freshen_object() callback
Move `odb_source_loose_freshen_object()` from "object-file.c" into
"odb/source-loose.c" and wire it up as the `freshen_object()` callback
of the loose source.

As part of the move, `check_and_freshen_source()` is inlined into the
callback function, as it has no other callers anymore.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-01 18:47:18 +09:00
Patrick Steinhardt
86f7ab5a1f odb/source-loose: drop odb_source_loose_has_object()
The function `odb_source_loose_has_object()` checks whether a specific
object exists as a loose object on disk by using lstat(3p). This
interface is somewhat redundant, as we typically check for object
existence in a generic way via `odb_source_read_object_info()`.

In fact, these two calls are redundant in case the latter is called in a
specific way: when called without an object info request and without the
`OBJECT_INFO_QUICK` flag, then we will end up doing the same call to
lstat(3p) in `read_object_info_from_path()`.

Drop the function and adapt callers to instead use the generic
interface so that its calling conventions align with that of other
sources.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-01 18:47:18 +09:00
Patrick Steinhardt
2ade08ac29 odb/source-loose: wire up count_objects() callback
Move `odb_source_loose_count_objects()` and its associated helpers from
"object-file.c" into "odb/source-loose.c" and wire it up as the
`count_objects()` callback of the loose source.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-01 18:47:18 +09:00
Patrick Steinhardt
8a6da81cc1 odb/source-loose: wire up find_abbrev_len() callback
Move `odb_source_loose_find_abbrev_len()` and its associated helpers
from "object-file.c" into "odb/source-loose.c" and wire it up as the
`find_abbrev_len` callback of the loose source.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-01 18:47:18 +09:00
Patrick Steinhardt
e4f1d9ba57 odb/source-loose: wire up for_each_object() callback
Move `odb_source_loose_for_each_object()` and its associated helpers
from "object-file.c" into "odb/source-loose.c" and wire it up as the
`for_each_object()` callback of the loose source.

Again, as in the preceding commit, we are forced to expose a couple of
functions from "object-file.c" that are now used by both subsystems.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-01 18:47:18 +09:00
Patrick Steinhardt
727a935a71 odb/source-loose: wire up read_object_stream() callback
Move `odb_source_loose_read_object_stream()` and its associated helpers
from "object-file.c" into "odb/source-loose.c" and wire it up as the
`read_object_stream()` callback of the loose source.

As part of the move we are also forced to expose a couple of functions
from "object-file.h" that parse object headers in a somewhat-generic
way, as those functions are now used by both subsystems.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-01 18:47:18 +09:00
Patrick Steinhardt
584338ed92 odb/source-loose: wire up read_object_info() callback
Move `odb_source_loose_read_object_info()` from "object-file.c" into
"odb/source-loose.c" and wire it up as the `read_object_info()` callback
of the loose source. Callers that previously invoked it directly now go
through the generic `odb_source_read_object_info()` interface instead.

The function `read_object_info_from_path()` cannot be moved along with
it because it is still called by `for_each_object_wrapper_cb()`. It is
therefore kept in place, but adjusted to take a loose source to clarify
that it's always operating on this structure.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-01 18:47:17 +09:00
Patrick Steinhardt
a2b7db9bc8 odb/source-loose: wire up reprepare() callback
Move `odb_source_loose_reprepare()` from "object-file.c" into
"odb/source-loose.c" and wire it up as the `reprepare()` callback of the
loose source.

While at it, make `odb_source_loose_clear_cache()` static, as it is no
longer needed outside of its file.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-01 18:47:17 +09:00
Patrick Steinhardt
ead691927b odb/source-loose: start converting to a proper struct odb_source
Start converting `struct odb_source_loose` into a proper pluggable
`struct odb_source` by embedding the base struct and assigning it the
new `ODB_SOURCE_LOOSE` type. Furthermore, wire up lifecycle management
of this source by implementing the `free` callback and taking ownership
of the chdir notifications.

Note that the loose source is not yet functional as a standalone `struct
odb_source`, as it's missing all of the callback implementations. These
will be wired up in subsequent commits.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-01 18:47:17 +09:00
Patrick Steinhardt
1d451ba6fe odb/source-loose: store pointer to "files" instead of generic source
The `struct odb_source_loose` holds a pointer to its owning parent
source. The way that Git is currently structured, this parent is always
the "files" source. In subsequent commits we're going to detangle that
so that the "loose" source doesn't have any owning parent source at all
so that it can be used as a completely standalone source.

Detangling this mess is somewhat intricate though, and is made even more
intricate because it's not always clear which kind of source one is
holding at a specific point in time -- either the parent "files" source,
or the child "loose" source.

Make this relationship more explicit by storing a pointer to the "files"
source instead of storing a pointer to a generic `struct odb_source`.
This will help make subsequent steps a bit clearer.

Note that this is a temporary step, only. At the end of this series
we will have dropped the parent pointer completely.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-01 18:47:17 +09:00
Patrick Steinhardt
514f039c90 odb/source-loose: move loose source into "odb/" subsystem
In subsequent patches we'll be turning `struct odb_source_loose` into a
proper `struct odb_source`. As a first step towards this goal, move its
struct out of "object-file.c" and into "odb/source-loose.c".

This detaches the implementation of the loose object source from the
generic object file code, following the same convention already used by
the "files" and "in-memory" sources.

No functional changes are intended.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-06-01 18:47:17 +09:00
Junio C Hamano
0839e56957 Merge branch 'ps/odb-in-memory'
Add a new odb "in-memory" source that is meant to only hold
tentative objects (like the virtual blob object that represents the
working tree file used by "git blame").

* ps/odb-in-memory:
  t/unit-tests: add tests for the in-memory object source
  odb: generic in-memory source
  odb/source-inmemory: stub out remaining functions
  odb/source-inmemory: implement `freshen_object()` callback
  odb/source-inmemory: implement `count_objects()` callback
  odb/source-inmemory: implement `find_abbrev_len()` callback
  odb/source-inmemory: implement `for_each_object()` callback
  odb/source-inmemory: convert to use oidtree
  oidtree: add ability to store data
  cbtree: allow using arbitrary wrapper structures for nodes
  odb/source-inmemory: implement `write_object_stream()` callback
  odb/source-inmemory: implement `write_object()` callback
  odb/source-inmemory: implement `read_object_stream()` callback
  odb/source-inmemory: implement `read_object_info()` callback
  odb: fix unnecessary call to `find_cached_object()`
  odb/source-inmemory: implement `free()` callback
  odb: introduce "in-memory" source
2026-05-27 14:15:46 +09:00
Junio C Hamano
2f952b81ed Merge branch 'jt/odb-transaction-write'
ODB transaction interface is being reworked to explicitly handle
object writes.

* jt/odb-transaction-write:
  odb/transaction: make `write_object_stream()` pluggable
  object-file: generalize packfile writes to use odb_write_stream
  object-file: avoid fd seekback by checking object size upfront
  object-file: remove flags from transaction packfile writes
  odb: update `struct odb_write_stream` read() callback
  odb/transaction: use pluggable `begin_transaction()`
  odb: split `struct odb_transaction` into separate header
2026-05-27 14:15:45 +09:00
Junio C Hamano
9ebc19b760 Merge branch 'ps/odb-in-memory' into ps/odb-source-loose
* ps/odb-in-memory: (24 commits)
  t/unit-tests: add tests for the in-memory object source
  odb: generic in-memory source
  odb/source-inmemory: stub out remaining functions
  odb/source-inmemory: implement `freshen_object()` callback
  odb/source-inmemory: implement `count_objects()` callback
  odb/source-inmemory: implement `find_abbrev_len()` callback
  odb/source-inmemory: implement `for_each_object()` callback
  odb/source-inmemory: convert to use oidtree
  oidtree: add ability to store data
  cbtree: allow using arbitrary wrapper structures for nodes
  odb/source-inmemory: implement `write_object_stream()` callback
  odb/source-inmemory: implement `write_object()` callback
  odb/source-inmemory: implement `read_object_stream()` callback
  odb/source-inmemory: implement `read_object_info()` callback
  odb: fix unnecessary call to `find_cached_object()`
  odb/source-inmemory: implement `free()` callback
  odb: introduce "in-memory" source
  odb/transaction: make `write_object_stream()` pluggable
  object-file: generalize packfile writes to use odb_write_stream
  object-file: avoid fd seekback by checking object size upfront
  ...
2026-05-21 22:34:55 +09:00
Patrick Steinhardt
449650decf oidtree: add ability to store data
The oidtree data structure is currently only used to store object IDs,
without any associated data. So consequently, it can only really be used
to track which object IDs exist, and we can use the tree structure to
efficiently operate on OID prefixes.

But there are valid use cases where we want to both:

  - Store object IDs in a sorted order.

  - Associated arbitrary data with them.

Refactor the oidtree interface so that it allows us to store arbitrary
payloads within the respective nodes. This will be used in the next
commit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-15 04:50:45 +09:00
Junio C Hamano
2f124686e8 Merge branch 'jt/odb-transaction-write' into ps/odb-in-memory
* jt/odb-transaction-write:
  odb/transaction: make `write_object_stream()` pluggable
  object-file: generalize packfile writes to use odb_write_stream
  object-file: avoid fd seekback by checking object size upfront
  object-file: remove flags from transaction packfile writes
  odb: update `struct odb_write_stream` read() callback
  odb/transaction: use pluggable `begin_transaction()`
  odb: split `struct odb_transaction` into separate header
2026-05-15 04:50:31 +09:00
Justin Tobler
08b6afb2a2 odb/transaction: make write_object_stream() pluggable
How an ODB transaction handles writing objects is expected to vary
between implementations. Introduce a new `write_object_stream()`
callback in `struct odb_transaction` to make this function pluggable.
Rename `index_blob_packfile_transaction()` to
`odb_transaction_files_write_object_stream()` and wire it up for use
with `struct odb_transaction_files` accordingly.

Signed-off-by: Justin Tobler <jltobler@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-15 04:44:40 +09:00
Justin Tobler
45a75d6187 object-file: generalize packfile writes to use odb_write_stream
The `index_blob_packfile_transaction()` function streams blob data
directly from an fd. This makes it difficult to reuse as part of a
generic transactional object writing interface.

Refactor the packfile write path to operate on a `struct
odb_write_stream`, allowing callers to supply data from arbitrary
sources.

Signed-off-by: Justin Tobler <jltobler@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-15 04:44:40 +09:00
Justin Tobler
d4c92e2ac9 object-file: avoid fd seekback by checking object size upfront
In certain scenarios, Git handles writing blobs that exceed
"core.bigFileThreshold" differently by streaming the object directly
into a packfile. When there is an active ODB transaction, these blobs
are streamed to the same packfile instead of using a separate packfile
for each. If "pack.packSizeLimit" is configured and streaming another
object causes the packfile to exceed the configured limit, the packfile
is truncated back to the previous object and the object write is
restarted in a new packfile.

This works fine, but requires the fd being read from to save a
checkpoint so it becomes possible to rewind the input source via seeking
back to a known offset at the beginning. In a subsequent commit, blob
streaming is converted to use `struct odb_write_stream` as a more
generic input source instead of an fd which doesn't provide a mechanism
for rewinding.

For this use case though, rewinding the fd is not strictly necessary
because the inflated size of the object is known and can be used to
approximate whether writing the object would cause the packfile to
exceed the configured limit prior to writing anything. These blobs
written to the packfile are never deltified thus the size difference
between what is written versus the inflated size is due to zlib
compression. While this does prevent packfiles from being filled to the
potential maximum is some cases, it should be good enough and still
prevents the packfile from exceeding any configured limit.

Use the inflated blob size to determine whether writing an object to a
packfile will exceed the configured "pack.packSizeLimit".

Signed-off-by: Justin Tobler <jltobler@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-15 04:44:40 +09:00
Justin Tobler
8a1f5ecf28 object-file: remove flags from transaction packfile writes
The `index_blob_packfile_transaction()` function handles streaming a
blob from an fd to compute its object ID and conditionally writes the
object directly to a packfile if the INDEX_WRITE_OBJECT flag is set. A
subsequent commit will make these packfile object writes part of the
transaction interface. Consequently, having the object write be
conditional on this flag is a bit awkward.

In preparation for this change, introduce a dedicated
`hash_blob_stream()` helper that only computes the OID from a `struct
odb_write_stream`. This is invoked by `index_fd()` instead when the
INDEX_WRITE_OBJECT is not set. The object write performed via
`index_blob_packfile_transaction()` is made unconditional accordingly.

Signed-off-by: Justin Tobler <jltobler@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-15 04:44:40 +09:00
Justin Tobler
970f63519e odb: update struct odb_write_stream read() callback
The `read()` callback used by `struct odb_write_stream` currently
returns a pointer to an internal buffer along with the number of bytes
read. This makes buffer ownership unclear and provides no way to report
errors.

Update the interface to instead require the caller to provide a buffer,
and have the callback return the number of bytes written to it or a
negative value on error. While at it, also move the `struct
odb_write_stream` definition to "odb/streaming.h". Call sites are
updated accordingly.

Signed-off-by: Justin Tobler <jltobler@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-15 04:44:40 +09:00
Justin Tobler
5f6744d3eb odb: split struct odb_transaction into separate header
The current ODB transaction interface is colocated with other ODB
interfaces in "odb.{c,h}". Subsequent commits will expand `struct
odb_transaction` to support write operations on the transaction
directly. To keep things organized and prevent "odb.{c,h}" from becoming
more unwieldy, split out `struct odb_transaction` into a separate
header.

Signed-off-by: Justin Tobler <jltobler@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-15 04:44:39 +09:00
Johannes Schindelin
606c192380 odb, packfile: use size_t for streaming object sizes
The odb_read_stream structure uses unsigned long for the size field,
which is 32-bit on Windows even in 64-bit builds. When streaming
objects larger than 4GB, the size would be truncated to zero or an
incorrect value, resulting in empty files being written to disk.

Change the size field in odb_read_stream to size_t and introduce
unpack_object_header_sz() to return sizes via size_t pointer. Since
object_info.sizep remains unsigned long for API compatibility, use
temporary variables where the types differ, with comments noting the
truncation limitation for code paths that still use unsigned long.

Widening the producers to size_t in this way introduces a handful of
silent size_t -> unsigned long narrowings on Windows, all in
builtin/pack-objects.c, where the consumers are still typed
unsigned long. Make those narrowings explicit with
cast_size_t_to_ulong() so they assert loudly the moment an object
actually exceeds ULONG_MAX bytes:

  - oe_get_size_slow() returns unsigned long but holds a size_t
    locally; cast at the return.
  - write_reuse_object() passes a size_t into check_pack_inflate(),
    whose expect parameter is unsigned long; cast at the call.
  - check_object() routes a size_t through SET_SIZE() and
    SET_DELTA_SIZE(), both of which take unsigned long via
    oe_set_size() / oe_set_delta_size(); cast at the three call
    sites in the OBJ_OFS_DELTA / OBJ_REF_DELTA branches and in the
    non-delta default arm.

The cast-only treatment is deliberately a stop-gap. Properly
widening oe_set_size, oe_get_size_slow's return type,
check_pack_inflate's expect parameter, object_info.sizep,
patch_delta, and the OE_SIZE_BITS bit-fields cascades into a series
that is too large to be reviewable, so the proper widening is
deferred to a follow-up topic. Until then,
cast_size_t_to_ulong() at least makes the truncation explicit at
the source: it documents the boundary, and on a 64-bit non-Windows
platform it is a no-op.

This was originally authored by LordKiRon <https://github.com/LordKiRon>,
who preferred not to reveal their real name and therefore agreed that I
take over authorship.

Helped-by: Torsten Bögershausen <tboegi@web.de>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-09 11:25:31 +09:00
Johannes Schindelin
d05d666977 git-zlib: handle data streams larger than 4GB
On Windows, zlib's `uLong` type is 32-bit even on 64-bit systems. When
processing data streams larger than 4GB, the `total_in` and `total_out`
fields in zlib's `z_stream` structure wrap around, which caused the
sanity checks in `zlib_post_call()` to trigger `BUG()` assertions.

The git_zstream wrapper now tracks its own 64-bit totals rather than
copying them from zlib. The sanity checks compare only the low bits,
using `maximum_unsigned_value_of_type(uLong)` to mask appropriately for
the platform's `uLong` size.

This is based on work by LordKiRon in git-for-windows#6076.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-09 11:25:31 +09:00
Junio C Hamano
fe4ab2e698 Merge branch 'jt/index-fd-wo-repo-regression-fix-maint'
During Git 2.52 timeframe, we broke streaming computation of object
hash outside a repository, which has been corrected.

* jt/index-fd-wo-repo-regression-fix-maint:
  object-file: avoid ODB transaction when not writing objects
2026-04-08 10:20:51 -07:00
Junio C Hamano
9797fed6ce Merge branch 'ps/odb-cleanup'
Various code clean-up around odb subsystem.

* ps/odb-cleanup:
  odb: drop unneeded headers and forward decls
  odb: rename `odb_has_object()` flags
  odb: use enum for `odb_write_object` flags
  odb: rename `odb_write_object()` flags
  treewide: use enum for `odb_for_each_object()` flags
  CodingGuidelines: document our style for flags
2026-04-08 10:19:17 -07:00
Justin Tobler
7d8727ff0b object-file: avoid ODB transaction when not writing objects
In ce1661f9da (odb: add transaction interface, 2025-09-16), existing
ODB transaction logic is adapted to create a transaction interface
at the ODB layer. The intent here is for the ODB transaction
interface to eventually provide an object source agnostic means to
manage transactions.

An unintended consequence of this change though is that
`object-file.c:index_fd()` may enter the ODB transaction path even
when no object write is requested. In non-repository contexts, this
can result in a NULL dereference and segfault. One such case occurs
when running git-diff(1) outside of a repository with
"core.bigFileThreshold" forcing the streaming path in `index_fd()`:

        $ echo foo >foo
        $ echo bar >bar
        $ git -c core.bigFileThreshold=1 diff -- foo bar

In this scenario, the caller only needs to compute the object ID. Object
hashing does not require an ODB, so starting a transaction is both
unnecessary and invalid.

Fix the bug by avoiding the use of ODB transactions in `index_fd()` when
callers are only interested in computing the object hash.

Reported-by: Luca Stefani <luca.stefani.ge1@gmail.com>
Signed-off-by: Justin Tobler <jltobler@gmail.com>
[jc: adjusted to fd13909e (Merge branch 'jt/odb-transaction', 2025-10-02)]
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-04-07 17:32:36 -07:00
Junio C Hamano
7b6d0cd51b Merge branch 'ps/fsck-wo-the-repository'
Internals of "git fsck" have been refactored to not depend on the
global `the_repository` variable.

* ps/fsck-wo-the-repository:
  builtin/fsck: stop using `the_repository` in error reporting
  builtin/fsck: stop using `the_repository` when marking objects
  builtin/fsck: stop using `the_repository` when checking packed objects
  builtin/fsck: stop using `the_repository` with loose objects
  builtin/fsck: stop using `the_repository` when checking reflogs
  builtin/fsck: stop using `the_repository` when checking refs
  builtin/fsck: stop using `the_repository` when snapshotting refs
  builtin/fsck: fix trivial dependence on `the_repository`
  fsck: drop USE_THE_REPOSITORY
  fsck: store repository in fsck options
  fsck: initialize fsck options via a function
  fetch-pack: move fsck options into function scope
2026-04-07 14:59:26 -07:00
Patrick Steinhardt
c63911b052 odb: rename odb_has_object() flags
Rename `odb_has_object()` flags to be properly prefixed with the
function name.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-03-31 20:43:14 -07:00
Patrick Steinhardt
b2d421ece6 odb: use enum for odb_write_object flags
We've got a couple of functions that accept `odb_write_object()` flags,
but all of them accept the flags as an `unsigned` integer. In fact, we
don't even have an `enum` for the flags field.

Introduce this `enum` and adapt functions accordingly according to our
coding style.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-03-31 20:43:13 -07:00
Patrick Steinhardt
ff2e9d85d6 odb: rename odb_write_object() flags
Rename `odb_write_object()` flags to be properly prefixed with the
function name.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-03-31 20:43:13 -07:00
Junio C Hamano
8e2964dc89 Merge branch 'ps/object-counting'
The logic to count objects has been cleaned up.

* ps/object-counting:
  odb: introduce generic object counting
  odb/source: introduce generic object counting
  object-file: generalize counting objects
  object-file: extract logic to approximate object count
  packfile: extract logic to count number of objects
  odb: stop including "odb/source.h"
2026-03-25 12:58:05 -07:00
Patrick Steinhardt
3749853908 fsck: store repository in fsck options
The fsck subsystem relies on `the_repository` quite a bit. While we
could of course explicitly pass a repository down the callchain, we
already have a `struct fsck_options` that we pass to almost all
functions.

Extend the options to also store the repository to make it readily
available.

Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-03-23 08:33:10 -07:00
Patrick Steinhardt
f223609026 fsck: initialize fsck options via a function
We initialize the `struct fsck_options` via a set of macros, often in
global scope. In the next commit though we're about to introduce a new
repository field to the options that must be initialized, and naturally
we don't have a repo other than `the_repository` available in this
scope.

Refactor the code to instead intrdouce a new `fsck_options_init()`
function that initializes the options for us and move initialization
into function scope.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-03-23 08:33:10 -07:00
Patrick Steinhardt
ab3ab1038d object-name: move logic to compute loose abbreviation length
The function `repo_find_unique_abbrev_r()` takes as input an object ID
as well as a minimum object ID length and returns the minimum required
prefix to make the object ID unique.

The logic that computes the abbreviation length for loose objects is
deeply tied to the loose object storage format. As such, it would fail
in case a different object storage format was used.

Prepare for making this logic generic to the backend by moving the logic
into a new `odb_source_loose_find_abbrev_len()` function that is part of
"object-file.c".

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-03-20 13:16:42 -07:00
Patrick Steinhardt
284b7862be object-name: move logic to iterate through loose prefixed objects
The logic to iterate through loose objects that have a certain prefix is
currently hosted in "object-name.c". This logic reaches into specifics
of the loose object source, so it breaks once a different backend is
used for the object storage.

Move the logic to iterate through loose objects with a prefix into
"object-file.c". This is done by extending the for-each-object options
to support an optional prefix that is then honored by the loose source.
Naturally, we'll also have this support in the packfile store. This is
done in the next commit.

Furthermore, there are no users of the loose cache outside of
"object-file.c" anymore. As such, convert `odb_source_loose_cache()` to
have file scope.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-03-20 13:16:42 -07:00
Patrick Steinhardt
cfd575f0a9 odb: introduce struct odb_for_each_object_options
The `odb_for_each_object()` function only accepts a bitset of flags. In
a subsequent commit we'll want to change object iteration to also
support iterating over only those objects that have a specific prefix.
While we could of course add the prefix to the function signature, or
alternatively introduce a new function, both of these options don't
really seem to be that sensible.

Instead, introduce a new `struct odb_for_each_object_options` that can
be passed to a new `odb_for_each_object_ext()` function. Splice through
the options structure into the respective object database sources.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-03-20 13:16:41 -07:00
Junio C Hamano
7f75767554 Merge branch 'ps/object-counting' into ps/odb-generic-object-name-handling
* ps/object-counting:
  object-file: fix sparse 'plain integer as NULL pointer' error
  odb: introduce generic object counting
  odb/source: introduce generic object counting
  object-file: generalize counting objects
  object-file: extract logic to approximate object count
  packfile: extract logic to count number of objects
  odb: stop including "odb/source.h"
2026-03-20 13:16:09 -07:00