Commit Graph

13912 Commits

Author SHA1 Message Date
Johannes Schindelin
7678ab52b1 Merge branch 'phase-out-reset-stdin'
This topic branch re-adds the deprecated --stdin/-z options to `git
reset`. Those patches were overridden by a different set of options in
the upstream Git project before we could propose `--stdin`.

We offered this in MinGit to applications that wanted a safer way to
pass lots of pathspecs to Git, and these applications will need to be
adjusted.

Instead of `--stdin`, `--pathspec-from-file=-` should be used, and
instead of `-z`, `--pathspec-file-nul`.
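
A minimal sketch of what that adjustment might look like for a caller written in Python. The helper names `build_reset_invocation` and `reset_paths` are hypothetical; only the two options are taken from the text above.

```python
import subprocess
from typing import Iterable, List, Tuple


def build_reset_invocation(paths: Iterable[str]) -> Tuple[List[str], bytes]:
    """Return the argv and stdin payload for a NUL-separated pathspec reset.

    Hypothetical helper: it only assembles the command, so it can be
    inspected without touching a repository.
    """
    argv = ["git", "reset", "--pathspec-from-file=-", "--pathspec-file-nul"]
    # With --pathspec-file-nul, each pathspec is terminated by a NUL byte,
    # so paths containing spaces or newlines need no quoting.
    payload = b"".join(p.encode("utf-8") + b"\0" for p in paths)
    return argv, payload


def reset_paths(paths: Iterable[str], cwd: str = ".") -> None:
    """Run the reset in `cwd`, feeding the pathspecs on standard input."""
    argv, payload = build_reset_invocation(paths)
    subprocess.run(argv, input=payload, cwd=cwd, check=True)
```

A caller that previously piped paths to `git reset --stdin -z` would call `reset_paths(["a.txt", "dir/b c.txt"])` instead.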

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:58:05 +00:00
Johannes Schindelin
117dd17b08 reset: reinstate support for the deprecated --stdin option
The `--stdin` option was a well-established paradigm in other commands,
therefore we implemented it in `git reset` for use by Visual Studio.

Unfortunately, upstream Git decided that it is time to introduce
`--pathspec-from-file` instead.

To keep backwards-compatibility for some grace period, we therefore
reinstate the `--stdin` option on top of the `--pathspec-from-file`
option, but mark it firmly as deprecated.

Helped-by: Victoria Dye <vdye@github.com>
Helped-by: Matthew John Cheetham <mjcheetham@outlook.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:58:05 +00:00
Ben Boeckel
54cb38dd2f clean: suggest using core.longPaths if paths are too long to remove
On Windows, git repositories may have extra files which need to be cleaned
(e.g., a build directory) that may be arbitrarily deep. Suggest using
`core.longPaths` if such situations are encountered.

Fixes: #2715
Signed-off-by: Ben Boeckel <mathstuf@gmail.com>
2026-06-26 08:58:04 +00:00
Johannes Schindelin
ffe582bf78 clean: make use of FSCache
The `git clean` command needs to enumerate plenty of files and
directories, and can therefore benefit from the FSCache.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:58:04 +00:00
Ben Peart
550881dc74 fscache: fscache takes an initial size
Update enable_fscache() to take an optional initial size parameter which is
used to initialize the hashmap so that it can avoid having to rehash as
additional entries are added.

Add a separate disable_fscache() macro to make the code clearer and easier
to read.

Signed-off-by: Ben Peart <benpeart@microsoft.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:58:04 +00:00
Ben Peart
60e61c237f status: disable and free fscache at the end of the status command
At the end of the status command, disable and free the fscache so that we
don't leak the memory and so that we can dump the fscache statistics.

Signed-off-by: Ben Peart <benpeart@microsoft.com>
2026-06-26 08:58:04 +00:00
Takuto Ikuta
bca9b4e589 checkout.c: enable fscache for checkout again
This is a retry of #1419.

I added a flush_fscache macro to flush cached stats after disk writing,
with tests for the regressions reported in #1438 and #1442.

git checkout checks each file path in sorted order, so cache flushing does not
make performance worse unless we have a large number of modified files in
a directory containing many files.

Using the chromium repository, I tested `git checkout .` performance when I
delete 10 files in different directories.
With this patch:
TotalSeconds: 4.307272
TotalSeconds: 4.4863595
TotalSeconds: 4.2975562
Avg: 4.36372923333333

Without this patch:
TotalSeconds: 20.9705431
TotalSeconds: 22.4867685
TotalSeconds: 18.8968292
Avg: 20.7847136
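
The averages above can be re-derived from the six timings. The sketch below recomputes them and the resulting ratio, which works out to roughly 4.8 times faster with the patch. The variable names are illustrative only.

```python
from statistics import mean

# TotalSeconds values copied from the measurements above.
with_patch = [4.307272, 4.4863595, 4.2975562]
without_patch = [20.9705431, 22.4867685, 18.8968292]

avg_with = mean(with_patch)
avg_without = mean(without_patch)
# Ratio of the two averages: how many times faster the patched run is.
speedup = avg_without / avg_with

print(f"with: {avg_with:.4f} s, without: {avg_without:.4f} s, "
      f"speedup: {speedup:.2f}x")
```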

I confirmed this patch passed all tests in t/ with core_fscache=1.

Signed-off-by: Takuto Ikuta <tikuta@chromium.org>
2026-06-26 08:58:04 +00:00
Jeff Hostetler
73685993c3 add: use preload-index and fscache for performance
Teach "add" to use preload-index and fscache features
to improve performance on very large repositories.

During an "add", a call is made to run_diff_files()
which calls check_remove() for each index-entry.  This
calls lstat().  On Windows, the fscache code intercepts
the lstat() calls and builds a private cache using the
FindFirst/FindNext routines, which are much faster.
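
The idea can be illustrated outside of Git with a rough Python analogy. This is not the fscache code, and the class name `DirListingCache` and its `scans` counter are made up for the sketch. One directory enumeration answers every later per-file query in that directory.

```python
import os
from typing import Dict


class DirListingCache:
    """Answer lstat()-style queries from one directory scan per directory."""

    def __init__(self) -> None:
        self._dirs: Dict[str, Dict[str, os.stat_result]] = {}
        self.scans = 0  # number of directory enumerations performed

    def lstat(self, path: str) -> os.stat_result:
        parent, name = os.path.split(os.path.abspath(path))
        if parent not in self._dirs:
            # Enumerate the whole directory once and remember every entry.
            self.scans += 1
            with os.scandir(parent) as entries:
                self._dirs[parent] = {
                    e.name: e.stat(follow_symlinks=False) for e in entries
                }
        try:
            return self._dirs[parent][name]
        except KeyError:
            raise FileNotFoundError(path) from None
```

Looking up many files in the same directory then costs a single scan, which is where the saving comes from. The cache is only valid while nothing writes to the directory.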

Somewhat independent of this is the preload-index code,
which distributes some of the start-up costs across
multiple threads.

We need to keep the call to read_cache() before parsing the
pathspecs (and hence cannot use the pathspecs to limit any preload)
because parse_pathspec() is using the index to determine whether a
pathspec is, in fact, in a submodule. If we would not read the index
first, parse_pathspec() would not error out on a path that is inside
a submodule, and t7400-submodule-basic.sh would fail with

	not ok 47 - do not add files from a submodule

We still want the nice preload performance boost, though, so we simply
call read_cache_preload(&pathspecs) after parsing the pathspecs.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:58:04 +00:00
Karsten Blees
6cabec8b8c mingw: add infrastructure for read-only file system level caches
Add a macro to mark code sections that only read from the file system,
along with a config option and documentation.

This facilitates implementation of relatively simple file system level
caches without the need to synchronize with the file system.

Enable read-only sections for 'git status' and preload_index.

Signed-off-by: Karsten Blees <blees@dcon.de>
2026-06-26 08:58:04 +00:00
Johannes Schindelin
f028e7203e Continue improving support for 4GB+ packs/clones/objects (#6289)
This PR contains a branch thicket on top of v2.55.0-rc1 (i.e. ready to
go upstream) to continue the bulk of the `unsigned long` -> `size_t`
transformation.

Since all of these changes have no impact on the currently-working
functionality for <4GB objects/packs/clones (modulo bugs, that is 😄), I
would like to merge this before v2.55.0-rc2, still: The risk of
introducing a regression is negligible, the chance for fixing the
majority of problems with large clones is high.
2026-06-26 08:58:03 +00:00
Johannes Schindelin
442d369390 Merge branch 'dont-clean-junctions'
This topic branch teaches `git clean` to respect NTFS junctions and Unix
bind mounts: it will now stop at those boundaries.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:58:02 +00:00
Johannes Schindelin
3ad6222635 Merge pull request #1897 from piscisaureus/symlink-attr
Specify symlink type in .gitattributes
2026-06-26 08:58:02 +00:00
Johannes Schindelin
f1ba0e8c15 credential-cache: handle ECONNREFUSED gracefully (#5329)
I should probably add some tests for this.
2026-06-26 08:58:01 +00:00
Johannes Schindelin
e36cd7e38a fast-import: drop the six size casts in the object-read paths
Continue the size_t evacuation. fast-import's helper
gfi_unpack_entry() and the five size-handling sites that feed off
it (store_object()'s deltalen, load_tree(), parse_from_existing(),
the inline gfi_unpack_entry() caller in parse_objectish(),
cat_blob(), and dereference()) all carry size_t-shaped values from
the odb / unpack_entry() APIs through cast_size_t_to_ulong()
bridges into unsigned long locals.

With the producers (odb_read_object(), odb_read_object_peeled(),
unpack_entry()) and the consumers it feeds (the zlib avail_in
field from a prior commit, encode_in_pack_object_header()'s
uintmax_t parameter, parse_from_commit()'s widened size parameter)
all size_t-ready, the bridges and casts go away in one pass.
gfi_unpack_entry() now writes into the caller's size_t directly,
and the six locals collapse to plain size_t declarations.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:55 +00:00
Johannes Schindelin
624ede0fb4 pack-objects: drop the last size shim in write_no_reuse_object()
Continue the size_t evacuation that this series and the merged
js/objects-larger-than-4gb-on-windows topic are advancing for
>4 GiB objects on Windows: with the odb readers and the zlib
helpers reached from do_compress() now widened end-to-end, the
last cast_size_t_to_ulong() shim in this function can be removed,
and do_compress() itself can carry the new size type through.

Two cast_size_t_to_ulong() shims remain in this file; they feed
the tree-walk API, which is still narrow and is a separate
widening topic.

write_no_reuse_object()'s return type and the hashfile API are
still narrow but unchanged in observable behaviour: on 64-bit
Linux ulong coincides with size_t, and on Windows these were the
narrow fenceposts the prior topics deliberately left in place.
Their widening is left to follow-ups touching the hashfile API
and the write_object() caller chain.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:55 +00:00
Johannes Schindelin
4a47d0787c pack-objects: drop cast_size_t_to_ulong shims in try_delta()
Companion to the prior get_delta() cleanup, and the last try_delta()
piece of the >4 GiB delta-path topic. Every consumer that the
function's locals fed has now been widened: SIZE() / DELTA_SIZE() to
size_t (prior topic), the mem_usage out-parameter and delta_cacheable()
earlier in this series, and create_delta() / create_delta_index() in
the immediately preceding commits.

Widen the declaration of trg_size, src_size, sizediff, max_size and
sz to size_t (delta_size joins them on the same line, removing the
size_t delta_size line that the create_delta() widening commit added
as a stop-gap), and drop the two sz_st bridge variables together with
the surrounding cast_size_t_to_ulong() calls. The result is just
"odb_read_object(&sz)" on both reads.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:55 +00:00
Johannes Schindelin
da17ddcf28 pack-objects: drop cast_size_t_to_ulong shims in get_delta()
The two shims that 606c192380 (odb, packfile: use size_t for
streaming object sizes, 2026-05-08) and the subsequent
odb_read_object() widening introduced as scaffolding around
get_delta()'s reads can now disappear: the previous commit widened
diff_delta() to size_t, which was the last narrow consumer in this
function.

Widen size and base_size to size_t outright, drop the size_st /
base_size_st bridging temporaries, and drop the two
cast_size_t_to_ulong() calls. Net change is 4 lines smaller and one
read-then-cast indirection gone from each odb read.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:55 +00:00
Johannes Schindelin
6b2643abce Merge branch 'size-t/unpack-objects'
2026-06-26 08:57:54 +00:00
Johannes Schindelin
77f84ccac1 Merge branch 'size-t/repo'
2026-06-26 08:57:54 +00:00
Johannes Schindelin
9605858656 Merge branch 'size-t/fast-export'
2026-06-26 08:57:54 +00:00
Johannes Schindelin
67eb2fe4a0 Merge branch 'size-t/commit'
2026-06-26 08:57:54 +00:00
Johannes Schindelin
9a9bc5c853 Merge branch 'size-t/tree'
2026-06-26 08:57:54 +00:00
Johannes Schindelin
6cd3029a24 clean: remove mount points when possible
Windows' equivalent to "bind mounts", NTFS junction points, can be
unlinked without affecting the mount target. This is clearly what users
expect to happen when they call `git clean -dfx` in a worktree that
contains NTFS junction points: the junction should be removed, and the
target directory of said junction should be left alone (unless it is
inside the worktree).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Johannes Schindelin
fa930f9e06 unpack-objects: widen the size-passing infrastructure to size_t
Drop the last cast_size_t_to_ulong() in builtin/unpack-objects.c.
With size_t-typed object sizes already coming in via odb_read_object()
and the per-byte varint decode in unpack_one() (widened by
f2063855fb), the rest of the file was the only thing left that still
threaded sizes through unsigned long: struct obj_buffer.size and
struct delta_info.size, get_data() and add_object_buffer(),
add_delta_to_list(), resolve_delta(), resolve_against_held(),
added_object(), write_object(), unpack_non_delta_entry(),
unpack_delta_entry(), and stream_blob().

Widen all of them together. None of those types had a downstream
narrow consumer once odb_write_object() and patch_delta() were
widened earlier, so the change is mechanical: parameter and field
types change, the base_size_st bridge in unpack_delta_entry() and
its cast go away, and odb_read_object() now writes into base_size
directly.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Johannes Schindelin
e31bb25933 repo: drop the inflated-size cast in count_objects()
Continue the size_t evacuation. count_objects() feeds the inflated
size from odb_read_object_info_extended()'s size_t out-parameter
into struct object_values (size_t) and check_largest() (size_t)
through an unsigned long bridge with a cast_size_t_to_ulong()
shim. The bridge was the only narrow link in the chain. Widen the
local, point oi.sizep at it directly, and drop the cast.

parse_object_buffer() still takes unsigned long, so a Windows
narrowing remains at that one call; that is its own follow-up
topic.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Johannes Schindelin
bd956e9dd4 fast-export: drop the export_blob() size cast and widen anonymize_blob()
Mirror of the preceding fast-import sweep. anonymize_blob() writes
strbuf.len (size_t) into its out-parameter, and export_blob()'s
non-anonymize branch reads odb_read_object()'s size_t out-parameter
through a size_st + cast_size_t_to_ulong() bridge into an unsigned
long local; both have been silent on Windows past 4 GiB. Widen the
helper signature and the local, and drop the bridge.

check_object_signature() and parse_object_buffer() still take
unsigned long, so the silent narrowing on Windows just moves from
the local assignment to those call sites; both are separate topics.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Johannes Schindelin
ea9548a2de commit: widen the commit-buffer API to size_t
Continue the migration from `unsigned long` to `size_t`. The `size`
attribute of `struct commit_buffer` is fed either from
`odb_read_object()`'s return value (`size_t`, handled with
`cast_size_t_to_ulong()`) or from `strbuf.len` in
`fake_working_tree_commit()` (silently narrowed today). Widen the field
and a couple of function signatures together, drop the shim in
`repo_get_commit_buffer()`, and move the matching `unsigned long` locals
at the in-tree callers in commit.c (three sites), builtin/replace.c, and
builtin/stash.c (two sites). The remaining callers pass NULL or already
pass a size_t-compatible variable.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Johannes Schindelin
e535a93014 clean: do not traverse mount points
It seems to be not exactly rare on Windows to install NTFS junction
points (the equivalent of "bind mounts" on Linux/Unix) in worktrees,
e.g. to map some development tools into a subdirectory.

In such a scenario, it is pretty horrible if `git clean -dfx` traverses
into the mapped directory and starts to "clean up".

Let's just not do that. Let's make sure before we traverse into a
directory that it is not a mount point (or junction).

This addresses https://github.com/git-for-windows/git/issues/607

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Johannes Schindelin
824c02f12f pack-objects: drop the two tree-walk casts in the preferred-base path
With init_tree_desc() widened in the prior commit, the
size_t-returning odb_read_object_peeled() call in
add_preferred_base() and odb_read_object() call in pbase_tree_get()
can both flow straight through to init_tree_desc() and into the
pbase_tree_cache. Widen pbase_tree_cache.tree_size and the two
local size variables to size_t, drop the size_st bridges, and drop
the two cast_size_t_to_ulong() shims.

This was the last pair of cast_size_t_to_ulong() call sites in
builtin/pack-objects.c, completing the >4 GiB-objects work in that
file that this branch and its predecessors have been pursuing.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Johannes Schindelin
eca45d90fb diff: widen textconv_object() size out-param to size_t
Continue the size_t evacuation. textconv_object() fills its
out-parameter from fill_textconv()'s size_t return through an
unsigned long*; widen the API to match, then take advantage of the
new shape where callers can.

cat-file's 'c' and batch-mode 'c' branches lose their size_ul
bridge variables (one site becomes a direct call, the other
collapses an if/else into a single negated condition that reads as
"try textconv, fall back to a raw read").

blame.c likewise drops the file_size_st bridge in fill_origin_blob()
and hoists final_buf_size_st to bracket both branches in
setup_scoreboard(). The latter keeps a cast_size_t_to_ulong() shim
because struct blame_scoreboard.final_buf_size is still unsigned
long; that field is its own topic.

log.c just widens its local from unsigned long to size_t.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Johannes Schindelin
52da019eaa Introduce helper to create symlinks that knows about index_state
On Windows, symbolic links actually have a type depending on the target:
it can be a file or a directory.

In certain circumstances, this poses problems, e.g. when a symbolic link
is supposed to point into a submodule that is not checked out, so there
is no way for Git to auto-detect the type.

To help with that, we will add support over the course of the next
commits to specify that symlink type via the Git attributes. This
requires an index_state, though, something that Git for Windows'
`symlink()` replacement cannot know about because the function signature
is defined by the POSIX standard and not ours to change.

So let's introduce a helper function to create symbolic links that
*does* know about the index_state.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Johannes Schindelin
1d0e6f9746 packfile, git-zlib: widen use_pack() and zstream avail fields to size_t
Bundling the two widenings: four call sites pass &stream.avail_in
directly to use_pack(), and widening either type fencepost alone
would force a bridge variable at each. Doing both together is the
simpler end state and is the prerequisite for the do_compress()
widening in the next commit, which is what lets
write_no_reuse_object() lose its last cast_size_t_to_ulong() shim.

The unsigned-long locals widened at the other use_pack() callers
(avail / remaining / left) hold pack-window sizes bounded by
core.packedGitWindowSize, so the change is type consistency rather
than a new >4GB capability. git_zstream.avail_in / avail_out
likewise reach zlib's uInt fields only after zlib_buf_cap()'s 1 GiB
cap, so the wrapper already accepted size_t-shaped inputs in
practice.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Johannes Schindelin
2f8a320527 delta: widen create_delta() and diff_delta() to size_t
Last stop in the delta-encoding API widening for >4 GiB blobs on
Windows: with create_delta_index() done in the prior commit and
create_delta()/diff_delta() finished here, every byte count that
crosses delta.h is now size_t. The struct fields they store into
have been size_t since the diff-delta struct widening.

The API change must move with all callers in the same commit (the
build only passes when every &delta_size matches the new size_t*).
Caller updates are kept minimal:

  * builtin/pack-objects.c get_delta() and try_delta(): widen only
    the local delta_size variable; the surrounding unsigned-long
    locals and their cast_size_t_to_ulong() shims are out of scope
    here and will be cleaned up in their own commits.

  * builtin/fast-import.c, diff.c, t/helper/test-pack-deltas.c:
    keep the local unsigned-long delta size (each feeds a still-
    unsigned-long downstream consumer: zlib's avail_in,
    deflate_it(), the test helper's own do_compress()), and bridge
    via a temporary size_t plus cast_size_t_to_ulong(). The new
    casts are paid back in later topics that widen those consumers.

  * t/helper/test-delta.c: widen the local outright (no downstream
    consumer beyond the test's own out_size, which is already
    size_t).

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Johannes Schindelin
f9a6df6aba pack-objects: widen mem_usage and try_delta out-param to size_t
The pair must move together because find_deltas() passes &mem_usage
to try_delta(): widening either alone breaks the type match.

mem_usage accumulates per-object byte counts already computed in
size_t (SIZE() and sizeof_delta_index() reach here through
free_unpacked(), now size_t), and was the last 32-bit-on-Windows
narrowing point in the delta-window memory accounting chain. With
this commit, that chain is internally size_t end-to-end except for
sizeof_delta_index()'s still-narrow return, whose value is bounded
by create_delta_index()'s entries cap.

window_memory_limit (config-driven via git_config_ulong()) stays
unsigned long: it is only compared against mem_usage and promotes.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Johannes Schindelin
819d8a1cfd pack-objects: widen free_unpacked() return to size_t
free_unpacked() sums two byte counts: sizeof_delta_index() and
SIZE(n->entry). The latter has been size_t since the prior topic
"More work supporting objects larger than 4GB on Windows" widened
SIZE() / oe_size() to size_t, so accumulating it into an unsigned
long return was a silent Windows-only truncation on a packing run
with many large objects.

The sole caller (find_deltas()) holds its own mem_usage in an
unsigned long for now and subtracts the return into it, so the new
narrowing happens at that subtraction. find_deltas() and the
matching try_delta() out-parameter are widened next.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Johannes Schindelin
848acf183c pack-objects: widen delta-cache accounting to size_t
These three are a single accounting tuple (the globals tracking
cumulative cached-delta bytes, plus the helper that compares them
against an incoming delta size) and are latently 32-bit on Windows
where unsigned long != size_t: a pack with many large cached deltas
could wrap silently.
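
To make the wrap concrete, here is a small simulation in Python, assuming a 32-bit `unsigned long` next to a 64-bit `size_t` (the situation on 64-bit Windows). The function name `accumulate` and the sample sizes are invented for the illustration.

```python
GIB = 1 << 30


def accumulate(sizes, bits):
    """Sum sizes in an unsigned counter that is `bits` wide, wrapping on overflow."""
    mask = (1 << bits) - 1
    total = 0
    for size in sizes:
        total = (total + size) & mask
    return total


# Three cached deltas of 1.5 GiB each: 4.5 GiB in total.
deltas = [3 * GIB // 2] * 3

as_size_t = accumulate(deltas, 64)  # full 4.5 GiB
as_ulong = accumulate(deltas, 32)   # wraps past 4 GiB, leaving 0.5 GiB

print(as_size_t, as_ulong)
```

The narrow counter reports 512 MiB of cached deltas where 4.5 GiB are actually held, without any error being raised.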

The widening is internally consistent on its own: the additions and
subtractions against delta_cache_size already come from size_t
sources (DELTA_SIZE() returns size_t), and delta_cacheable()'s sole
caller in try_delta() still passes unsigned long, which promotes.

Prerequisite for dropping try_delta()'s cast_size_t_to_ulong()
shims, which becomes possible once create_delta() and diff_delta()
are widened in a later commit.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:53 +00:00
Matthias Aßhauer
17f8832136 credential-cache: handle ECONNREFUSED gracefully
In 245670c (credential-cache: check for windows specific errors, 2021-09-14)
we concluded that on Windows we would always encounter ENETDOWN where we
would expect ECONNREFUSED on POSIX systems, when connecting to unix sockets.
As reported in [1], we do encounter ECONNREFUSED on Windows if the
socket file doesn't exist, but the containing directory does and ENETDOWN if
neither exists. We should handle this case like we do on non-windows systems.

[1] https://github.com/git-for-windows/git/pull/4762#issuecomment-2545498245

This fixes https://github.com/git-for-windows/git/issues/5314

Helped-by: M Hickford <mirth.hickford@gmail.com>
Signed-off-by: Matthias Aßhauer <mha1993@live.de>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:52 +00:00
Johannes Schindelin
e223d7c2d7 survey: clearly note the experimental nature in the output
While this command is definitely something we _want_, chances are that
upstreaming this will require substantial changes.

We still want to be able to experiment with this before that, to focus
on what we need out of this command: To assist with diagnosing issues
with large repositories, as well as to help monitoring the growth and
the associated painpoints of such repositories.

To that end, we are about to integrate this command into
`microsoft/git`, to get the tool into the hands of users who need it
most, with the idea to iterate in close collaboration between these
users and the developers familiar with Git's internals.

However, we will definitely want to avoid letting anybody have the
impression that this command, its exact inner workings, as well as its
output format, are anywhere close to stable. To make that fact utterly
clear (and thereby protect the freedom to iterate and innovate freely
before upstreaming the command), let's mark its output as experimental
in all-caps, as the first thing we do.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:52 +00:00
Derrick Stolee
39e3d7c1d0 survey: add --top=<N> option and config
The 'git survey' builtin provides several detail tables, such as "top
files by on-disk size". The size of these tables defaults to 10,
currently.

Allow the user to specify this number via a new --top=<N> option or the
new survey.top config key.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:52 +00:00
Derrick Stolee
9f1a18302e survey: add report of "largest" paths
Since we are already walking our reachable objects using the path-walk API,
let's now collect lists of the paths that contribute most to different
metrics. Specifically, we care about

 * Number of versions.
 * Total size on disk.
 * Total inflated size (no delta or zlib compression).

This information can be critical to discovering which parts of the
repository are causing the most growth, especially on-disk size. Different
packing strategies might help compress data more efficiently, but the total
inflated size is a representation of the raw size of all snapshots of those
paths. Even when stored efficiently on disk, that size represents how much
information must be processed to complete a command such as 'git blame'.

The exact disk size seems to be not quite robust enough for testing, as
could be seen by the `linux-musl-meson` job consistently failing, possibly
because zlib-ng deflates differently: t8100.4(git survey
(default)) was failing with a symptom like this:

   TOTAL OBJECT SIZES BY TYPE
   ===============================================
   Object Type | Count | Disk Size | Inflated Size
   ------------+-------+-----------+--------------
  -    Commits |    10 |      1523 |          2153
  +    Commits |    10 |      1528 |          2153
         Trees |    10 |       495 |          1706
         Blobs |    10 |       191 |           101
  -       Tags |     4 |       510 |           528
  +       Tags |     4 |       547 |           528

This means: the disk size is unlikely to be something we can verify robustly.
Since zlib-ng seems to increase the disk size of the tags from 510 to
547, which exceeds the inflated size of 528, we cannot even assume that
the disk size is always smaller than the inflated size. We will most
likely want to either skip verifying the
disk size altogether, or go for some kind of fuzzy matching, say, by
replacing `s/ 1[45][0-9][0-9] / ~1.5k /` and `s/ [45][0-9][0-9] / ~½k /`
or something like that.
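
As an illustration of the fuzzy-matching idea, the two substitutions can be written in Python as follows. The function name `fuzz_disk_sizes` is hypothetical, and like a `sed` `s///` without the `g` flag, each pattern replaces only its first match on a line.

```python
import re


def fuzz_disk_sizes(line: str) -> str:
    """Replace disk sizes near 1.5k and 0.5k by coarse placeholders."""
    line = re.sub(r" 1[45][0-9][0-9] ", " ~1.5k ", line, count=1)
    line = re.sub(r" [45][0-9][0-9] ", " ~½k ", line, count=1)
    return line


# The two variants of the Tags row from the failing test output.
zlib_row = "       Tags |     4 |       510 |           528"
zlib_ng_row = "       Tags |     4 |       547 |           528"

print(fuzz_disk_sizes(zlib_row) == fuzz_disk_sizes(zlib_ng_row))
```

After the substitution both variants of a row compare equal, so the expected output no longer depends on which zlib implementation produced the pack.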

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:52 +00:00
Derrick Stolee
6c582555bb survey: add ability to track prioritized lists
In future changes, we will make use of these methods. The intention is to
keep track of the top contributors according to some metric. We don't want
to store all of the entries and do a sort at the end, so track a
constant-size table and remove rows that get pushed out depending on the
chosen sorting algorithm.
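
A rough sketch of such a constant-size table, written in Python rather than C. The class name `TopTable` and its `offer` method are invented for illustration and do not mirror the actual survey code.

```python
from typing import Callable, List, Tuple

Row = Tuple[str, int]


class TopTable:
    """Keep only the `capacity` highest-ranked rows seen so far."""

    def __init__(self, capacity: int,
                 key: Callable[[Row], int] = lambda row: row[1]) -> None:
        self.capacity = capacity
        self.key = key
        self.rows: List[Row] = []  # sorted, largest key first

    def offer(self, row: Row) -> bool:
        """Insert `row` if it ranks high enough; return whether it was kept."""
        pos = len(self.rows)
        # Walk up from the bottom past every row that ranks lower.
        while pos > 0 and self.key(self.rows[pos - 1]) < self.key(row):
            pos -= 1
        if pos >= self.capacity:
            return False  # would land below the last slot
        self.rows.insert(pos, row)
        del self.rows[self.capacity:]  # push out the row that fell off
        return True


table = TopTable(capacity=3)
for entry in [("a", 5), ("b", 9), ("c", 1), ("d", 7), ("e", 3)]:
    table.offer(entry)
```

Memory stays bounded by the table capacity no matter how many rows are offered, which is the point of not collecting everything and sorting at the end.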

Co-authored-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
2026-06-26 08:57:52 +00:00
Derrick Stolee
1dfb8cd108 survey: show progress during object walk
Signed-off-by: Derrick Stolee <stolee@gmail.com>
2026-06-26 08:57:52 +00:00
Derrick Stolee
e10bb93305 survey: summarize total sizes by object type
Now that we have explored objects by count, we can expand that a bit more to
summarize the data for the on-disk and inflated size of those objects. This
information is helpful for diagnosing both why disk space (and perhaps
clone or fetch times) is growing but also why certain operations are slow
because the inflated size of the abstract objects that must be processed is
so large.

Note: zlib-ng is slightly more efficient even at those small sizes. Even
between zlib versions, there are slight differences in compression. To
accommodate that, the tests validate rough approximations rather than
the exact numbers (the test should validate `git survey`,
after all, not zlib).

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2026-06-26 08:57:52 +00:00
Derrick Stolee
343e55fdf1 survey: add object count summary
At the moment, nothing is obvious about the reason for the use of the
path-walk API, but this will become more prevalent in future iterations. For
now, use the path-walk API to sum up the counts of each kind of object.

For example, this is the reachable object summary output for my local repo:

REACHABLE OBJECT SUMMARY
========================
Object Type |  Count
------------+-------
       Tags |   1343
    Commits | 179344
      Trees | 314350
      Blobs | 184030

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2026-06-26 08:57:52 +00:00
Derrick Stolee
1180da077b survey: start pretty printing data in table form
When 'git survey' provides information to the user, this will be presented
in one of two formats: plaintext and JSON. The JSON implementation will be
delayed until the functionality is complete for the plaintext format.

The most important parts of the plaintext format are headers specifying the
different sections of the report and tables providing concrete data.

Create a custom table data structure that allows specifying a list of
strings for the row values. When printing the table, check each column for
the maximum width so we can create a table of the correct size from the
start.
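
The width computation can be sketched in a few lines of Python. This is an illustration of the approach only, not the C data structure, and `format_table` is a made-up name. Columns are right-aligned and separated the way the survey tables in this log are.

```python
from typing import List


def format_table(header: List[str], rows: List[List[str]]) -> List[str]:
    """Render rows of strings as right-aligned columns under a header."""
    table = [header] + rows
    # One pass over every cell gives each column its maximum width.
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]

    def render(row: List[str]) -> str:
        return " | ".join(cell.rjust(w) for cell, w in zip(row, widths))

    rule = "-+-".join("-" * w for w in widths)
    return [render(header), rule] + [render(row) for row in rows]


lines = format_table(
    ["Object Type", "Count"],
    [["Tags", "1343"], ["Commits", "179344"]],
)
print("\n".join(lines))
```

Because all widths are known before the first line is printed, no row ever needs to be re-rendered.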

The table structure is designed to be flexible to the different kinds of
output that will be implemented in future changes.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2026-06-26 08:57:52 +00:00
Jeff Hostetler
364af39455 survey: add command line opts to select references
By default we will scan all references in "refs/heads/", "refs/tags/"
and "refs/remotes/".

Add command line opts to let the user ask for all refs or a subset of them
and to include a detached HEAD.

Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
2026-06-26 08:57:52 +00:00
Jeff Hostetler
532dd1d2a7 survey: stub in new experimental 'git-survey' command
Start work on a new 'git survey' command to scan the repository
for monorepo performance and scaling problems.  The goal is to
measure the various known "dimensions of scale" and serve as a
foundation for adding additional measurements as we learn more
about Git monorepo scaling problems.

The initial goal is to complement the scanning and analysis performed
by the Go-based 'git-sizer' (https://github.com/github/git-sizer) tool.
It is hoped that by creating a builtin command, we may be able to take
advantage of internal Git data structures and code that is not
accessible from Go to gain further insight into potential scaling
problems.

Co-authored-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
2026-06-26 08:57:52 +00:00
Junio C Hamano
511d8b6107 Merge branch 'jc/history-message-prep-fix' into seen
Code clean-up with leakfix for a write file stream.

* jc/history-message-prep-fix:
  history: streamline message preparation and plug file stream leak
2026-06-25 19:51:54 -07:00
Junio C Hamano
dc2a330582 Merge branch 'hn/branch-push-slip-advice' into seen
"git push origin/main" and "git branch origin main" could both be
an obvious typo, in which case offer the obvious typofix.

* hn/branch-push-slip-advice:
  SQUASH??? use test_grep
  push: suggest <remote> <branch> for a slash slip
  branch: suggest <remote>/<branch> on upstream slip
2026-06-25 19:49:56 -07:00
Junio C Hamano
c4bdde67b7 Merge branch 'jt/receive-pack-use-odb-transactions' into seen
git-receive-pack has been refactored to use ODB transaction
interfaces instead of directly managing tmp_objdir for staging
incoming objects, bringing it closer to being ODB backend agnostic.

* jt/receive-pack-use-odb-transactions:
  builtin/receive-pack: stage incoming objects via ODB transactions
  odb/transaction: add transaction env interface
  odb/transaction: propagate commit errors
  odb/transaction: propagate begin errors
  object-file: propagate files transaction errors
  object-file: rename files transaction prepare function
2026-06-25 19:49:56 -07:00