Commit Graph

12776 Commits

Author SHA1 Message Date
Johannes Schindelin
ed886133e6 credential-cache: handle ECONNREFUSED gracefully (#5329)
I should probably add some tests for this.
2025-06-11 08:16:47 +02:00
Johannes Schindelin
acd8c6858e Add experimental 'git survey' builtin (#5174)
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft/git#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
Jeff Hostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2025-06-11 08:16:47 +02:00
Derrick Stolee
81c7a40d30 Add path walk API and its use in 'git pack-objects' (#5171)
This is a follow up to #5157 as well as motivated by the RFC in
gitgitgadget/git#1786.

We have ways of walking all objects, but it is focused on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
2025-06-11 08:16:46 +02:00
Matthias Aßhauer
45289afc36 credential-cache: handle ECONNREFUSED gracefully
In 245670c (credential-cache: check for windows specific errors, 2021-09-14)
we concluded that on Windows we would always encounter ENETDOWN where we
would expect ECONNREFUSED on POSIX systems, when connecting to unix sockets.
As reported in [1], we do encounter ECONNREFUSED on Windows if the
socket file doesn't exist, but the containing directory does and ENETDOWN if
neither exists. We should handle this case like we do on non-windows systems.

[1] https://github.com/git-for-windows/git/pull/4762#issuecomment-2545498245

This fixes https://github.com/git-for-windows/git/issues/5314

Helped-by: M Hickford <mirth.hickford@gmail.com>
Signed-off-by: Matthias Aßhauer <mha1993@live.de>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2025-06-11 08:16:35 +02:00
Johannes Schindelin
a306d2b759 survey: clearly note the experimental nature in the output
While this command is definitely something we _want_, chances are that
upstreaming this will require substantial changes.

We still want to be able to experiment with this before that, to focus
on what we need out of this command: To assist with diagnosing issues
with large repositories, as well as to help monitoring the growth and
the associated painpoints of such repositories.

To that end, we are about to integrate this command into
`microsoft/git`, to get the tool into the hands of users who need it
most, with the idea to iterate in close collaboration between these
users and the developers familar with Git's internals.

However, we will definitely want to avoid letting anybody have the
impression that this command, its exact inner workings, as well as its
output format, are anywhere close to stable. To make that fact utterly
clear (and thereby protect the freedom to iterate and innovate freely
before upstreaming the command), let's mark its output as experimental
in all-caps, as the first thing we do.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2025-06-11 08:16:34 +02:00
Derrick Stolee
bb27ee4887 pack-objects: thread the path-based compression
Adapting the implementation of ll_find_deltas(), create a threaded
version of the --path-walk compression step in 'git pack-objects'.

This involves adding a 'regions' member to the thread_params struct,
allowing each thread to own a section of paths. We can simplify the way
jobs are split because there is no value in extending the batch based on
name-hash the way sections of the object entry array are attempted to be
grouped. We re-use the 'list_size' and 'remaining' items for the purpose
of borrowing work in progress from other "victim" threads when a thread
has finished its batch of work more quickly.

Using the Git repository as a test repo, the p5313 performance test
shows that the resulting size of the repo is the same, but the threaded
implementation gives gains of varying degrees depending on the number of
objects being packed. (This was tested on a 16-core machine.)

Test                                    HEAD~1    HEAD
-------------------------------------------------------------
5313.6: thin pack with --path-walk        0.01    0.01  +0.0%
5313.7: thin pack size with --path-walk    475     475  +0.0%
5313.12: big pack with --path-walk        1.99    1.87  -6.0%
5313.13: big pack size with --path-walk  14.4M   14.3M  -0.4%
5313.18: repack with --path-walk         98.14   41.46 -57.8%
5313.19: repack size with --path-walk   197.2M  197.3M  +0.0%

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2025-06-11 08:16:34 +02:00
Derrick Stolee
66c44c5fed survey: add --top=<N> option and config
The 'git survey' builtin provides several detail tables, such as "top
files by on-disk size". The size of these tables defaults to 10,
currently.

Allow the user to specify this number via a new --top=<N> option or the
new survey.top config key.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2025-06-11 08:16:34 +02:00
Derrick Stolee
5185b1c080 survey: add report of "largest" paths
Since we are already walking our reachable objects using the path-walk API,
let's now collect lists of the paths that contribute most to different
metrics. Specifically, we care about

 * Number of versions.
 * Total size on disk.
 * Total inflated size (no delta or zlib compression).

This information can be critical to discovering which parts of the
repository are causing the most growth, especially on-disk size. Different
packing strategies might help compress data more efficiently, but the toal
inflated size is a representation of the raw size of all snapshots of those
paths. Even when stored efficiently on disk, that size represents how much
information must be processed to complete a command such as 'git blame'.

The exact disk size seems to be not quite robust enough for testing, as
could be seen by the `linux-musl-meson` job consistently failing, possibly
because of zlib-ng deflates differently: t8100.4(git survey
(default)) was failing with a symptom like this:

   TOTAL OBJECT SIZES BY TYPE
   ===============================================
   Object Type | Count | Disk Size | Inflated Size
   ------------+-------+-----------+--------------
  -    Commits |    10 |      1523 |          2153
  +    Commits |    10 |      1528 |          2153
         Trees |    10 |       495 |          1706
         Blobs |    10 |       191 |           101
  -       Tags |     4 |       510 |           528
  +       Tags |     4 |       547 |           528

This means: the disk size is unlikely something we can verify robustly.
Since zlib-ng seems to increase the disk size of the tags from 528 to
547, we cannot even assume that the disk size is always smaller than the
inflated size. We will most likely want to either skip verifying the
disk size altogether, or go for some kind of fuzzy matching, say, by
replacing `s/ 1[45][0-9][0-9] / ~1.5k /` and `s/ [45][0-9][0-9] / ~½k /`
or something like that.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2025-06-11 08:16:34 +02:00
Derrick Stolee
d27d17c819 survey: add ability to track prioritized lists
In future changes, we will make use of these methods. The intention is to
keep track of the top contributors according to some metric. We don't want
to store all of the entries and do a sort at the end, so track a
constant-size table and remove rows that get pushed out depending on the
chosen sorting algorithm.

Co-authored-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by; Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
2025-06-11 08:16:34 +02:00
Derrick Stolee
9124e8f28b survey: show progress during object walk
Signed-off-by: Derrick Stolee <stolee@gmail.com>
2025-06-11 08:16:34 +02:00
Derrick Stolee
3564cae451 survey: summarize total sizes by object type
Now that we have explored objects by count, we can expand that a bit more to
summarize the data for the on-disk and inflated size of those objects. This
information is helpful for diagnosing both why disk space (and perhaps
clone or fetch times) is growing but also why certain operations are slow
because the inflated size of the abstract objects that must be processed is
so large.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2025-06-11 08:16:34 +02:00
Derrick Stolee
1422871dc0 survey: add object count summary
At the moment, nothing is obvious about the reason for the use of the
path-walk API, but this will become more prevelant in future iterations. For
now, use the path-walk API to sum up the counts of each kind of object.

For example, this is the reachable object summary output for my local repo:

REACHABLE OBJECT SUMMARY
========================
Object Type |  Count
------------+-------
       Tags |   1343
    Commits | 179344
      Trees | 314350
      Blobs | 184030

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2025-06-11 08:16:34 +02:00
Derrick Stolee
d7908330e2 survey: start pretty printing data in table form
When 'git survey' provides information to the user, this will be presented
in one of two formats: plaintext and JSON. The JSON implementation will be
delayed until the functionality is complete for the plaintext format.

The most important parts of the plaintext format are headers specifying the
different sections of the report and tables providing concreted data.

Create a custom table data structure that allows specifying a list of
strings for the row values. When printing the table, check each column for
the maximum width so we can create a table of the correct size from the
start.

The table structure is designed to be flexible to the different kinds of
output that will be implemented in future changes.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2025-06-11 08:16:34 +02:00
Jeff Hostetler
2f7e431d41 survey: add command line opts to select references
By default we will scan all references in "refs/heads/", "refs/tags/"
and "refs/remotes/".

Add command line opts let the use ask for all refs or a subset of them
and to include a detached HEAD.

Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
2025-06-11 08:16:34 +02:00
Jeff Hostetler
a7fec1b67a survey: stub in new experimental 'git-survey' command
Start work on a new 'git survey' command to scan the repository
for monorepo performance and scaling problems.  The goal is to
measure the various known "dimensions of scale" and serve as a
foundation for adding additional measurements as we learn more
about Git monorepo scaling problems.

The initial goal is to complement the scanning and analysis performed
by the GO-based 'git-sizer' (https://github.com/github/git-sizer) tool.
It is hoped that by creating a builtin command, we may be able to take
advantage of internal Git data structures and code that is not
accessible from GO to gain further insight into potential scaling
problems.

Co-authored-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
2025-06-11 08:16:34 +02:00
Derrick Stolee
0c24151f73 pack-objects: refactor path-walk delta phase
Previously, the --path-walk option to 'git pack-objects' would compute
deltas inline with the path-walk logic. This would make the progress
indicator look like it is taking a long time to enumerate objects, and
then very quickly computed deltas.

Instead of computing deltas on each region of objects organized by tree,
store a list of regions corresponding to these groups. These can later
be pulled from the list for delta compression before doing the "global"
delta search.

This presents a new progress indicator that can be used in tests to
verify that this stage is happening.

The current implementation is not integrated with threads, but could be
done in a future update.

Since we do not attempt to sort objects by size until after exploring
all trees, we can remove the previous change to t5530 due to a different
error message appearing first.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2025-06-11 08:16:33 +02:00
Derrick Stolee
9cd9f31437 pack-objects: enable --path-walk via config
Users may want to enable the --path-walk option for 'git pack-objects' by
default, especially underneath commands like 'git push' or 'git repack'.

This should be limited to client repositories, since the --path-walk option
disables bitmap walks, so would be bad to include in Git servers when
serving fetches and clones. There is potential that it may be helpful to
consider when repacking the repository, to take advantage of improved deltas
across historical versions of the same files.

Much like how "pack.useSparse" was introduced and included in
"feature.experimental" before being enabled by default, use the repository
settings infrastructure to make the new "pack.usePathWalk" config enabled by
"feature.experimental" and "feature.manyFiles".

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2025-06-11 08:16:33 +02:00
Derrick Stolee
125101f7c3 repack: add --path-walk option
Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.

Add the --path-walk option to the performance tests in p5313.

For the microsoft/fluentui repo [1] checked out at a specific commit [2],
the results are very interesting:

Test                                           this tree
------------------------------------------------------------------
5313.2: thin pack                              0.40(0.47+0.04)
5313.3: thin pack size                                    1.2M
5313.4: thin pack with --full-name-hash        0.09(0.10+0.04)
5313.5: thin pack size with --full-name-hash             22.8K
5313.6: thin pack with --path-walk             0.08(0.06+0.02)
5313.7: thin pack size with --path-walk                  20.8K
5313.8: big pack                               2.16(8.43+0.23)
5313.9: big pack size                                    17.7M
5313.10: big pack with --full-name-hash        1.42(3.06+0.21)
5313.11: big pack size with --full-name-hash             18.0M
5313.12: big pack with --path-walk             2.21(8.39+0.24)
5313.13: big pack size with --path-walk                  17.8M
5313.14: repack                                98.05(662.37+2.64)
5313.15: repack size                                    449.1K
5313.16: repack with --full-name-hash          33.95(129.44+2.63)
5313.17: repack size with --full-name-hash              182.9K
5313.18: repack with --path-walk               106.21(121.58+0.82)
5313.19: repack size with --path-walk                   159.6K

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

This repo suffers from having a lot of paths that collide in the name
hash, so examining them in groups by path leads to better deltas. Also,
in this case, the single-threaded implementation is competitive with the
full repack. This is saving time diffing files that have significant
differences from each other.

A similar, but private, repo has even more extremes in the thin packs:

Test                                           this tree
--------------------------------------------------------------
5313.2: thin pack                              2.39(2.91+0.10)
5313.3: thin pack size                                    4.5M
5313.4: thin pack with --full-name-hash        0.29(0.47+0.12)
5313.5: thin pack size with --full-name-hash             15.5K
5313.6: thin pack with --path-walk             0.35(0.31+0.04)
5313.7: thin pack size with --path-walk                  14.2K

Notice, however, that while the --full-name-hash version is working
quite well in these cases for the thin pack, it does poorly for some
other standard cases, such as this test on the Linux kernel repository:

Test                                           this tree
--------------------------------------------------------------
5313.2: thin pack                              0.01(0.00+0.00)
5313.3: thin pack size                                     310
5313.4: thin pack with --full-name-hash        0.00(0.00+0.00)
5313.5: thin pack size with --full-name-hash              1.4K
5313.6: thin pack with --path-walk             0.00(0.00+0.00)
5313.7: thin pack size with --path-walk                    310

Here, the --full-name-hash option does much worse than the default name
hash, but the path-walk option does exactly as well.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2025-06-11 08:16:33 +02:00
Derrick Stolee
2c32add904 pack-objects: introduce GIT_TEST_PACK_PATH_WALK
There are many tests that validate whether 'git pack-objects' works as
expected. Instead of duplicating these tests, add a new test environment
variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
when specified.

This was useful in testing the implementation of the --path-walk
implementation, especially in conjunction with test such as:

 - t0411-clone-from-partial.sh : One test fetches from a repo that does
   not have the boundary objects. This causes the path-based walk to
   fail. Disable the variable for this test.

 - t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo
   without a boundary object.

 - t5310-pack-bitmaps.sh : One test compares the case when packing with
   bitmaps to the case when packing without them. Since we disable the
   test variable when writing bitmaps, this causes a difference in the
   object list (the --path-walk option adds an extra object). Specify
   --no-path-walk in both processes for the comparison. Another test
   checks for a specific delta base, but when computing dynamically
   without using bitmaps, the base object it too small to be considered
   in the delta calculations so no base is used.

 - t5316-pack-delta-depth.sh : This script cares about certain delta
   choices and their chain lengths. The --path-walk option changes how
   these chains are selected, and thus changes the results of this test.

 - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
   the --sparse option and how it combines with --path-walk.

 - t5332-multi-pack-reuse.sh : This test verifies that the preferred
   pack is used for delta reuse when possible. The --path-walk option is
   not currently aware of the preferred pack at all, so finds a
   different delta base.

 - t7406-submodule-update.sh : When using the variable, the --depth
   option collides with the --path-walk feature, resulting in a warning
   message. Disable the variable so this warning does not appear.

I want to call out one specific test change that is only temporary:

 - t5530-upload-pack-error.sh : One test cares specifically about an
   "unable to read" error message. Since the current implementation
   performs delta calculations within the path-walk API callback, a
   different "unable to get size" error message appears. When this
   is changed in a future refactoring, this test change can be reverted.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2025-06-11 08:16:33 +02:00
Derrick Stolee
b12b4368b1 pack-objects: add --path-walk option
In order to more easily compute delta bases among objects that appear at the
exact same path, add a --path-walk option to 'git pack-objects'.

This option will use the path-walk API instead of the object walk given by
the revision machinery. Since objects will be provided in batches
representing a common path, those objects can be tested for delta bases
immediately instead of waiting for a sort of the full object list by
name-hash. This has multiple benefits, including avoiding collisions by
name-hash.

The objects marked as UNINTERESTING are included in these batches, so we
are guaranteeing some locality to find good delta bases.

After the individual passes are done on a per-path basis, the default
name-hash is used to find other opportunistic delta bases that did not
match exactly by the full path name.

RFC TODO: It is important to note that this option is inherently
incompatible with using a bitmap index. This walk probably also does not
work with other advanced features, such as delta islands.

Getting ahead of myself, this option compares well with --full-name-hash
when the packfile is large enough, but also performs at least as well as
the default in all cases that I've seen.

RFC TODO: this should probably be recording the batch locations to another
list so they could be processed in a second phase using threads.

RFC TODO: list some examples of how this outperforms previous pack-objects
strategies. (This is coming in later commits that include performance
test changes.)

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2025-06-11 08:16:33 +02:00
Derrick Stolee
2faba74266 pack-objects: extract should_attempt_deltas()
This will be helpful in a future change.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2025-06-11 08:16:33 +02:00
Johannes Schindelin
11467abb92 clean: remove mount points when possible
Windows' equivalent to "bind mounts", NTFS junction points, can be
unlinked without affecting the mount target. This is clearly what users
expect to happen when they call `git clean -dfx` in a worktree that
contains NTFS junction points: the junction should be removed, and the
target directory of said junction should be left alone (unless it is
inside the worktree).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2025-06-11 08:08:35 +02:00
Johannes Schindelin
72dc1a5d55 clean: do not traverse mount points
It seems to be not exactly rare on Windows to install NTFS junction
points (the equivalent of "bind mounts" on Linux/Unix) in worktrees,
e.g. to map some development tools into a subdirectory.

In such a scenario, it is pretty horrible if `git clean -dfx` traverses
into the mapped directory and starts to "clean up".

Let's just not do that. Let's make sure before we traverse into a
directory that it is not a mount point (or junction).

This addresses https://github.com/git-for-windows/git/issues/607

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2025-06-11 08:08:35 +02:00
Junio C Hamano
d9a1e51c76 Merge branch 'bs/total-ram-bsd'
Update total_ram() functrion on BSD variants.

* bs/total-ram-bsd:
  builtin/gc: correct physical memory detection for OpenBSD / NetBSD
2025-06-03 08:55:24 -07:00
Brad Smith
35c1d592cd builtin/gc: correct physical memory detection for OpenBSD / NetBSD
OpenBSD / NetBSD use HW_PHYSMEM64 to detect the amount of physical
memory in a system. HW_PHYSMEM will not provide the correct amount
on a system with >=4GB of memory.

Signed-off-by: Brad Smith <brad@comstyle.com>
Reviewed-by: Collin Funk <collin.funk1@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-06-01 19:01:07 -07:00
Junio C Hamano
0b4c6baa70 fast-export: --signed-commits is experimental
As the design of signature handling is still being discussed, it is
likely that the data stream produced by the code in Git 2.50 would
have to be changed in such a way that is not backward compatible.

Mark the feature as experimental and discourge its use for now.

Also flip the default on the generation side to "strip"; users of
existing versions would not have passed --signed-commits=strip and
will be broken by this change if the default is made to abort, and
will be encouraged by the error message to produce data stream with
future breakage guarantees by passing --signed-commits option.

As we tone down the default behaviour, we no longer need the
FAST_EXPORT_SIGNED_COMMITS_NOABORT environment variable, which was
not discoverable enough.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-28 10:30:47 -07:00
Junio C Hamano
b4847a4477 Merge branch 'jt/receive-pack-skip-connectivity-check'
"git receive-pack" optionally learns not to care about connectivity
check, which can be useful when the repository arranges to ensure
connectivity by some other means.

* jt/receive-pack-skip-connectivity-check:
  builtin/receive-pack: add option to skip connectivity check
  t5410: test receive-pack connectivity check
2025-05-28 07:59:56 -07:00
Junio C Hamano
f9cdaa2860 Merge branch 'js/misc-fixes'
Assorted fixes for issues found with CodeQL.

* js/misc-fixes:
  sequencer: stop pretending that an assignment is a condition
  bundle-uri: avoid using undefined output of `sscanf()`
  commit-graph: avoid using stale stack addresses
  trace2: avoid "futile conditional"
  Avoid redundant conditions
  fetch: avoid unnecessary work when there is no current branch
  has_dir_name(): make code more obvious
  upload-pack: rename `enum` to reflect the operation
  commit-graph: avoid malloc'ing a local variable
  fetch: carefully clear local variable's address after use
  commit: simplify code
2025-05-27 13:59:11 -07:00
Junio C Hamano
6e5fb398d3 Merge branch 'ds/sparse-apply-add-p'
"git apply" and "git add -i/-p" code paths no longer unnecessarily
expand sparse-index while working.

* ds/sparse-apply-add-p:
  p2000: add performance test for patch-mode commands
  reset: integrate sparse index with --patch
  git add: make -p/-i aware of sparse index
  apply: integrate with the sparse index
2025-05-27 13:59:09 -07:00
Junio C Hamano
f545f401be Merge branch 'en/merge-tree-check'
"git merge-tree" learned an option to see if it resolves cleanly
without actually creating a result.

* en/merge-tree-check:
  merge-tree: add a new --quiet flag
  merge-ort: add a new mergeability_only option
2025-05-27 13:59:08 -07:00
Junio C Hamano
17d9dbd3c2 Merge branch 'jk/no-funny-object-types'
Support to create a loose object file with unknown object type has
been dropped.

* jk/no-funny-object-types:
  object-file: drop support for writing objects with unknown types
  hash-object: handle --literally with OPT_NEGBIT
  hash-object: merge HASH_* and INDEX_* flags
  hash-object: stop allowing unknown types
  t: add lib-loose.sh
  t/helper: add zlib test-tool
  oid_object_info(): drop type_name strbuf
  fsck: stop using object_info->type_name strbuf
  oid_object_info_convert(): stop using string for object type
  cat-file: use type enum instead of buffer for -t option
  object-file: drop OBJECT_INFO_ALLOW_UNKNOWN_TYPE flag
  cat-file: make --allow-unknown-type a noop
  object-file.h: fix typo in variable declaration
2025-05-27 13:59:08 -07:00
Junio C Hamano
96d127896d Merge branch 'en/replay-wo-the-repository'
The dependency on the_repository variable has been reduced from the
code paths in "git replay".

* en/replay-wo-the-repository:
  replay: replace the_repository with repo parameter passed to cmd_replay ()
2025-05-23 15:34:08 -07:00
Justin Tobler
68cb0b5253 builtin/receive-pack: add option to skip connectivity check
During git-receive-pack(1), connectivity of the object graph is
validated to ensure that the received packfile does not leave the
repository in a broken state. This is done via git-rev-list(1) and
walking the objects, which can be expensive for large repositories.

Generally, this check is critical to avoid an incomplete received
packfile from corrupting a repository. Server operators may have
additional knowledge though around exactly how Git is being used on the
server-side which can be used to facilitate more efficient connectivity
computation of incoming objects.

For example, if it can be ensured that all objects in a repository are
connected and do not depend on any missing objects, the connectivity of
newly written objects can be checked by walking the object graph
containing only the new objects from the updated tips and identifying
the missing objects which represent the boundary between the new objects
and the repository. These boundary objects can be checked in the
canonical repository to ensure the new objects connect as expected and
thus avoid walking the rest of the object graph.

Git itself cannot make the guarantees required for such an optimization
as it is possible for a repository to contain an unreachable object that
references a missing object without the repository being considered
corrupt.

Introduce the --skip-connectivity-check option for git-receive-pack(1)
which bypasses this connectivity check to give more control to the
server-side. Note that without proper server-side validation of newly
received objects handled outside of Git, usage of this option risks
corrupting a repository.

Signed-off-by: Justin Tobler <jltobler@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-20 11:43:36 -07:00
Junio C Hamano
a9dcacbf2a Merge branch 'jk/oidmap-cleanup'
Code cleanup.

* jk/oidmap-cleanup:
  raw_object_store: drop extra pointer to replace_map
  oidmap: add size function
  oidmap: rename oidmap_free() to oidmap_clear()
2025-05-19 16:02:47 -07:00
Junio C Hamano
6660b42929 Merge branch 'ly/am-split-stgit-leakfix'
Leakfix.

* ly/am-split-stgit-leakfix:
  builtin/am: fix memory leak in `split_mail_stgit_series`
2025-05-19 16:02:46 -07:00
Elijah Newren
29d7bf1951 merge-tree: add a new --quiet flag
Git Forges may be interested in whether two branches can be merged while
not being interested in what the resulting merge tree is nor which files
conflicted.  For such cases, add a new --quiet flag which
will make use of the new mergeability_only flag added to merge-ort in
the previous commit.  This option allows the merge machinery to, in the
outer layer of the merge:
    * exit early when a conflict is detected
    * avoid writing (most) merged blobs/trees to the object store

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-16 15:09:14 -07:00
Derrick Stolee
efab7dc1f4 reset: integrate sparse index with --patch
Similar to the previous change for 'git add -p', the reset builtin
checked for integration with the sparse index after possibly redirecting
its logic toward the interactive logic. This means that the builtin
would expand the sparse index to a full one upon read.

Move this check earlier within cmd_reset() to improve performance here.

Add tests to guarantee that we are not universally expanding the index.
Add behavior tests to check that we are doing the same operations as a
full index.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-16 12:02:47 -07:00
Derrick Stolee
02ed8555f6 git add: make -p/-i aware of sparse index
It is slow to expand a sparse index in-memory due to parsing of trees.
We aim to minimize that performance cost when possible. 'git add -p'
uses 'git apply' child processes to modify the index, but still there
are some expansions that occur.

It turns out that control flows out of cmd_add() in the interactive
cases before the lines that confirm that the builtin is integrated with
the sparse index.

Moving that integration point earlier in cmd_add() allows 'git add -i'
and 'git add -p' to operate without expanding a sparse index to a full
one.

Add test cases that confirm that these interactive add options work with
the sparse index.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-16 12:01:51 -07:00
Derrick Stolee
952de281fe apply: integrate with the sparse index
The sparse index allows storing directory entries in the index, marked
with the skip-wortkree bit and pointing to a tree object. This may be an
unexpected data shape for some implementation areas, so we are rolling
it out incrementally on a builtin-per-builtin basis.

This change enables the sparse index for 'git apply'. The main
motivation for this change is that 'git apply' is used as a child
process of 'git add -p' and expanding the sparse index for each of those
child processes can lead to significant performance issues.

The good news is that the actual index manipulation code used by 'git
apply' is already integrated with the sparse index, so the only product
change is to mark the builtin as allowing the sparse index so it isn't
inflated on read.

The more involved part of this change is around adding tests that verify
how 'git apply' behaves in a sparse-checkout environment and whether or
not the index expands in certain operations.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-16 12:00:33 -07:00
Jeff King
f710fd7b49 hash-object: handle --literally with OPT_NEGBIT
Since we recently removed the hash_literally() function, the hash-object
--literally option has been simplified to just removing the
INDEX_FORMAT_CHECK flag. Rather than pass it around as a separate bool,
we can just have the option parser remove the bit from the set of flags
directly. This simplifies the helper functions.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-16 09:43:11 -07:00
Jeff King
931e5ca507 hash-object: merge HASH_* and INDEX_* flags
The hash-object command has its own custom flag bits that it sets based
on command-line options. But since we dropped hash_literally() in the
previous commit, the only thing we do with those flag bits is convert
them directly into "index_flags" to pass to index_fd().

This extra layer of indirection makes the code harder to read and reason
about. Let's just use the INDEX_* flags directly.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-16 09:43:11 -07:00
Jeff King
65a6a79b42 hash-object: stop allowing unknown types
When passed the "--literally" option, hash-object will allow any
arbitrary string for its "-t" type option. Such objects are only useful
for testing or debugging, as they cannot be used in the normal way
(e.g., you cannot fetch their contents!).

Let's drop this feature, which will eventually let us simplify the
object-writing code. This is technically backwards incompatible, but
since such objects were never really functional, it seems unlikely that
anybody will notice.

We will retain the --literally flag, as it also instructs hash-object
not to worry about other format issues (e.g., type-specific things that
fsck would complain about). The documentation does not need to be
updated, as it was always vague about which checks we're loosening (it
uses only the phrase "any garbage").

The code change is a bit hard to verify from just the patch text. We can
drop our local hash_literally() helper, but it was really just wrapping
write_object_file_literally(). We now replace that with calling
index_fd(), as we do for the non-literal code path, but dropping the
INDEX_FORMAT_CHECK flag. This ends up being the same semantically as
what the _literally() code path was doing (modulo handling unknown
types, which is our goal).

We'll be able to clean up these code paths a bit more in subsequent
patches.

The existing test is flipped to show that we now reject the unknown
type. The additional "extra-long type" test is now redundant, as we bail
early upon seeing a bogus type.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-16 09:43:11 -07:00
Jeff King
4ae0e9423c fsck: stop using object_info->type_name strbuf
When fsck-ing a loose object, we use object_info's type_name strbuf to
record the parsed object type as a string. For most objects this is
redundant with the object_type enum, but it does let us report the
string when we encounter an object with an unknown type (for which there
is no matching enum value).

There are a few downsides, though:

  1. The code to report these cases is not actually robust. Since we did
     not pass a strbuf to unpack_loose_header(), we only retrieved types
     from headers up to 32 bytes. In longer cases, we'd simply say
     "object corrupt or missing".

  2. This is the last caller that uses object_info's type_name strbuf
     support. It would be nice to refactor it so that we can simplify
     that code.

  3. Likewise, we'll check the hash of the object using its unknown type
     (again, as long as that type is short enough). That depends on the
     hash_object_file_literally() code, which we'd eventually like to
     get rid of.

So we can simplify things by bailing immediately in read_loose_object()
when we encounter an unknown type. This has a few user-visible effects:

  a. Instead of producing a single line of error output like this:

       error: 26ed13ce3564fbbb44e35bde42c7da717ea004a6: object is of unknown type 'bogus': .git/objects/26/ed13ce3564fbbb44e35bde42c7da717ea004a6

     we'll now issue two lines (the first from read_loose_object() when
     we see the unparsable header, and the second from the fsck code,
     since we couldn't read the object):

       error: unable to parse type from header 'bogus 4' of .git/objects/26/ed13ce3564fbbb44e35bde42c7da717ea004a6
       error: 26ed13ce3564fbbb44e35bde42c7da717ea004a6: object corrupt or missing: .git/objects/26/ed13ce3564fbbb44e35bde42c7da717ea004a6

     This is a little more verbose, but this sort of error should be
     rare (such objects are almost impossible to work with, and cannot
     be transferred between repositories as they are not representable
     in packfiles). And as a bonus, reporting the broken header in full
     could help with debugging other cases (e.g., a header like "blob
     xyzzy\0" would fail in parsing the size, but previously we'd not
     have showed the offending bytes).

  b. An object with an unknown type will be reported as corrupt, without
     actually doing a hash check. Again, I think this is unlikely to
     matter in practice since such objects are totally unusable.

We'll update one fsck test to match the new error strings. And we can
remove another test that covered the case of an object with an unknown
type _and_ a hash corruption. Since we'll skip the hash check now in
this case, the test is no longer interesting.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-16 09:43:10 -07:00
Jeff King
aac2abeca7 cat-file: use type enum instead of buffer for -t option
Now that we no longer support OBJECT_INFO_ALLOW_UNKNOWN_TYPE, there is
no need to pass a strbuf into oid_object_info_extended() to record the
type. The regular object_type enum is sufficient to capture all of the
types we will allow.

This simplifies the code a bit, and will eventually let us drop
object_info's type_name strbuf support.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-16 09:43:10 -07:00
Jeff King
f227fc7d43 cat-file: make --allow-unknown-type a noop
The cat-file command has some minor support for handling objects with
"unknown" types. I.e., strings that are not "blob", "commit", "tree", or
"tag".

In theory this could be used for debugging or experimenting with
extensions to Git. But in practice this support is not very useful:

  1. You can get the type and size of such objects, but nothing else.
     Not even the contents!

  2. Only loose objects are supported, since packfiles use numeric ids
     for the types, rather than strings.

  3. Likewise you cannot ever transfer objects between repositories,
     because they cannot be represented in the packfiles used for the
     on-the-wire protocol.

The support for these unknown types complicates the object-parsing code,
and has led to bugs such as b748ddb7a4 (unpack_loose_header(): fix
infinite loop on broken zlib input, 2025-02-25). So let's drop it.

The first step is to remove the user-facing parts, which are accessible
only via cat-file. This is technically backwards-incompatible, but given
the limitations listed above, these objects couldn't possibly be useful
in any workflow.

However, we can't just rip out the option entirely. That would hurt a
caller who ran:

  git cat-file -t --allow-unknown-object <oid>

and fed it normal, well-formed objects. There --allow-unknown-type was
doing nothing, but we wouldn't want to start bailing with an error. So
to protect any such callers, we'll retain --allow-unknown-type as a
noop.

The code change is fairly small (but we'll able to clean up more code in
follow-on patches). The test updates drop any use of the option. We
still retain tests that feed the broken objects to cat-file without
--allow-unknown-type, as we should continue to confirm that those
objects are rejected. Note that in one spot we can drop a layer of loop,
re-indenting the body; viewing the diff with "-w" helps there.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-16 09:43:09 -07:00
Junio C Hamano
4dda60c9df Merge branch 'ps/maintenance-missing-tasks'
Make repository clean-up tasks "gc" can do available to "git
maintenance" front-end.

* ps/maintenance-missing-tasks:
  builtin/maintenance: introduce "rerere-gc" task
  builtin/gc: move rerere garbage collection into separate function
  builtin/maintenance: introduce "worktree-prune" task
  builtin/gc: move pruning of worktrees into a separate function
  builtin/gc: remove global variables where it is trivial to do
  builtin/gc: fix indentation of `cmd_gc()` parameters
2025-05-15 17:24:56 -07:00
Johannes Schindelin
6c91162449 fetch: avoid unnecessary work when there is no current branch
As pointed out by CodeQL, `branch_get()` may return `NULL`, in which
case `branch_has_merge_config()` would return early, but we can even
avoid enumerating the refs prefixes in that case, saving even more CPU
cycles.

Technically, we should enclose these two statements in an `if (branch)
{...}` block, but the indentation is already quite deep, therefore I
refrained from doing that.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-15 13:46:47 -07:00
Johannes Schindelin
c607410ada fetch: carefully clear local variable's address after use
As pointed out by CodeQL, it is a potentially dangerous practice to
store local variables' addresses in non-local structs. Yet this is
exactly what happens with the `acked_commits` attribute that is used in
`cmd_fetch()`: The pointer to a local variable is assigned to it.

Now, it is Git's convention that `cmd_*()` functions are essentially
only returning just before exiting the process, therefore there is
little danger that this attribute is used after the code flow returns
from that function.

However, code in `cmd_*()` function is often so useful that it gets
lifted into a library function, at which point this issue could become a
real problem.

Let's make sure to clear the `acked_commits` attribute out after it was
used, and before the function returns (at which point the address would
go stale).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-15 13:46:45 -07:00
Johannes Schindelin
131a8fa815 commit: simplify code
The difference of two unsigned integers is defined to be unsigned, and
therefore it is misleading to check whether it is greater than zero
(instead, the more natural way would be to check whether the difference
is zero or not).

Let's instead avoid the subtraction altogether, and compare the two
operands directly, which makes the code more obvious as a side effect.

Pointed out by CodeQL's rule with the ID
`cpp/unsigned-difference-expression-compared-zero`.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-15 13:46:44 -07:00
Elijah Newren
d2c3e94a0a replay: replace the_repository with repo parameter passed to cmd_replay ()
Replace the_repository everywhere with repo, feed repo from cmd_replay()
to all the other functions in the file that need it, and remove the
UNUSED annotation on repo.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-14 15:00:49 -07:00