159590 Commits

Author SHA1 Message Date
Johannes Schindelin
4f0462d7da reset: reinstate support for the deprecated --stdin option
The `--stdin` option was a well-established paradigm in other commands,
therefore we implemented it in `git reset` for use by Visual Studio.

Unfortunately, upstream Git decided that it is time to introduce
`--pathspec-from-file` instead.

To keep backwards-compatibility for some grace period, we therefore
reinstate the `--stdin` option on top of the `--pathspec-from-file`
option, but mark it firmly as deprecated.

Helped-by: Victoria Dye <vdye@github.com>
Helped-by: Matthew John Cheetham <mjcheetham@outlook.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:51 +02:00
Johannes Schindelin
597abad9b2 Merge branch 'ready-for-upstream'
This is the branch thicket of patches in Git for Windows that are
considered ready for upstream. To keep them in a ready-to-submit shape,
they are kept as close to the beginning of the branch thicket as
possible.
2024-10-03 08:57:35 +02:00
Johannes Schindelin
ccfac7e56e Add experimental 'git survey' builtin (#5174)
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft/git#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.
2024-10-03 08:57:34 +02:00
Johannes Schindelin
1cdf89e88b Introduce 'git backfill' to get missing blobs in a partial clone (#5172)
This change introduces the `git backfill` command which uses the path
walk API to download missing blobs in a blobless partial clone.

By downloading blobs that correspond to the same file path at the same
time, we hope to maximize the potential benefits of delta compression
against multiple versions.

These downloads occur in a configurable batch size, presenting a
mechanism to perform "resumable" clones: `git clone --filter=blob:none`
gets the commits and trees, then `git backfill` will download all
missing blobs. If `git backfill` is interrupted partway through, it can
be restarted and will redownload only the missing objects.

When combining blobless partial clones with sparse-checkout, `git
backfill` will assume its `--sparse` option and download only the blobs
within the sparse-checkout. Users may want to do this as the repo size
will still be smaller than the full repo size, but commands like `git
blame` or `git log -L` will not suffer from many one-by-one blob
downloads.

Future directions should consider adding a pathspec or file prefix to
further focus which paths are being downloaded in a batch.
2024-10-03 08:57:34 +02:00
Derrick Stolee
ef977e276d Add path walk API and its use in 'git pack-objects' (#5171)
This is a follow up to #5157 as well as motivated by the RFC in
gitgitgadget/git#1786.

We have ways of walking all objects, but it is focused on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
2024-10-03 08:57:34 +02:00
Johannes Schindelin
c316eab995 pack-objects: create new name-hash algorithm (#5157)
This is an updated version of gitgitgadget/git#1785, intended for early
consumption into Git for Windows.

The idea here is to add a new `--full-name-hash` option to `git
pack-objects` and `git repack`. This adjusts the name-hash value used
for finding delta bases in such a way that uses the full path name with
a lower likelihood of collisions than the default name-hash algorithm.
In many repositories with name-hash collisions and many versions of
those paths, this can significantly reduce the size of a full repack. It
can also help in certain cases of `git push`, but only if the pack is
already artificially inflated by name-hash collisions; cases that find
"sibling" deltas as better choices become worse with `--full-name-hash`.

Thus, this option is currently recommended for full repacks of large
repos, and on client machines without reachability bitmaps.

Some care is taken to ignore this option when using bitmaps, either
writing bitmaps or using a bitmap walk during reads. The bitmap file
format contains name-hash values, but no way to indicate which function
is used, so compatibility is a concern for bitmaps. Future work could
explore this idea.

After this PR is merged, then the more-involved `--path-walk` option may
be considered.
2024-10-03 08:57:33 +02:00
Johannes Schindelin
05be223493 Merge branch 'run-command-be-helpful-when-Git-LFS-fails-on-Windows-7'
Since Git LFS v3.5.x implicitly dropped Windows 7 support, we now want
users to be advised _what_ is going wrong on that Windows version. This
topic branch goes out of its way to provide users with such guidance.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:33 +02:00
Johannes Schindelin
d5af88fbcc Merge branch 'Fallback-to-AppData-if-XDG-CONFIG-HOME-is-unset'
This topic branch adds support for a more Windows-native user-wide
config file than `XDG_CONFIG_HOME` (or `~/.config/`) will ever be.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:33 +02:00
Johannes Schindelin
a9d2c13795 Merge branch 'Fix-i686-build-with-GCC-v14'
This fixes a long-time compile warning turned error by GCC v14.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:33 +02:00
Johannes Schindelin
3cfb527016 Merge branch 'run-t5601-and-t7406-with-symlinks-on-windows-10'
This topic branch contains a patch that made it into Git for Windows
v2.45.1 but not into Git v2.45.1 (because the latter does not come with
symlink support on Windows).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:33 +02:00
Johannes Schindelin
98b8cfb898 common-main.c: fflush stdout buffer when exit (#4901) 2024-10-03 08:57:32 +02:00
Johannes Schindelin
76ec3f27ac ARM64: Embed manifest properly (#4718)
Teach our ARM64 based builds to embed the manifest file correctly.

This fixes #4707
2024-10-03 08:57:32 +02:00
Johannes Schindelin
bb0eb0b3e5 win32: use native ANSI sequence processing, if possible (#4700)
Windows 10 version 1511 (also known as Anniversary Update), according to
https://learn.microsoft.com/en-us/windows/console/console-virtual-terminal-sequences
introduced native support for ANSI sequence processing. This allows
using colors from the entire 24-bit color range.

All we need to do is test whether the console's "virtual processing
support" can be enabled. If it can, we do not even need to start the
`console_thread` to handle ANSI sequences.

Incidentally, this addresses
https://github.com/git-for-windows/git/issues/3712.
2024-10-03 08:57:32 +02:00
Johannes Schindelin
e44720f77a Additional error checks for issuing the windows.appendAtomically warning (#4528)
Another (hopefully clean) PR for showing the error warning about atomic
append on windows after failure on APFS, which returns EBADF not EINVAL.

Signed-off-by: David Lomas <dl3@pale-eds.co.uk>
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
2024-10-03 08:57:32 +02:00
Johannes Schindelin
32c3fd83c0 Merge branch 'nano-server'
This patch adds a GitHub workflow (to be triggered manually) to allow
for conveniently verifying that Git and Scalar still work as intended in
Windows Nano Server (a relatively small container base image that is
frequently used where a "small Windows" is needed, e.g. in automation
;-))

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:31 +02:00
Johannes Schindelin
2092813b17 Lazy load libcurl, allowing for an SSL/TLS backend-specific libcurl (#4410)
As per
https://github.com/git-for-windows/git/issues/4350#issuecomment-1485041503,
the major block for upgrading Git for Windows' OpenSSL from v1.1 to v3
is the tricky part where such an upgrade would break `git fetch`/`git
clone` and `git push` because the libcurl depends on the OpenSSL DLL,
and the major version bump will _change_ the file name of said DLL.

To overcome that, the plan is to build libcurl flavors for each
supported SSL/TLS backend, aligning with the way MSYS2 builds libcurl,
then switch Git for Windows' SDK to the Secure Channel-flavored libcurl,
and teach Git to look for the specific flavor of libcurl corresponding
to the `http.sslBackend` setting (if that was configured).

Here is the PR to teach Git that trick.
2024-10-03 08:57:31 +02:00
Johannes Schindelin
4ae2e09308 Git GUI: fix Repository>Explore Working Copy (#4357)
This is a companion PR of https://github.com/prati0100/git-gui/pull/95

Since Git v2.39.1, we are a bit more stringent in searching the PATH. In
particular, we specifically require the `.exe` suffix.

However, the `Repository>Explore Working Copy` command asks for
`explorer.exe` to be found on the `PATH`, which _already_ has that
suffix.

Let's unstartle the PATH-finding logic about this scenario.

This fixes https://github.com/git-for-windows/git/issues/4356
2024-10-03 08:57:31 +02:00
Johannes Schindelin
f6a0aa8bbb Skip linking the "dashed" git-<command>s for built-ins (#4252)
It is merely a historical wart that, say, `git-commit` exists in the
`libexec/git-core/` directory, a tribute to the original idea to let Git
be essentially a bunch of Unix shell scripts revolving around very few
"plumbing" (AKA low-level) commands.

Git has evolved a lot from there. These days, most of Git's
functionality is contained within the `git` executable, in the form of
"built-in" commands.

To accommodate for scripts that use the "dashed" form of Git commands,
even today, Git provides hard-links that make the `git` executable
available as, say, `git-commit`, just in case that an old script has not
been updated to invoke `git commit`.

Those hard-links do not come cheap: they take about half a minute for
every build of Git on Windows, they are mistaken for taking up huge
amounts of space by some Windows Explorer versions that do not
understand hard-links, and therefore many a "bug" report had to be
addressed.

The "dashed form" has been officially deprecated in Git version 1.5.4,
which was released on February 2nd, 2008, i.e. a very long time ago.
This deprecation was never finalized by skipping these hard-links, but
we can start the process now, in Git for Windows.

This addresses the concern raised in
https://github.com/git-for-windows/git/pull/4185#discussion_r1051661894
2024-10-03 08:57:31 +02:00
Johannes Schindelin
e0ff312c57 Fix global repository field not being cleared (#4083)
It is checked for w.r.t. global repository struct down in the callstack
in compatibility layer for MinGW before being assigned in the function
that `free()`'d it.
2024-10-03 08:57:31 +02:00
Johannes Schindelin
a2575dcf2b Add support for CLANGARM64 target (#3916)
**This requires an ARM64-machine with Windows 11 installed (which
supports x64 emulation for MSYS2)**

### The main idea

- Use the main MSYS2/git-sdk-64 setup, which works on Windows 11 on ARM
thanks to x64-emulation
- Configure the official `clangarm64` MSYS2 repo
- Install `mingw-w64-clang-aarch64-toolchain` which contains the
ARM64-native Clang compiler
2024-10-03 08:57:30 +02:00
Johannes Schindelin
2080018704 Merge branch 'builtin-swap-functions'
Do prefer GCC's built-in bit-swap functions.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:30 +02:00
Johannes Schindelin
7195c553b6 Fix Windows version resources (#4092)
Add `FileVersion`, which is a required string ([Microsoft
documentation](https://learn.microsoft.com/en-us/windows/win32/menurc/versioninfo-resource))
in the `StringFileInfo` block.
As not all required strings were present in the block, none were being
included.
Fixes #4090

After including the `FileVersion` string, all other defined strings are
now being included on executables.

File version information for `git.exe` has changed from:
```
PS C:\Program Files\Git\bin> [System.Diagnostics.FileVersionInfo]::GetVersionInfo("C:\Data\git-sdk-64\usr\src\git\git.exe") | Select-Object *

FileVersionRaw     : 2.38.1.1
ProductVersionRaw  : 2.38.1.1
Comments           :
CompanyName        :
FileBuildPart      : 1
FileDescription    :
FileMajorPart      : 2
FileMinorPart      : 38
FileName           : C:\Data\git-sdk-64\usr\src\git\git.exe
FilePrivatePart    : 1
FileVersion        :
InternalName       :
IsDebug            : False
IsPatched          : False
IsPrivateBuild     : False
IsPreRelease       : False
IsSpecialBuild     : False
Language           : English (United States)
LegalCopyright     :
LegalTrademarks    :
OriginalFilename   :
PrivateBuild       :
ProductBuildPart   : 1
ProductMajorPart   : 2
ProductMinorPart   : 38
ProductName        :
ProductPrivatePart : 1
ProductVersion     :
SpecialBuild       :
```

To the following:
```
PS C:\Program Files\Git\bin> [System.Diagnostics.FileVersionInfo]::GetVersionInfo("C:\Data\git-sdk-64\usr\src\git\git.exe") | Select-Object *

FileVersionRaw     : 2.38.1.1
ProductVersionRaw  : 2.38.1.1
Comments           :
CompanyName        : The Git Development Community
FileBuildPart      : 1
FileDescription    : Git for Windows
FileMajorPart      : 2
FileMinorPart      : 38
FileName           : C:\Data\git-sdk-64\usr\src\git\git.exe
FilePrivatePart    : 1
FileVersion        : 2.38.1.windows.1.10.g6ed65a6fab
InternalName       : git
IsDebug            : False
IsPatched          : False
IsPrivateBuild     : False
IsPreRelease       : False
IsSpecialBuild     : False
Language           : English (United States)
LegalCopyright     :
LegalTrademarks    :
OriginalFilename   : git.exe
PrivateBuild       :
ProductBuildPart   : 1
ProductMajorPart   : 2
ProductMinorPart   : 38
ProductName        : Git
ProductPrivatePart : 1
ProductVersion     : 2.38.1.windows.1.10.g6ed65a6fab
SpecialBuild       :
```

I wasn't really expecting `GIT_VERSION` to contain the Git commit, I was
hoping for just `2.38.1` or `2.38.1.1`, at least for the `FileVersion`
string.

Anybody know if it's possible to concatenate the `MAJOR`, `MINOR`,
`MICRO`, and `PATCHLEVEL` fields with dots, or if there's another
variable that can be used (with or without `PATCHLEVEL`)?
Alternatively, use the complete `GIT_VERSION` for both `FileVersion` and
`ProductVersion`.
2024-10-03 08:57:30 +02:00
Johannes Schindelin
f2a98e9cc0 Merge pull request #3942 from rimrul/mingw-tsaware
MinGW: link as terminal server aware
2024-10-03 08:57:30 +02:00
Johannes Schindelin
5139673e38 Merge branch 'ci-fixes'
Backport a couple fixes to make the CI build run again (so much for
reproducible builds...).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:29 +02:00
Johannes Schindelin
cc588b47d4 Merge branch 'fsync-object-files-always'
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:29 +02:00
Johannes Schindelin
4955a5717d Merge branch 'optionally-dont-append-atomically-on-windows'
Fix append failure issue under remote directories #2753

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:29 +02:00
Johannes Schindelin
0896560f1b Merge pull request #3875 from 1480c1/wine/detect_msys_tty
winansi: check result before using Name for pty
2024-10-03 08:57:29 +02:00
Johannes Schindelin
2f7e2d877c Merge pull request #3751 from rkitover/native-term
mingw: set $env:TERM=xterm-256color for newer OSes
2024-10-03 08:57:29 +02:00
Derrick Stolee
68d57b17a6 Merge pull request #3791: Various fixes around safe.directory
The first three commits are rebased versions of those in gitgitgadget/git#1215. These allow the following:

1. Fix `git config --global foo.bar <path>` from allowing the `<path>`. As a bonus, users with a config value starting with `/` will not get a warning about "old-style" paths needing a "`%(prefix)/`".

2. When in WSL, the path starts with `/` so it needs to be interpolated properly. Update the warning to include `%(prefix)/` to get the right value for WSL users. (This is specifically for using Git for Windows from Git Bash, but in a WSL directory.)

3. When using WSL, the ownership check fails and reports an error message. This is noisy, and happens even if the user has marked the path with `safe.directory`. Remove that error message.
2024-10-03 08:57:28 +02:00
Johannes Schindelin
37075e88ea Merge pull request #3533 from PhilipOakley/hashliteral_t
Begin `unsigned long`->`size_t` conversion to support large files on Windows
2024-10-03 08:57:28 +02:00
Johannes Schindelin
2be2c924d1 survey: clearly note the experimental nature in the output
While this command is definitely something we _want_, chances are that
upstreaming this will require substantial changes.

We still want to be able to experiment with this before that, to focus
on what we need out of this command: To assist with diagnosing issues
with large repositories, as well as to help monitoring the growth and
the associated painpoints of such repositories.

To that end, we are about to integrate this command into
`microsoft/git`, to get the tool into the hands of users who need it
most, with the idea to iterate in close collaboration between these
users and the developers familar with Git's internals.

However, we will definitely want to avoid letting anybody have the
impression that this command, its exact inner workings, as well as its
output format, are anywhere close to stable. To make that fact utterly
clear (and thereby protect the freedom to iterate and innovate freely
before upstreaming the command), let's mark its output as experimental
in all-caps, as the first thing we do.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:45:24 +02:00
Derrick Stolee
7e30b29b06 survey: add --top=<N> option and config
The 'git survey' builtin provides several detail tables, such as "top
files by on-disk size". The size of these tables defaults to 100,
currently.

Allow the user to specify this number via a new --top=<N> option or the
new survey.top config key.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:45:24 +02:00
Derrick Stolee
5484401973 survey: add report of "largest" paths
Since we are already walking our reachable objects using the path-walk API,
let's now collect lists of the paths that contribute most to different
metrics. Specifically, we care about

 * Number of versions.
 * Total size on disk.
 * Total inflated size (no delta or zlib compression).

This information can be critical to discovering which parts of the
repository are causing the most growth, especially on-disk size. Different
packing strategies might help compress data more efficiently, but the toal
inflated size is a representation of the raw size of all snapshots of those
paths. Even when stored efficiently on disk, that size represents how much
information must be processed to complete a command such as 'git blame'.

Since the on-disk size is likely to be fragile, stop testing the exact
output of 'git survey' and check that the correct set of headers is
output.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:45:24 +02:00
Derrick Stolee
34983c13e0 survey: add ability to track prioritized lists
In future changes, we will make use of these methods. The intention is to
keep track of the top contributors according to some metric. We don't want
to store all of the entries and do a sort at the end, so track a
constant-size table and remove rows that get pushed out depending on the
chosen sorting algorithm.

Co-authored-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by; Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:45:23 +02:00
Derrick Stolee
6a43d0737b survey: show progress during object walk
Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:45:23 +02:00
Derrick Stolee
f39486ae23 survey: summarize total sizes by object type
Now that we have explored objects by count, we can expand that a bit more to
summarize the data for the on-disk and inflated size of those objects. This
information is helpful for diagnosing both why disk space (and perhaps
clone or fetch times) is growing but also why certain operations are slow
because the inflated size of the abstract objects that must be processed is
so large.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:45:23 +02:00
Derrick Stolee
f2f89bc4ae survey: add object count summary
At the moment, nothing is obvious about the reason for the use of the
path-walk API, but this will become more prevelant in future iterations. For
now, use the path-walk API to sum up the counts of each kind of object.

For example, this is the reachable object summary output for my local repo:

REACHABLE OBJECT SUMMARY
========================
Object Type |  Count
------------+-------
       Tags |   1343
    Commits | 179344
      Trees | 314350
      Blobs | 184030

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:45:23 +02:00
Derrick Stolee
8f9e2201cf survey: start pretty printing data in table form
When 'git survey' provides information to the user, this will be presented
in one of two formats: plaintext and JSON. The JSON implementation will be
delayed until the functionality is complete for the plaintext format.

The most important parts of the plaintext format are headers specifying the
different sections of the report and tables providing concreted data.

Create a custom table data structure that allows specifying a list of
strings for the row values. When printing the table, check each column for
the maximum width so we can create a table of the correct size from the
start.

The table structure is designed to be flexible to the different kinds of
output that will be implemented in future changes.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:45:23 +02:00
Jeff Hostetler
5bac1adb95 survey: add command line opts to select references
By default we will scan all references in "refs/heads/", "refs/tags/"
and "refs/remotes/".

Add command line opts let the use ask for all refs or a subset of them
and to include a detached HEAD.

Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:45:21 +02:00
Jeff Hostetler
714beddd72 survey: stub in new experimental 'git-survey' command
Start work on a new 'git survey' command to scan the repository
for monorepo performance and scaling problems.  The goal is to
measure the various known "dimensions of scale" and serve as a
foundation for adding additional measurements as we learn more
about Git monorepo scaling problems.

The initial goal is to complement the scanning and analysis performed
by the GO-based 'git-sizer' (https://github.com/github/git-sizer) tool.
It is hoped that by creating a builtin command, we may be able to take
advantage of internal Git data structures and code that is not
accessible from GO to gain further insight into potential scaling
problems.

Co-authored-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:45:14 +02:00
Johannes Schindelin
192f63ed1b backfill: mark it as experimental
This is a highly useful command, and we want it to get some testing "in
the wild". However, the patches have not yet been reviewed on the Git
mailing list, and are therefore subject to change. By marking the
command as experimental, users will be warned to pay attention to those
changes.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:44:55 +02:00
Derrick Stolee
bff91f1d5e backfill: assume --sparse when sparse-checkout is enabled
The previous change introduced the '--[no-]sparse' option for the 'git
backfill' command, but did not assume it as enabled by default. However,
this is likely the behavior that users will most often want to happen.
Without this default, users with a small sparse-checkout may be confused
when 'git backfill' downloads every version of every object in the full
history.

However, this is left as a separate change so this decision can be reviewed
independently of the value of the '--[no-]sparse' option.

Add a test of adding the '--sparse' option to a repo without sparse-checkout
to make it clear that supplying it without a sparse-checkout is an error.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:53 +02:00
Derrick Stolee
120e50cb9a backfill: add --sparse option
One way to significantly reduce the cost of a Git clone and later fetches is
to use a blobless partial clone and combine that with a sparse-checkout that
reduces the paths that need to be populated in the working directory. Not
only does this reduce the cost of clones and fetches, the sparse-checkout
reduces the number of objects needed to download from a promisor remote.

However, history investigations can be expensie as computing blob diffs will
trigger promisor remote requests for one object at a time. This can be
avoided by downloading the blobs needed for the given sparse-checkout using
'git backfill' and its new '--sparse' mode, at a time that the user is
willing to pay that extra cost.

Note that this is distinctly different from the '--filter=sparse:<oid>'
option, as this assumes that the partial clone has all reachable trees and
we are using client-side logic to avoid downloading blobs outside of the
sparse-checkout cone. This avoids the server-side cost of walking trees
while also achieving a similar goal. It also downloads in batches based on
similar path names, presenting a resumable download if things are
interrupted.

This augments the path-walk API to have a possibly-NULL 'pl' member that may
point to a 'struct pattern_list'. This could be more general than the
sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently
the only consumer.

Be sure to test this in both cone mode and not cone mode. Cone mode has the
benefit that the path-walk can skip certain paths once they would expand
beyond the sparse-checkout.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:43 +02:00
Derrick Stolee
439210584b backfill: add --batch-size=<n> option
Users may want to specify a minimum batch size for their needs. This is only
a minimum: the path-walk API provides a list of OIDs that correspond to the
same path, and thus it is optimal to allow delta compression across those
objects in a single server request.

We could consider limiting the request to have a maximum batch size in the
future.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:35 +02:00
Derrick Stolee
2fae6ee07c backfill: basic functionality and tests
The default behavior of 'git backfill' is to fetch all missing blobs that
are reachable from HEAD. Document and test this behavior.

The implementation is a very simple use of the path-walk API, initializing
the revision walk at HEAD to start the path-walk from all commits reachable
from HEAD. Ignore the object arrays that correspond to tree entries,
assuming that they are all present already.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:27 +02:00
Derrick Stolee
1dedc040f0 backfill: add builtin boilerplate
In anticipation of implementing 'git backfill', populate the necessary files
with the boilerplate of a new builtin.

RFC TODO: When preparing this for a full implementation, make sure it is
based on the newest standards introduced by [1].

[1] https://lore.kernel.org/git/xmqqjzfq2f0f.fsf@gitster.g/T/#m606036ea2e75a6d6819d6b5c90e729643b0ff7f7
    [PATCH 1/3] builtin: add a repository parameter for builtin functions

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:19 +02:00
Derrick Stolee
4f530f7263 pack-objects: thread the path-based compression
Adapting the implementation of ll_find_deltas(), create a threaded
version of the --path-walk compression step in 'git pack-objects'.

This involves adding a 'regions' member to the thread_params struct,
allowing each thread to own a section of paths. We can simplify the way
jobs are split because there is no value in extending the batch based on
name-hash the way sections of the object entry array are attempted to be
grouped. We re-use the 'list_size' and 'remaining' items for the purpose
of borrowing work in progress from other "victim" threads when a thread
has finished its batch of work more quickly.

Using the Git repository as a test repo, the p5313 performance test
shows that the resulting size of the repo is the same, but the threaded
implementation gives gains of varying degrees depending on the number of
objects being packed. (This was tested on a 16-core machine.)

Test                                    HEAD~1    HEAD
-------------------------------------------------------------
5313.6: thin pack with --path-walk        0.01    0.01  +0.0%
5313.7: thin pack size with --path-walk    475     475  +0.0%
5313.12: big pack with --path-walk        1.99    1.87  -6.0%
5313.13: big pack size with --path-walk  14.4M   14.3M  -0.4%
5313.18: repack with --path-walk         98.14   41.46 -57.8%
5313.19: repack size with --path-walk   197.2M  197.3M  +0.0%

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:00 +02:00
Derrick Stolee
2488d13c1e pack-objects: refactor path-walk delta phase
Previously, the --path-walk option to 'git pack-objects' would compute
deltas inline with the path-walk logic. This would make the progress
indicator look like it is taking a long time to enumerate objects, and
then very quickly computed deltas.

Instead of computing deltas on each region of objects organized by tree,
store a list of regions corresponding to these groups. These can later
be pulled from the list for delta compression before doing the "global"
delta search.

This presents a new progress indicator that can be used in tests to
verify that this stage is happening.

The current implementation is not integrated with threads, but could be
done in a future update.

Since we do not attempt to sort objects by size until after exploring
all trees, we can remove the previous change to t5530 due to a different
error message appearing first.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:00 +02:00
Derrick Stolee
c7bd1dfe98 scalar: enable path-walk during push via config
Repositories registered with Scalar are expected to be client-only
repositories that are rather large. This means that they are more likely to
be good candidates for using the --path-walk option when running 'git
pack-objects', especially under the hood of 'git push'. Enable this config
in Scalar repositories.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:00 +02:00
Derrick Stolee
5b38114dc6 pack-objects: enable --path-walk via config
Users may want to enable the --path-walk option for 'git pack-objects' by
default, especially underneath commands like 'git push' or 'git repack'.

This should be limited to client repositories, since the --path-walk option
disables bitmap walks, so would be bad to include in Git servers when
serving fetches and clones. There is potential that it may be helpful to
consider when repacking the repository, to take advantage of improved deltas
across historical versions of the same files.

Much like how "pack.useSparse" was introduced and included in
"feature.experimental" before being enabled by default, use the repository
settings infrastructure to make the new "pack.usePathWalk" config enabled by
"feature.experimental" and "feature.manyFiles".

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:00 +02:00