Commit Graph

159577 Commits

Author SHA1 Message Date
Johannes Schindelin
1cdf89e88b Introduce 'git backfill' to get missing blobs in a partial clone (#5172)
This change introduces the `git backfill` command which uses the path
walk API to download missing blobs in a blobless partial clone.

By downloading blobs that correspond to the same file path at the same
time, we hope to maximize the potential benefits of delta compression
against multiple versions.

These downloads occur in a configurable batch size, presenting a
mechanism to perform "resumable" clones: `git clone --filter=blob:none`
gets the commits and trees, then `git backfill` will download all
missing blobs. If `git backfill` is interrupted partway through, it can
be restarted and will redownload only the missing objects.

When combining blobless partial clones with sparse-checkout, `git
backfill` will assume its `--sparse` option and download only the blobs
within the sparse-checkout. Users may want to do this as the repo size
will still be smaller than the full repo size, but commands like `git
blame` or `git log -L` will not suffer from many one-by-one blob
downloads.

Future directions should consider adding a pathspec or file prefix to
further focus which paths are being downloaded in a batch.
2024-10-03 08:57:34 +02:00
Derrick Stolee
ef977e276d Add path walk API and its use in 'git pack-objects' (#5171)
This is a follow up to #5157 as well as motivated by the RFC in
gitgitgadget/git#1786.

We have ways of walking all objects, but it is focused on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
2024-10-03 08:57:34 +02:00
Johannes Schindelin
c316eab995 pack-objects: create new name-hash algorithm (#5157)
This is an updated version of gitgitgadget/git#1785, intended for early
consumption into Git for Windows.

The idea here is to add a new `--full-name-hash` option to `git
pack-objects` and `git repack`. This adjusts the name-hash value used
for finding delta bases in such a way that uses the full path name with
a lower likelihood of collisions than the default name-hash algorithm.
In many repositories with name-hash collisions and many versions of
those paths, this can significantly reduce the size of a full repack. It
can also help in certain cases of `git push`, but only if the pack is
already artificially inflated by name-hash collisions; cases that find
"sibling" deltas as better choices become worse with `--full-name-hash`.

Thus, this option is currently recommended for full repacks of large
repos, and on client machines without reachability bitmaps.

Some care is taken to ignore this option when using bitmaps, either
writing bitmaps or using a bitmap walk during reads. The bitmap file
format contains name-hash values, but no way to indicate which function
is used, so compatibility is a concern for bitmaps. Future work could
explore this idea.

After this PR is merged, then the more-involved `--path-walk` option may
be considered.
2024-10-03 08:57:33 +02:00
Johannes Schindelin
05be223493 Merge branch 'run-command-be-helpful-when-Git-LFS-fails-on-Windows-7'
Since Git LFS v3.5.x implicitly dropped Windows 7 support, we now want
users to be advised _what_ is going wrong on that Windows version. This
topic branch goes out of its way to provide users with such guidance.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:33 +02:00
Johannes Schindelin
d5af88fbcc Merge branch 'Fallback-to-AppData-if-XDG-CONFIG-HOME-is-unset'
This topic branch adds support for a more Windows-native user-wide
config file than `XDG_CONFIG_HOME` (or `~/.config/`) will ever be.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:33 +02:00
Johannes Schindelin
a9d2c13795 Merge branch 'Fix-i686-build-with-GCC-v14'
This fixes a long-time compile warning turned error by GCC v14.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:33 +02:00
Johannes Schindelin
3cfb527016 Merge branch 'run-t5601-and-t7406-with-symlinks-on-windows-10'
This topic branch contains a patch that made it into Git for Windows
v2.45.1 but not into Git v2.45.1 (because the latter does not come with
symlink support on Windows).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:33 +02:00
Johannes Schindelin
98b8cfb898 common-main.c: fflush stdout buffer when exit (#4901) 2024-10-03 08:57:32 +02:00
Johannes Schindelin
76ec3f27ac ARM64: Embed manifest properly (#4718)
Teach our ARM64 based builds to embed the manifest file correctly.

This fixes #4707
2024-10-03 08:57:32 +02:00
Johannes Schindelin
bb0eb0b3e5 win32: use native ANSI sequence processing, if possible (#4700)
Windows 10 version 1511 (also known as Anniversary Update), according to
https://learn.microsoft.com/en-us/windows/console/console-virtual-terminal-sequences
introduced native support for ANSI sequence processing. This allows
using colors from the entire 24-bit color range.

All we need to do is test whether the console's "virtual processing
support" can be enabled. If it can, we do not even need to start the
`console_thread` to handle ANSI sequences.

Incidentally, this addresses
https://github.com/git-for-windows/git/issues/3712.
2024-10-03 08:57:32 +02:00
Johannes Schindelin
e44720f77a Additional error checks for issuing the windows.appendAtomically warning (#4528)
Another (hopefully clean) PR for showing the error warning about atomic
append on windows after failure on APFS, which returns EBADF not EINVAL.

Signed-off-by: David Lomas <dl3@pale-eds.co.uk>
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
2024-10-03 08:57:32 +02:00
Johannes Schindelin
32c3fd83c0 Merge branch 'nano-server'
This patch adds a GitHub workflow (to be triggered manually) to allow
for conveniently verifying that Git and Scalar still work as intended in
Windows Nano Server (a relatively small container base image that is
frequently used where a "small Windows" is needed, e.g. in automation
;-))

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:31 +02:00
Johannes Schindelin
2092813b17 Lazy load libcurl, allowing for an SSL/TLS backend-specific libcurl (#4410)
As per
https://github.com/git-for-windows/git/issues/4350#issuecomment-1485041503,
the major block for upgrading Git for Windows' OpenSSL from v1.1 to v3
is the tricky part where such an upgrade would break `git fetch`/`git
clone` and `git push` because the libcurl depends on the OpenSSL DLL,
and the major version bump will _change_ the file name of said DLL.

To overcome that, the plan is to build libcurl flavors for each
supported SSL/TLS backend, aligning with the way MSYS2 builds libcurl,
then switch Git for Windows' SDK to the Secure Channel-flavored libcurl,
and teach Git to look for the specific flavor of libcurl corresponding
to the `http.sslBackend` setting (if that was configured).

Here is the PR to teach Git that trick.
2024-10-03 08:57:31 +02:00
Johannes Schindelin
4ae2e09308 Git GUI: fix Repository>Explore Working Copy (#4357)
This is a companion PR of https://github.com/prati0100/git-gui/pull/95

Since Git v2.39.1, we are a bit more stringent in searching the PATH. In
particular, we specifically require the `.exe` suffix.

However, the `Repository>Explore Working Copy` command asks for
`explorer.exe` to be found on the `PATH`, which _already_ has that
suffix.

Let's unstartle the PATH-finding logic about this scenario.

This fixes https://github.com/git-for-windows/git/issues/4356
2024-10-03 08:57:31 +02:00
Johannes Schindelin
f6a0aa8bbb Skip linking the "dashed" git-<command>s for built-ins (#4252)
It is merely a historical wart that, say, `git-commit` exists in the
`libexec/git-core/` directory, a tribute to the original idea to let Git
be essentially a bunch of Unix shell scripts revolving around very few
"plumbing" (AKA low-level) commands.

Git has evolved a lot from there. These days, most of Git's
functionality is contained within the `git` executable, in the form of
"built-in" commands.

To accommodate for scripts that use the "dashed" form of Git commands,
even today, Git provides hard-links that make the `git` executable
available as, say, `git-commit`, just in case that an old script has not
been updated to invoke `git commit`.

Those hard-links do not come cheap: they take about half a minute for
every build of Git on Windows, they are mistaken for taking up huge
amounts of space by some Windows Explorer versions that do not
understand hard-links, and therefore many a "bug" report had to be
addressed.

The "dashed form" has been officially deprecated in Git version 1.5.4,
which was released on February 2nd, 2008, i.e. a very long time ago.
This deprecation was never finalized by skipping these hard-links, but
we can start the process now, in Git for Windows.

This addresses the concern raised in
https://github.com/git-for-windows/git/pull/4185#discussion_r1051661894
2024-10-03 08:57:31 +02:00
Johannes Schindelin
e0ff312c57 Fix global repository field not being cleared (#4083)
It is checked for w.r.t. global repository struct down in the callstack
in compatibility layer for MinGW before being assigned in the function
that `free()`'d it.
2024-10-03 08:57:31 +02:00
Johannes Schindelin
a2575dcf2b Add support for CLANGARM64 target (#3916)
**This requires an ARM64-machine with Windows 11 installed (which
supports x64 emulation for MSYS2)**

### The main idea

- Use the main MSYS2/git-sdk-64 setup, which works on Windows 11 on ARM
thanks to x64-emulation
- Configure the official `clangarm64` MSYS2 repo
- Install `mingw-w64-clang-aarch64-toolchain` which contains the
ARM64-native Clang compiler
2024-10-03 08:57:30 +02:00
Johannes Schindelin
2080018704 Merge branch 'builtin-swap-functions'
Do prefer GCC's built-in bit-swap functions.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:30 +02:00
Johannes Schindelin
7195c553b6 Fix Windows version resources (#4092)
Add `FileVersion`, which is a required string ([Microsoft
documentation](https://learn.microsoft.com/en-us/windows/win32/menurc/versioninfo-resource))
in the `StringFileInfo` block.
As not all required strings were present in the block, none were being
included.
Fixes #4090

After including the `FileVersion` string, all other defined strings are
now being included on executables.

File version information for `git.exe` has changed from:
```
PS C:\Program Files\Git\bin> [System.Diagnostics.FileVersionInfo]::GetVersionInfo("C:\Data\git-sdk-64\usr\src\git\git.exe") | Select-Object *

FileVersionRaw     : 2.38.1.1
ProductVersionRaw  : 2.38.1.1
Comments           :
CompanyName        :
FileBuildPart      : 1
FileDescription    :
FileMajorPart      : 2
FileMinorPart      : 38
FileName           : C:\Data\git-sdk-64\usr\src\git\git.exe
FilePrivatePart    : 1
FileVersion        :
InternalName       :
IsDebug            : False
IsPatched          : False
IsPrivateBuild     : False
IsPreRelease       : False
IsSpecialBuild     : False
Language           : English (United States)
LegalCopyright     :
LegalTrademarks    :
OriginalFilename   :
PrivateBuild       :
ProductBuildPart   : 1
ProductMajorPart   : 2
ProductMinorPart   : 38
ProductName        :
ProductPrivatePart : 1
ProductVersion     :
SpecialBuild       :
```

To the following:
```
PS C:\Program Files\Git\bin> [System.Diagnostics.FileVersionInfo]::GetVersionInfo("C:\Data\git-sdk-64\usr\src\git\git.exe") | Select-Object *

FileVersionRaw     : 2.38.1.1
ProductVersionRaw  : 2.38.1.1
Comments           :
CompanyName        : The Git Development Community
FileBuildPart      : 1
FileDescription    : Git for Windows
FileMajorPart      : 2
FileMinorPart      : 38
FileName           : C:\Data\git-sdk-64\usr\src\git\git.exe
FilePrivatePart    : 1
FileVersion        : 2.38.1.windows.1.10.g6ed65a6fab
InternalName       : git
IsDebug            : False
IsPatched          : False
IsPrivateBuild     : False
IsPreRelease       : False
IsSpecialBuild     : False
Language           : English (United States)
LegalCopyright     :
LegalTrademarks    :
OriginalFilename   : git.exe
PrivateBuild       :
ProductBuildPart   : 1
ProductMajorPart   : 2
ProductMinorPart   : 38
ProductName        : Git
ProductPrivatePart : 1
ProductVersion     : 2.38.1.windows.1.10.g6ed65a6fab
SpecialBuild       :
```

I wasn't really expecting `GIT_VERSION` to contain the Git commit, I was
hoping for just `2.38.1` or `2.38.1.1`, at least for the `FileVersion`
string.

Anybody know if it's possible to concatenate the `MAJOR`, `MINOR`,
`MICRO`, and `PATCHLEVEL` fields with dots, or if there's another
variable that can be used (with or without `PATCHLEVEL`)?
Alternatively, use the complete `GIT_VERSION` for both `FileVersion` and
`ProductVersion`.
2024-10-03 08:57:30 +02:00
Johannes Schindelin
f2a98e9cc0 Merge pull request #3942 from rimrul/mingw-tsaware
MinGW: link as terminal server aware
2024-10-03 08:57:30 +02:00
Johannes Schindelin
5139673e38 Merge branch 'ci-fixes'
Backport a couple fixes to make the CI build run again (so much for
reproducible builds...).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:29 +02:00
Johannes Schindelin
cc588b47d4 Merge branch 'fsync-object-files-always'
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:29 +02:00
Johannes Schindelin
4955a5717d Merge branch 'optionally-dont-append-atomically-on-windows'
Fix append failure issue under remote directories #2753

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:57:29 +02:00
Johannes Schindelin
0896560f1b Merge pull request #3875 from 1480c1/wine/detect_msys_tty
winansi: check result before using Name for pty
2024-10-03 08:57:29 +02:00
Johannes Schindelin
2f7e2d877c Merge pull request #3751 from rkitover/native-term
mingw: set $env:TERM=xterm-256color for newer OSes
2024-10-03 08:57:29 +02:00
Derrick Stolee
68d57b17a6 Merge pull request #3791: Various fixes around safe.directory
The first three commits are rebased versions of those in gitgitgadget/git#1215. These allow the following:

1. Fix `git config --global foo.bar <path>` from allowing the `<path>`. As a bonus, users with a config value starting with `/` will not get a warning about "old-style" paths needing a "`%(prefix)/`".

2. When in WSL, the path starts with `/` so it needs to be interpolated properly. Update the warning to include `%(prefix)/` to get the right value for WSL users. (This is specifically for using Git for Windows from Git Bash, but in a WSL directory.)

3. When using WSL, the ownership check fails and reports an error message. This is noisy, and happens even if the user has marked the path with `safe.directory`. Remove that error message.
2024-10-03 08:57:28 +02:00
Johannes Schindelin
37075e88ea Merge pull request #3533 from PhilipOakley/hashliteral_t
Begin `unsigned long`->`size_t` conversion to support large files on Windows
2024-10-03 08:57:28 +02:00
Johannes Schindelin
192f63ed1b backfill: mark it as experimental
This is a highly useful command, and we want it to get some testing "in
the wild". However, the patches have not yet been reviewed on the Git
mailing list, and are therefore subject to change. By marking the
command as experimental, users will be warned to pay attention to those
changes.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-10-03 08:44:55 +02:00
Derrick Stolee
bff91f1d5e backfill: assume --sparse when sparse-checkout is enabled
The previous change introduced the '--[no-]sparse' option for the 'git
backfill' command, but did not assume it as enabled by default. However,
this is likely the behavior that users will most often want to happen.
Without this default, users with a small sparse-checkout may be confused
when 'git backfill' downloads every version of every object in the full
history.

However, this is left as a separate change so this decision can be reviewed
independently of the value of the '--[no-]sparse' option.

Add a test of adding the '--sparse' option to a repo without sparse-checkout
to make it clear that supplying it without a sparse-checkout is an error.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:53 +02:00
Derrick Stolee
120e50cb9a backfill: add --sparse option
One way to significantly reduce the cost of a Git clone and later fetches is
to use a blobless partial clone and combine that with a sparse-checkout that
reduces the paths that need to be populated in the working directory. Not
only does this reduce the cost of clones and fetches, the sparse-checkout
reduces the number of objects needed to download from a promisor remote.

However, history investigations can be expensie as computing blob diffs will
trigger promisor remote requests for one object at a time. This can be
avoided by downloading the blobs needed for the given sparse-checkout using
'git backfill' and its new '--sparse' mode, at a time that the user is
willing to pay that extra cost.

Note that this is distinctly different from the '--filter=sparse:<oid>'
option, as this assumes that the partial clone has all reachable trees and
we are using client-side logic to avoid downloading blobs outside of the
sparse-checkout cone. This avoids the server-side cost of walking trees
while also achieving a similar goal. It also downloads in batches based on
similar path names, presenting a resumable download if things are
interrupted.

This augments the path-walk API to have a possibly-NULL 'pl' member that may
point to a 'struct pattern_list'. This could be more general than the
sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently
the only consumer.

Be sure to test this in both cone mode and not cone mode. Cone mode has the
benefit that the path-walk can skip certain paths once they would expand
beyond the sparse-checkout.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:43 +02:00
Derrick Stolee
439210584b backfill: add --batch-size=<n> option
Users may want to specify a minimum batch size for their needs. This is only
a minimum: the path-walk API provides a list of OIDs that correspond to the
same path, and thus it is optimal to allow delta compression across those
objects in a single server request.

We could consider limiting the request to have a maximum batch size in the
future.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:35 +02:00
Derrick Stolee
2fae6ee07c backfill: basic functionality and tests
The default behavior of 'git backfill' is to fetch all missing blobs that
are reachable from HEAD. Document and test this behavior.

The implementation is a very simple use of the path-walk API, initializing
the revision walk at HEAD to start the path-walk from all commits reachable
from HEAD. Ignore the object arrays that correspond to tree entries,
assuming that they are all present already.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:27 +02:00
Derrick Stolee
1dedc040f0 backfill: add builtin boilerplate
In anticipation of implementing 'git backfill', populate the necessary files
with the boilerplate of a new builtin.

RFC TODO: When preparing this for a full implementation, make sure it is
based on the newest standards introduced by [1].

[1] https://lore.kernel.org/git/xmqqjzfq2f0f.fsf@gitster.g/T/#m606036ea2e75a6d6819d6b5c90e729643b0ff7f7
    [PATCH 1/3] builtin: add a repository parameter for builtin functions

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:19 +02:00
Derrick Stolee
4f530f7263 pack-objects: thread the path-based compression
Adapting the implementation of ll_find_deltas(), create a threaded
version of the --path-walk compression step in 'git pack-objects'.

This involves adding a 'regions' member to the thread_params struct,
allowing each thread to own a section of paths. We can simplify the way
jobs are split because there is no value in extending the batch based on
name-hash the way sections of the object entry array are attempted to be
grouped. We re-use the 'list_size' and 'remaining' items for the purpose
of borrowing work in progress from other "victim" threads when a thread
has finished its batch of work more quickly.

Using the Git repository as a test repo, the p5313 performance test
shows that the resulting size of the repo is the same, but the threaded
implementation gives gains of varying degrees depending on the number of
objects being packed. (This was tested on a 16-core machine.)

Test                                    HEAD~1    HEAD
-------------------------------------------------------------
5313.6: thin pack with --path-walk        0.01    0.01  +0.0%
5313.7: thin pack size with --path-walk    475     475  +0.0%
5313.12: big pack with --path-walk        1.99    1.87  -6.0%
5313.13: big pack size with --path-walk  14.4M   14.3M  -0.4%
5313.18: repack with --path-walk         98.14   41.46 -57.8%
5313.19: repack size with --path-walk   197.2M  197.3M  +0.0%

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:00 +02:00
Derrick Stolee
2488d13c1e pack-objects: refactor path-walk delta phase
Previously, the --path-walk option to 'git pack-objects' would compute
deltas inline with the path-walk logic. This would make the progress
indicator look like it is taking a long time to enumerate objects, and
then very quickly computed deltas.

Instead of computing deltas on each region of objects organized by tree,
store a list of regions corresponding to these groups. These can later
be pulled from the list for delta compression before doing the "global"
delta search.

This presents a new progress indicator that can be used in tests to
verify that this stage is happening.

The current implementation is not integrated with threads, but could be
done in a future update.

Since we do not attempt to sort objects by size until after exploring
all trees, we can remove the previous change to t5530 due to a different
error message appearing first.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:00 +02:00
Derrick Stolee
c7bd1dfe98 scalar: enable path-walk during push via config
Repositories registered with Scalar are expected to be client-only
repositories that are rather large. This means that they are more likely to
be good candidates for using the --path-walk option when running 'git
pack-objects', especially under the hood of 'git push'. Enable this config
in Scalar repositories.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:00 +02:00
Derrick Stolee
5b38114dc6 pack-objects: enable --path-walk via config
Users may want to enable the --path-walk option for 'git pack-objects' by
default, especially underneath commands like 'git push' or 'git repack'.

This should be limited to client repositories, since the --path-walk option
disables bitmap walks, so would be bad to include in Git servers when
serving fetches and clones. There is potential that it may be helpful to
consider when repacking the repository, to take advantage of improved deltas
across historical versions of the same files.

Much like how "pack.useSparse" was introduced and included in
"feature.experimental" before being enabled by default, use the repository
settings infrastructure to make the new "pack.usePathWalk" config enabled by
"feature.experimental" and "feature.manyFiles".

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:00 +02:00
Derrick Stolee
a9bc29e91e repack: add --path-walk option
Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.

Add the --path-walk option to the performance tests in p5313.

For the microsoft/fluentui repo [1] checked out at a specific commit [2],
the results are very interesting:

Test                                           this tree
------------------------------------------------------------------
5313.2: thin pack                              0.40(0.47+0.04)
5313.3: thin pack size                                    1.2M
5313.4: thin pack with --full-name-hash        0.09(0.10+0.04)
5313.5: thin pack size with --full-name-hash             22.8K
5313.6: thin pack with --path-walk             0.08(0.06+0.02)
5313.7: thin pack size with --path-walk                  20.8K
5313.8: big pack                               2.16(8.43+0.23)
5313.9: big pack size                                    17.7M
5313.10: big pack with --full-name-hash        1.42(3.06+0.21)
5313.11: big pack size with --full-name-hash             18.0M
5313.12: big pack with --path-walk             2.21(8.39+0.24)
5313.13: big pack size with --path-walk                  17.8M
5313.14: repack                                98.05(662.37+2.64)
5313.15: repack size                                    449.1K
5313.16: repack with --full-name-hash          33.95(129.44+2.63)
5313.17: repack size with --full-name-hash              182.9K
5313.18: repack with --path-walk               106.21(121.58+0.82)
5313.19: repack size with --path-walk                   159.6K

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

This repo suffers from having a lot of paths that collide in the name
hash, so examining them in groups by path leads to better deltas. Also,
in this case, the single-threaded implementation is competitive with the
full repack. This is saving time diffing files that have significant
differences from each other.

A similar, but private, repo has even more extremes in the thin packs:

Test                                           this tree
--------------------------------------------------------------
5313.2: thin pack                              2.39(2.91+0.10)
5313.3: thin pack size                                    4.5M
5313.4: thin pack with --full-name-hash        0.29(0.47+0.12)
5313.5: thin pack size with --full-name-hash             15.5K
5313.6: thin pack with --path-walk             0.35(0.31+0.04)
5313.7: thin pack size with --path-walk                  14.2K

Notice, however, that while the --full-name-hash version is working
quite well in these cases for the thin pack, it does poorly for some
other standard cases, such as this test on the Linux kernel repository:

Test                                           this tree
--------------------------------------------------------------
5313.2: thin pack                              0.01(0.00+0.00)
5313.3: thin pack size                                     310
5313.4: thin pack with --full-name-hash        0.00(0.00+0.00)
5313.5: thin pack size with --full-name-hash              1.4K
5313.6: thin pack with --path-walk             0.00(0.00+0.00)
5313.7: thin pack size with --path-walk                    310

Here, the --full-name-hash option does much worse than the default name
hash, but the path-walk option does exactly as well.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:00 +02:00
Derrick Stolee
13ff91f665 pack-objects: introduce GIT_TEST_PACK_PATH_WALK
There are many tests that validate whether 'git pack-objects' works as
expected. Instead of duplicating these tests, add a new test environment
variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
when specified.

This was useful in testing the implementation of the --path-walk
implementation, especially in conjunction with test such as:

 - t0411-clone-from-partial.sh : One test fetches from a repo that does
   not have the boundary objects. This causes the path-based walk to
   fail. Disable the variable for this test.

 - t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo
   without a boundary object.

 - t5310-pack-bitmaps.sh : One test compares the case when packing with
   bitmaps to the case when packing without them. Since we disable the
   test variable when writing bitmaps, this causes a difference in the
   object list (the --path-walk option adds an extra object). Specify
   --no-path-walk in both processes for the comparison. Another test
   checks for a specific delta base, but when computing dynamically
   without using bitmaps, the base object it too small to be considered
   in the delta calculations so no base is used.

 - t5316-pack-delta-depth.sh : This script cares about certain delta
   choices and their chain lengths. The --path-walk option changes how
   these chains are selected, and thus changes the results of this test.

 - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
   the --sparse option and how it combines with --path-walk.

 - t5332-multi-pack-reuse.sh : This test verifies that the preferred
   pack is used for delta reuse when possible. The --path-walk option is
   not currently aware of the preferred pack at all, so finds a
   different delta base.

 - t7406-submodule-update.sh : When using the variable, the --depth
   option collides with the --path-walk feature, resulting in a warning
   message. Disable the variable so this warning does not appear.

I want to call out one specific test change that is only temporary:

 - t5530-upload-pack-error.sh : One test cares specifically about an
   "unable to read" error message. Since the current implementation
   performs delta calculations within the path-walk API callback, a
   different "unable to get size" error message appears. When this
   is changed in a future refactoring, this test change can be reverted.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:00 +02:00
Derrick Stolee
8e8a5b87d4 pack-objects: add --path-walk option
In order to more easily compute delta bases among objects that appear at the
exact same path, add a --path-walk option to 'git pack-objects'.

This option will use the path-walk API instead of the object walk given by
the revision machinery. Since objects will be provided in batches
representing a common path, those objects can be tested for delta bases
immediately instead of waiting for a sort of the full object list by
name-hash. This has multiple benefits, including avoiding collisions by
name-hash.

The objects marked as UNINTERESTING are included in these batches, so we
are guaranteeing some locality to find good delta bases.

After the individual passes are done on a per-path basis, the default
name-hash is used to find other opportunistic delta bases that did not
match exactly by the full path name.

RFC TODO: It is important to note that this option is inherently
incompatible with using a bitmap index. This walk probably also does not
work with other advanced features, such as delta islands.

Getting ahead of myself, this option compares well with --full-name-hash
when the packfile is large enough, but also performs at least as well as
the default in all cases that I've seen.

RFC TODO: this should probably be recording the batch locations to another
list so they could be processed in a second phase using threads.

RFC TODO: list some examples of how this outperforms previous pack-objects
strategies. (This is coming in later commits that include performance
test changes.)

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:00 +02:00
Derrick Stolee
ce40df44d4 pack-objects: extract should_attempt_deltas()
This will be helpful in a future change.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:00 +02:00
Derrick Stolee
d0f2213752 path-walk: add prune_all_uninteresting option
This option causes the path-walk API to act like the sparse tree-walk
algorithm implemented by mark_trees_uninteresting_sparse() in
list-objects.c.

Starting from the commits marked as UNINTERESTING, their root trees and
all objects reachable from those trees are UNINTERSTING, at least as we
walk path-by-path. When we reach a path where all objects associated
with that path are marked UNINTERESTING, then do no continue walking the
children of that path.

We need to be careful to pass the UNINTERESTING flag in a deep way on
the UNINTERESTING objects before we start the path-walk, or else the
depth-first search for the path-walk API may accidentally report some
objects as interesting.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:00 +02:00
Derrick Stolee
8e5e612e70 revision: create mark_trees_uninteresting_dense()
The sparse tree walk algorithm was created in d5d2e93577 (revision:
implement sparse algorithm, 2019-01-16) and involves using the
mark_trees_uninteresting_sparse() method. This method takes a repository
and an oidset of tree IDs, some of which have the UNINTERESTING flag and
some of which do not.

Create a method that has an equivalent set of preconditions but uses a
"dense" walk (recursively visits all reachable trees, as long as they
have not previously been marked UNINTERESTING). This is an important
difference from mark_tree_uninteresting(), which short-circuits if the
given tree has the UNINTERESTING flag.

A use of this method will be added in a later change, with a condition
set whether the sparse or dense approach should be used.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:00 +02:00
Derrick Stolee
e6acd743c7 path-walk: allow visiting tags
In anticipation of using the path-walk API to analyze tags or include
them in a pack-file, add the ability to walk the tags that were included
in the revision walk.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:44:00 +02:00
Derrick Stolee
6fedcc49c9 path-walk: allow consumer to specify object types
We add the ability to filter the object types in the path-walk API so
the callback function is called fewer times.

This adds the ability to ask for the commits in a list, as well. Future
changes will add the ability to visit annotated tags.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-10-03 08:43:41 +02:00
Johannes Schindelin
3e7ca7a740 Merge pull request #3417 from dscho/initialize-core.symlinks-earlier
init: respect core.symlinks before copying the templates
2024-10-03 08:19:44 +02:00
Johannes Schindelin
8a6c1e2827 Merge pull request #3306 from PhilipOakley/vs-sln
Make Git for Windows start builds in modern Visual Studio
2024-10-03 08:19:44 +02:00
Johannes Schindelin
751a65aa0d Merge pull request #3349 from vdye/feature/ci-subtree-tests
Add `contrib/subtree` test execution to CI builds
2024-10-03 08:19:43 +02:00
Johannes Schindelin
9624e70bf1 Merge pull request #3293 from pascalmuller/http-support-automatically-sending-client-certificate
http: Add support for enabling automatic sending of SSL client certificate
2024-10-03 08:19:43 +02:00
Johannes Schindelin
b154dbf2e1 Merge pull request #3220 from dscho/there-is-no-vs/master-anymore
Let the documentation reflect that there is no vs/master anymore
2024-10-03 08:19:42 +02:00