This option is still under discussion on the Git mailing list.
We still would like to have some real-world data, and the best way to
get it is to get a Git for Windows release into users' hands so that
they can test it.
Nevertheless, without the official blessing of the Git maintainer, this
optionis experimental, and we need to be clear about that.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Add a new test-tool helper, name-hash, to output the value of the
name-hash algorithms for the input list of strings, one per line.
Since the name-hash values can be stored in the .bitmap files, it is
important that these hash functions do not change across Git versions.
Add a simple test to t5310-pack-bitmaps.sh to provide some testing of
the current values. Due to how these functions are implemented, it would
be difficult to change them without disturbing these values.
Create a performance test that uses test_size to demonstrate how
collisions occur for these hash algorithms. This test helps inform
someone as to the behavior of the name-hash algorithms for their repo
based on the paths at HEAD.
My copy of the Git repository shows modest statistics around the
collisions of the default name-hash algorithm:
Test this tree
-----------------------------------------------------------------
5314.1: paths at head 4.5K
5314.2: number of distinct name-hashes 4.1K
5314.3: number of distinct full-name-hashes 4.5K
5314.4: maximum multiplicity of name-hashes 13
5314.5: maximum multiplicity of fullname-hashes 1
Here, the maximum collision multiplicity is 13, but around 10% of paths
have a collision with another path.
In a more interesting example, the microsoft/fluentui [1] repo had these
statistics at time of committing:
Test this tree
-----------------------------------------------------------------
5314.1: paths at head 19.6K
5314.2: number of distinct name-hashes 8.2K
5314.3: number of distinct full-name-hashes 19.6K
5314.4: maximum multiplicity of name-hashes 279
5314.5: maximum multiplicity of fullname-hashes 1
[1] https://github.com/microsoft/fluentui
That demonstrates that of the nearly twenty thousand path names, they
are assigned around eight thousand distinct values. 279 paths are
assigned to a single value, leading the packing algorithm to sort
objects from those paths together, by size.
In this repository, no collisions occur for the full-name-hash
algorithm.
In a more extreme example, an internal monorepo had a much worse
collision rate:
Test this tree
-----------------------------------------------------------------
5314.1: paths at head 221.6K
5314.2: number of distinct name-hashes 72.0K
5314.3: number of distinct full-name-hashes 221.6K
5314.4: maximum multiplicity of name-hashes 14.4K
5314.5: maximum multiplicity of fullname-hashes 2
Even in this repository with many more paths at HEAD, the collision rate
was low and the maximum number of paths being grouped into a single
bucket by the full-path-name algorithm was two.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
As custom options are added to 'git pack-objects' and 'git repack' to
adjust how compression is done, use this new performance test script to
demonstrate their effectiveness in performance and size.
The recently-added --full-name-hash option swaps the default name-hash
algorithm with one that attempts to uniformly distribute the hashes
based on the full path name instead of the last 16 characters.
This has a dramatic effect on full repacks for repositories with many
versions of most paths. It can have a negative impact on cases such as
pushing a single change.
This can be seen by running pt5313 on the open source fluentui
repository [1]. Most commits will have this kind of output for the thin
and big pack cases, though certain commits (such as [2]) will have
problematic thin pack size for other reasons.
[1] https://github.com/microsoft/fluentui
[2] a637a06df05360ce5ff21420803f64608226a875
Checked out at the parent of [2], I see the following statistics:
Test this tree
------------------------------------------------------------------
5313.2: thin pack 0.02(0.01+0.01)
5313.3: thin pack size 1.1K
5313.4: thin pack with --full-name-hash 0.02(0.01+0.00)
5313.5: thin pack size with --full-name-hash 3.0K
5313.6: big pack 1.65(3.35+0.24)
5313.7: big pack size 58.0M
5313.8: big pack with --full-name-hash 1.53(2.52+0.18)
5313.9: big pack size with --full-name-hash 57.6M
5313.10: repack 176.52(706.60+3.53)
5313.11: repack size 446.7K
5313.12: repack with --full-name-hash 37.47(134.18+3.06)
5313.13: repack size with --full-name-hash 183.1K
Note that this demonstrates a 3x size _increase_ in the case that
simulates a small "git push". The size change is neutral on the case of
pushing the difference between HEAD and HEAD~1000.
However, the full repack case is both faster and more efficient.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
This also adds the '--full-name-hash' option introduced in the previous
change and adds newlines to the synopsis.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Add a new environment variable to opt-in to the --full-name-hash option
in 'git pack-objects'. This allows for extra testing of the feature
without repeating all of the test scenarios.
But this option isn't free. There are a few tests that change behavior
with the variable enabled.
First, there are a few tests that are very sensitive to certain delta
bases being picked. These are both involving the generation of thin
bundles and then counting their objects via 'git index-pack --fix-thin'
which pulls the delta base into the new packfile. For these tests,
disable the option as a decent long-term option.
Second, there are two tests in t5616-partial-clone.sh that I believe are
actually broken scenarios. While the client is set up to clone the
'promisor-server' repo via a treeless partial clone filter (tree:0),
that filter does not translate to the 'server' repo. Thus, fetching from
these repos causes the server to think that the client has all reachable
trees and blobs from the commits advertised as 'haves'. This leads the
server to providing a thin pack assuming those objects as delta bases.
Changing the name-hash algorithm presents new delta bases and thus
breaks the expectations of these tests. An alternative could be to set
up 'server' as a promisor server with the correct filter enabled. This
may also point out more issues with partial clone being set up as a
remote-based filtering mechanism and not a repository-wide setting. For
now, do the minimal change to make the test work by disabling the test
variable.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
The new '--full-name-hash' option for 'git repack' is a simple
pass-through to the underlying 'git pack-objects' subcommand. However,
this subcommand may have other options and a temporary filename as part
of the subcommand execution that may not be predictable or could change
over time.
The existing test_subcommand method requires an exact list of arguments
for the subcommand. This is too rigid for our needs here, so create a
new method, test_subcommand_flex. Use it to check that the
--full-name-hash option is passing through.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
The pack_name_hash() method has not been materially changed since it was
introduced in ce0bd64299a (pack-objects: improve path grouping
heuristics., 2006-06-05). The intention here is to group objects by path
name, but also attempt to group similar file types together by making
the most-significant digits of the hash be focused on the final
characters.
Here's the crux of the implementation:
/*
* This effectively just creates a sortable number from the
* last sixteen non-whitespace characters. Last characters
* count "most", so things that end in ".c" sort together.
*/
while ((c = *name++) != 0) {
if (isspace(c))
continue;
hash = (hash >> 2) + (c << 24);
}
As the comment mentions, this only cares about the last sixteen
non-whitespace characters. This cause some filenames to collide more
than others. Here are some examples that I've seen while investigating
repositories that are growing more than they should be:
* "/CHANGELOG.json" is 15 characters, and is created by the beachball
[1] tool. Only the final character of the parent directory can
differntiate different versions of this file, but also only the two
most-significant digits. If that character is a letter, then this is
always a collision. Similar issues occur with the similar
"/CHANGELOG.md" path, though there is more opportunity for
differences in the parent directory.
* Localization files frequently have common filenames but differentiate
via parent directories. In C#, the name "/strings.resx.lcl" is used
for these localization files and they will all collide in name-hash.
[1] https://github.com/microsoft/beachball
I've come across many other examples where some internal tool uses a
common name across multiple directories and is causing Git to repack
poorly due to name-hash collisions.
It is clear that the existing name-hash algorithm is optimized for
repositories with short path names, but also is optimized for packing a
single snapshot of a repository, not a repository with many versions of
the same file. In my testing, this has proven out where the name-hash
algorithm does a good job of finding peer files as delta bases when
unable to use a historical version of that exact file.
However, for repositories that have many versions of most files and
directories, it is more important that the objects that appear at the
same path are grouped together.
Create a new pack_full_name_hash() method and a new --full-name-hash
option for 'git pack-objects' to call that method instead. Add a simple
pass-through for 'git repack --full-name-hash' for additional testing in
the context of a full repack, where I expect this will be most
effective.
The hash algorithm is as simple as possible to be reasonably effective:
for each character of the path string, add a multiple of that character
and a large prime number (chosen arbitrarily, but intended to be large
relative to the size of a uint32_t). Then, shift the current hash value
to the right by 5, with overlap. The addition and shift parameters are
standard mechanisms for creating hard-to-predict behaviors in the bits
of the resulting hash.
This is not meant to be cryptographic at all, but uniformly distributed
across the possible hash values. This creates a hash that appears
pseudorandom. There is no ability to consider similar file types as
being close to each other.
In a later change, a test-tool will be added so the effectiveness of
this hash can be demonstrated directly.
For now, let's consider how effective this mechanism is when repacking a
repository with and without the --full-name-hash option. Specifically,
let's use 'git repack -adf [--full-name-hash]' as our test.
On the Git repository, we do not expect much difference. All path names
are short. This is backed by our results:
| Stage | Pack Size | Repack Time |
|-----------------------|-----------|-------------|
| After clone | 260 MB | N/A |
| Standard Repack | 127MB | 106s |
| With --full-name-hash | 126 MB | 99s |
This example demonstrates how there is some natural overhead coming from
the cloned copy because the server is hosting many forks and has not
optimized for exactly this set of reachable objects. But the full repack
has similar characteristics with and without --full-name-hash.
However, we can test this in a repository that uses one of the
problematic naming conventions above. The fluentui [2] repo uses
beachball to generate CHANGELOG.json and CHANGELOG.md files, and these
files have very poor delta characteristics when comparing against
versions across parent directories.
| Stage | Pack Size | Repack Time |
|-----------------------|-----------|-------------|
| After clone | 694 MB | N/A |
| Standard Repack | 438 MB | 728s |
| With --full-name-hash | 168 MB | 142s |
[2] https://github.com/microsoft/fluentui
In this example, we see significant gains in the compressed packfile
size as well as the time taken to compute the packfile.
Using a collection of repositories that use the beachball tool, I was
able to make similar comparisions with dramatic results. While the
fluentui repo is public, the others are private so cannot be shared for
reproduction. The results are so significant that I find it important to
share here:
| Repo | Standard Repack | With --full-name-hash |
|----------|-----------------|-----------------------|
| fluentui | 438 MB | 168 MB |
| Repo B | 6,255 MB | 829 MB |
| Repo C | 37,737 MB | 7,125 MB |
| Repo D | 130,049 MB | 6,190 MB |
Future changes could include making --full-name-hash implied by a config
value or even implied by default during a full repack.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
As reported in https://lore.kernel.org/git/ZuPKvYP9ZZ2mhb4m@pks.im/,
libcurl v8.10.0 had a regression that was picked up by Git's t5559.30
"large fetch-pack requests can be sent using chunked encoding".
This bug was fixed in libcurl v8.10.1.
Sadly, the macos-13 runner image was updated in the brief window between
these two libcurl versions, breaking each and every CI build, as
reported at https://github.com/git-for-windows/git/issues/5159.
This would usually not matter, we would just ignore the failing CI
builds until the macos-13 runner image is rebuilt in a couple of days,
and then the CI builds would succeed again.
However.
As has become the custom, a surprise Git version was released, and now
that Git for Windows wants to follow suit, since Git for Windows has
this custom of trying to never release a version with a failing CI
build, we _must_ work around it.
This patch implements this work-around, basically for the sake of Git
for Windows v2.46.2's CI build.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
When adding tree objects, we are very careful to avoid adding the same
tree object more than once. There was one small gap in that logic,
though: when adding a root tree object. Two refs can easily share the
same root tree object, and we should still not add it more than once.
When adding tree objects, we are very careful to avoid adding the same
tree object more than once. There was one small gap in that logic,
though: when adding a root tree object. Two refs can easily share the
same root tree object, and we should still not add it more than once.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
macOS with fsmonitor daemon can hang forever when a submodule is
involved, which has been corrected.
* kn/osx-fsmonitor-with-submodules-fix:
fsmonitor OSX: fix hangs for submodules
Usability improvements for running tests in leak-checking mode.
* jk/test-lsan-improvements:
test-lib: check for leak logs after every test
test-lib: show leak-sanitizer logs on --immediate failure
test-lib: stop showing old leak logs
In 6241ce2170 (refs/reftable: reload locked stack when preparing
transaction, 2024-09-24) we have introduced a new test that exercises
how the reftable backend behaves with many concurrent writers all racing
with each other. This test was introduced after a couple of fixes in
this context that should make concurrent writes behave gracefully. As it
turns out though, Windows systems do not yet handle concurrent writes
properly, as we've got two reports for Cygwin and MinGW failing in this
newly added test.
The root cause of this is how we update the "tables.list" file: when
writing a new stack of tables we first write the data into a lockfile
and then rename that file into place. But Windows forbids us from doing
that rename when the target path is open for reading by another process.
And as the test races both readers and writers with each other we are
quite likely to hit this edge case.
This is not a regression: the logic didn't work before the mentioned
commit, and after the commit it performs well on Linux and macOS, and
the situation on Windows should have at least improved a bit. But the
test shows that we need to put more thought into how to make this work
properly there.
Work around the issue by disabling the test on Windows for now. While at
it, increase the locking timeout to address reported timeouts when using
either the address or memory sanitizer, which also tend to significantly
extend the runtime of this test.
This should be revisited after Git v2.47 is out.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
fsmonitor_classify_path_absolute() expects state->path_gitdir_watch.buf
has no trailing '/' or '.' For a submodule, fsmonitor_run_daemon() sets
the value with trailing "/." (as repo_get_git_dir(the_repository) on
Darwin returns ".") so that fsmonitor_classify_path_absolute() returns
IS_OUTSIDE_CONE.
In this case, fsevent_callback() doesn't update cookie_list so that
fsmonitor_publish() does nothing and with_lock__mark_cookies_seen() is
not invoked.
As with_lock__wait_for_cookie() infinitely waits for state->cookies_cond
that with_lock__mark_cookies_seen() should unlock, the whole daemon
hangs.
Remove trailing "/." from state->path_gitdir_watch.buf for submodules
and add a corresponding test in t7527-builtin-fsmonitor.sh. The test is
disabled for MINGW because hangs treated with this patch occur only for
Darwin and there is no simple way to terminate the win32 fsmonitor
daemon that hangs.
Suggested-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Koji Nakamaru <koji.nakamaru@gree.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We tried quite a few things, but this is a failure introduced at the
last -rc before v2.47.0 _and_ it only documents existing behavior as
far as Windows is concerned (concurrent writes are a problem there with
reftables).
So let's punt and simply disable this test for now, to take the pressure
off of v2.47.0.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>