Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.
Add the --path-walk option to the performance tests in p5313.
For the microsoft/fluentui repo [1] checked out at a specific commit [2],
the results are very interesting:
Test this tree
------------------------------------------------------------------
5313.2: thin pack 0.40(0.47+0.04)
5313.3: thin pack size 1.2M
5313.4: thin pack with --full-name-hash 0.09(0.10+0.04)
5313.5: thin pack size with --full-name-hash 22.8K
5313.6: thin pack with --path-walk 0.08(0.06+0.02)
5313.7: thin pack size with --path-walk 20.8K
5313.8: big pack 2.16(8.43+0.23)
5313.9: big pack size 17.7M
5313.10: big pack with --full-name-hash 1.42(3.06+0.21)
5313.11: big pack size with --full-name-hash 18.0M
5313.12: big pack with --path-walk 2.21(8.39+0.24)
5313.13: big pack size with --path-walk 17.8M
5313.14: repack 98.05(662.37+2.64)
5313.15: repack size 449.1K
5313.16: repack with --full-name-hash 33.95(129.44+2.63)
5313.17: repack size with --full-name-hash 182.9K
5313.18: repack with --path-walk 106.21(121.58+0.82)
5313.19: repack size with --path-walk 159.6K
[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08
This repo suffers from having a lot of paths that collide in the name
hash, so examining them in groups by path leads to better deltas. Also,
in this case, the single-threaded implementation is competitive with the
full repack. This is saving time diffing files that have significant
differences from each other.
A similar, but private, repo has even more extremes in the thin packs:
Test this tree
--------------------------------------------------------------
5313.2: thin pack 2.39(2.91+0.10)
5313.3: thin pack size 4.5M
5313.4: thin pack with --full-name-hash 0.29(0.47+0.12)
5313.5: thin pack size with --full-name-hash 15.5K
5313.6: thin pack with --path-walk 0.35(0.31+0.04)
5313.7: thin pack size with --path-walk 14.2K
Notice, however, that while the --full-name-hash version is working
quite well in these cases for the thin pack, it does poorly for some
other standard cases, such as this test on the Linux kernel repository:
Test this tree
--------------------------------------------------------------
5313.2: thin pack 0.01(0.00+0.00)
5313.3: thin pack size 310
5313.4: thin pack with --full-name-hash 0.00(0.00+0.00)
5313.5: thin pack size with --full-name-hash 1.4K
5313.6: thin pack with --path-walk 0.00(0.00+0.00)
5313.7: thin pack size with --path-walk 310
Here, the --full-name-hash option does much worse than the default name
hash, but the path-walk option does exactly as well.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
The previous commit changed the behavior of repack's '--max-cruft-size'
to specify a cruft pack-specific override for '--max-pack-size'.
Introduce a new flag, '--combine-cruft-below-size' which is a
replacement for the old behavior of '--max-cruft-size'. This new flag
does explicitly what it says: it combines together cruft packs which are
smaller than a given threshold, and leaves alone ones which are
larger.
This accomplishes the original intent of '--max-cruft-size', which was
to avoid repacking cruft packs larger than the given threshold.
The new behavior is slightly different. Instead of building up small
packs together until the threshold is met, '--combine-cruft-below-size'
packs up *all* cruft packs smaller than the threshold. This means that
we may make a pack much larger than the given threshold (e.g., if you
aggregate 5 packs which are each 99 MiB in size with a threshold of 100
MiB).
But that's OK: the point isn't to restrict the size of the cruft packs
we generate, it's to avoid working with ones that have already grown too
large. If repositories still want to limit the size of the generated
cruft pack(s), they may use '--max-cruft-size'.
There's some minor test fallout as a result of the slight differences in
behavior between the old meaning of '--max-cruft-size' and the behavior
of '--combine-cruft-below-size'. In the test which is now called
"--combine-cruft-below-size combines packs", we need to use the new flag
over the old one to exercise that test's intended behavior. The
remainder of the changes there are to improve the clarity of the
comments.
Suggested-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In 37dc6d8104 (builtin/repack.c: implement support for
`--max-cruft-size`, 2023-10-02), we exposed new functionality that
allowed repositories to specify the behavior of when we should combine
multiple cruft packs together.
This feature was designed to ensure that we never repacked cruft packs
which were larger than the given threshold in order to provide tighter
I/O bounds for repositories that have many unreachable objects. In
essence, specifying '--max-cruft-size=N' instructed 'repack' to
aggregate cruft packs together (in order of ascending size) until the
combine size grows past 'N', and then make a new cruft pack whose
contents includes the packs we rolled up.
But this isn't quite how it works in practice. Suppose for example that
we have two cruft packs which are each 100MiB in size. One might expect
specifying "--max-cruft-size=200M" would combine these two packs
together, and then avoid repacking them until a pruning GC takes place.
In reality, 'repack' would try and aggregate these together, but writing
a pack that is strictly smaller than 200 MiB (since pack-objects'
"--max-pack-size" provides a strict bound for packs containing more than
one object).
So instead we'll write out a pack that is, say, 199 MiB in size, and
then another 1 MiB pack containing the balance. If we later repack the
repository without adding any new unreachable objects, we'll repeat the
same exercise again, making the same 199 MiB and 1 MiB packs each time.
This happens because of a poor choice to bolt the '--max-cruft-size'
functionality onto pack-objects' '--max-pack-size', forcing us to
generate packs which are always smaller than the provided threshold and
thus subject to repacking.
The following commit will introduce a new flag that implements something
similar to the behavior above. Let's prepare for that by making repack's
'--max-cruft-size' flag behave as an cruft pack-specific override for
'--max-pack-size'.
Do so by temporarily repurposing the 'collapse_small_cruft_packs()'
function to instead generate a cruft pack using the same instructions as
if we didn't specify any maximum pack size. The calling code looks
something like:
if (args->max_pack_size && !cruft_expiration) {
collapse_small_cruft_packs(in, args->max_pack_size, existing);
} else {
for_each_string_list_item(item, &existing->non_kept_packs)
fprintf(in, "-%s.pack\n", item->string);
for_each_string_list_item(item, &existing->cruft_packs)
fprintf(in, "-%s.pack\n", item->string);
}
This patch makes collapse_small_cruft_packs() behave identically to the
'else' arm of the conditional above. This repurposing of
'collapse_small_cruft_packs()' is intentional, since it will set us up
nicely to introduce the new behavior in the following commit.
Naturally, there is some test fallout in the test which exercises the
old meaning of '--max-cruft-size'. Mark that test as failing for now to
be dealt with in the following commit. Likewise, add a new test which
explicitly tests the behavior of '--max-cruft-size' to place a hard
limit on the size of any generated cruft pack(s).
Note that this is a breaking change, as it alters the user-visible
behavior of '--max-cruft-size'. But I'm OK changing this behavior in
this instance, since the behavior wasn't accurate to begin with.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
All the documentation .txt files have been renamed to .adoc to help
content aware editors.
* bc/doc-adoc-not-txt:
Remove obsolete ".txt" extensions for AsciiDoc files
doc: use .adoc extension for AsciiDoc files
gitattributes: mark AsciiDoc files as LF-only
editorconfig: add .adoc extension
doc: update gitignore for .adoc extension
We presently use the ".txt" extension for our AsciiDoc files. While not
wrong, most editors do not associate this extension with AsciiDoc,
meaning that contributors don't get automatic editor functionality that
could be useful, such as syntax highlighting and prose linting.
It is much more common to use the ".adoc" extension for AsciiDoc files,
since this helps editors automatically detect files and also allows
various forges to provide rich (HTML-like) rendering. Let's do that
here, renaming all of the files and updating the includes where
relevant. Adjust the various build scripts and makefiles to use the new
extension as well.
Note that this should not result in any user-visible changes to the
documentation.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>