git/Documentation/git-repack.adoc
Derrick Stolee 5f711504d9 repack: add --path-walk option
Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.

Add the --path-walk option to the performance tests in p5313.

For the microsoft/fluentui repo [1] checked out at a specific commit [2],
the --path-walk tests in p5313 look like this:

Test                                                     this tree
-------------------------------------------------------------------------
5313.18: thin pack with --path-walk                      0.08(0.06+0.02)
5313.19: thin pack size with --path-walk                           18.4K
5313.20: big pack with --path-walk                       2.10(7.80+0.26)
5313.21: big pack size with --path-walk                            19.8M
5313.22: shallow fetch pack with --path-walk             1.62(3.38+0.17)
5313.23: shallow pack size with --path-walk                        33.6M
5313.24: repack with --path-walk                         81.29(96.08+0.71)
5313.25: repack size with --path-walk                             142.5M

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

Combining these with the earlier tests in p5313, I'll reformat the
comparison as follows:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1             439.4M      87.24s
Hash v2             161.7M      21.51s
Path Walk           142.5M      81.29s

There are a few things to notice here:

 1. The benefits of --name-hash-version=2 over --name-hash-version=1 are
    significant, but --path-walk still compresses better than that
    option.

 2. The --path-walk command is still using --name-hash-version=1 for the
    second pass of delta computation, using the increased name hash
    collisions as a potential method for opportunistic compression on
    top of the path-focused compression.

 3. The --path-walk algorithm is currently sequential and does not use
    multiple threads for delta compression. Threading will be
    implemented in a future change so the computation time will improve
    to better compete in this metric.

There are small benefits in size for my copy of the Git repository:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1             248.8M      30.44s
Hash v2             249.0M      30.15s
Path Walk           213.2M     142.50s

As well as in the nodejs/node repository [3]:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1             739.9M      71.18s
Hash v2             764.6M      67.82s
Path Walk           698.1M     208.10s

[3] https://github.com/nodejs/node

This benefit also repeats in my copy of the Linux kernel repository:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1               2.5G     554.41s
Hash v2               2.5G     549.62s
Path Walk             2.2G    1562.36s

It is important to see that even when the repository shape does not have
many name-hash collisions, there is a slight space boost to be found
using this method.

As this repacking strategy was released in Git for Windows 2.47.0, some
users have reported cases where the --path-walk compression is slightly
worse than the --name-hash-version=2 option. In those cases, it may be
beneficial to combine the two options. However, there has not been a
released version of Git that has both options and I don't have access to
these repos for testing.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-16 12:15:39 -07:00

git-repack(1)
=============

NAME
----
git-repack - Pack unpacked objects in a repository

SYNOPSIS
--------
[verse]
'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]
	[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
	[--write-midx] [--name-hash-version=<n>] [--path-walk]

DESCRIPTION
-----------

This command is used to combine all objects that do not currently
reside in a "pack", into a pack. It can also be used to re-organize
existing packs into a single, more efficient pack.

A pack is a collection of objects, individually compressed, with
delta compression applied, stored in a single file, with an
associated index file.

Packs are used to reduce the load on mirror systems, backup
engines, disk storage, etc.

OPTIONS
-------
-a::
Instead of incrementally packing the unpacked objects,
pack everything referenced into a single pack.
Especially useful when packing a repository that is used
for private development. Use
with `-d`. This will clean up the objects that `git prune`
leaves behind, but `git fsck --full --dangling` shows as
dangling.
+
Note that users fetching over dumb protocols will have to fetch the
whole new pack in order to get any contained object, no matter how many
other objects in that pack they already have locally.
+
Promisor packfiles are repacked separately: if there are packfiles that
have an associated ".promisor" file, these packfiles will be repacked
into another separate pack, and an empty ".promisor" file corresponding
to the new separate pack will be written.
-A::
Same as `-a`, unless `-d` is used. Then any unreachable
objects in a previous pack become loose, unpacked objects,
instead of being left in the old pack. Unreachable objects
are never intentionally added to a pack, even when repacking.
This option prevents unreachable objects from being immediately
deleted by way of being left in the old pack and then
removed. Instead, the loose unreachable objects
will be pruned according to normal expiry rules
with the next 'git gc' invocation. See linkgit:git-gc[1].
-d::
After packing, if the newly created packs make some
existing packs redundant, remove the redundant packs.
Also run 'git prune-packed' to remove redundant
loose object files.
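+
As a rough illustration of how `-a` and `-d` work together, the sketch
below builds a throwaway repository with two incremental packs and then
consolidates them. The repository location, file names and the
`count_packs` helper are assumptions made for this example only.
+
```shell
# Throwaway repository, so the sketch can be run without touching real data.
repo=$(mktemp -d)
git init -q "$repo"
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

# Hypothetical helper: number of pack files currently in the repository.
count_packs() {
    find "$repo/.git/objects/pack" -name '*.pack' | wc -l | tr -d ' '
}

# Two commits, each followed by an incremental repack of the new objects.
for n in 1 2; do
    echo "content $n" >"$repo/file$n.txt"
    git -C "$repo" add "file$n.txt"
    git -C "$repo" commit -q -m "commit $n"
    git -C "$repo" repack -q
done
packs_before=$(count_packs)

# -a packs everything referenced into a single pack; -d then removes the
# packs and loose objects that the new pack makes redundant.
git -C "$repo" repack -a -d -q
packs_after=$(count_packs)

echo "packs before: $packs_before, after: $packs_after"
```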
--cruft::
Same as `-a`, unless `-d` is used. Then any unreachable objects
are packed into a separate cruft pack. Unreachable objects can
be pruned using the normal expiry rules with the next `git gc`
invocation (see linkgit:git-gc[1]). Incompatible with `-k`.
--cruft-expiration=<approxidate>::
Expire unreachable objects older than `<approxidate>`
immediately instead of waiting for the next `git gc` invocation.
Only useful with `--cruft -d`.
--max-cruft-size=<n>::
Repack cruft objects into packs as large as `<n>` bytes before
creating new packs. As long as there are enough cruft packs
smaller than `<n>`, repacking will cause a new cruft pack to
be created containing objects from any combined cruft packs,
along with any new unreachable objects. Cruft packs larger than
`<n>` will not be modified. When the new cruft pack is larger
than `<n>` bytes, it will be split into multiple packs, all of
which are guaranteed to be at most `<n>` bytes in size. Only
useful with `--cruft -d`.
--expire-to=<dir>::
Write a cruft pack containing pruned objects (if any) to the
directory `<dir>`. This option is useful for keeping a copy of
any pruned objects in a separate directory as a backup. Only
useful with `--cruft -d`.
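+
The cruft options above might be exercised as in the following sketch.
It assumes a throwaway repository and creates one unreachable blob by
hand; because `--cruft` is not available in older Git releases, the
sketch falls back to a message instead of failing. The `.mtimes` file it
looks for is the companion file Git writes next to a cruft pack.
+
```shell
# Throwaway repository with one commit.
repo=$(mktemp -d)
git init -q "$repo"
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
git -C "$repo" commit -q --allow-empty -m "initial"

# Write a blob that no ref reaches, so there is something unreachable.
echo "orphaned content" | git -C "$repo" hash-object -w --stdin >/dev/null

# Older Git versions reject --cruft; in that case nothing is repacked here.
if git -C "$repo" repack --cruft -d -q 2>/dev/null; then
    cruft_supported=yes
    # List the .mtimes file that accompanies the cruft pack.
    find "$repo/.git/objects/pack" -name '*.mtimes'
else
    cruft_supported=no
    echo "this Git does not support 'git repack --cruft'"
fi
```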
-l::
Pass the `--local` option to 'git pack-objects'. See
linkgit:git-pack-objects[1].
-f::
Pass the `--no-reuse-delta` option to `git-pack-objects`, see
linkgit:git-pack-objects[1].
-F::
Pass the `--no-reuse-object` option to `git-pack-objects`, see
linkgit:git-pack-objects[1].
-q::
--quiet::
Show no progress over the standard error stream and pass the `-q`
option to 'git pack-objects'. See linkgit:git-pack-objects[1].
-n::
Do not update the server information with
'git update-server-info'. This option skips
updating local catalog files needed to publish
this repository (or a direct copy of it)
over HTTP or FTP. See linkgit:git-update-server-info[1].
--window=<n>::
--depth=<n>::
These two options affect how the objects contained in the pack are
stored using delta compression. The objects are first internally
sorted by type, size and optionally names and compared against the
other objects within `--window` to see if using delta compression saves
space. `--depth` limits the maximum delta depth; making it too deep
affects the performance on the unpacker side, because delta data needs
to be applied that many times to get to the necessary object.
+
The default value for --window is 10 and --depth is 50. The maximum
depth is 4095.
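+
A minimal sketch of tuning both values follows. The repository, the file
`numbers.txt` and the chosen numbers (`--window=50`, `--depth=20`) are
illustrative assumptions, not recommendations. `-f` is added so that
deltas are recomputed rather than reused from an existing pack.
+
```shell
# Throwaway repository with several revisions of one file, so that the
# delta search has similar objects to compare.
repo=$(mktemp -d)
git init -q "$repo"
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

for n in 1 2 3 4 5; do
    seq 1 $((n * 200)) >"$repo/numbers.txt"
    git -C "$repo" add numbers.txt
    git -C "$repo" commit -q -m "revision $n"
done

# Larger window: more candidates examined per object (slower, possibly smaller).
# Smaller depth: shorter delta chains (faster to unpack, possibly larger).
git -C "$repo" repack -a -d -f -q --window=50 --depth=20

# Summarize the resulting object storage.
git -C "$repo" count-objects -v
```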
--threads=<n>::
This option is passed through to `git pack-objects`.
--window-memory=<n>::
This option provides an additional limit on top of `--window`;
the window size will dynamically scale down so as to not take
up more than '<n>' bytes in memory. This is useful in
repositories with a mix of large and small objects to not run
out of memory with a large window, but still be able to take
advantage of the large window for the smaller objects. The
size can be suffixed with "k", "m", or "g".
`--window-memory=0` makes memory usage unlimited. The default
is taken from the `pack.windowMemory` configuration variable.
Note that the actual memory usage will be the limit multiplied
by the number of threads used by linkgit:git-pack-objects[1].
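+
The note about threads is easy to overlook, so here is the arithmetic as
a small sketch. The values 256 MiB and 4 threads are assumptions chosen
for the example; the command line is printed rather than executed.
+
```shell
window_memory_mib=256   # corresponds to --window-memory=256m
threads=4               # corresponds to --threads=4

# The limit applies per thread, so the total is the product of the two.
worst_case_mib=$((window_memory_mib * threads))
echo "delta window memory may reach about ${worst_case_mib} MiB"

# The matching invocation, shown for reference only.
echo "git repack -a -d --window-memory=${window_memory_mib}m --threads=${threads}"
```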
--max-pack-size=<n>::
Maximum size of each output pack file. The size can be suffixed with
"k", "m", or "g". The minimum size allowed is limited to 1 MiB.
If specified, multiple packfiles may be created, which also
prevents the creation of a bitmap index.
The default is unlimited, unless the config variable
`pack.packSizeLimit` is set. Note that this option may result in
a larger and slower repository; see the discussion in
`pack.packSizeLimit`.
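+
To see the splitting behavior, the following sketch stores two blobs of
random (and therefore poorly compressible) data, each larger than the
limit, and repacks with the minimum allowed size. The file names and
sizes are assumptions for illustration.
+
```shell
# Throwaway repository holding two blobs of about 1.5 MiB each.
repo=$(mktemp -d)
git init -q "$repo"
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

for n in 1 2; do
    head -c 1572864 /dev/urandom >"$repo/blob$n.bin"
    git -C "$repo" add "blob$n.bin"
done
git -C "$repo" commit -q -m "two large blobs"

# With a 1 MiB limit the objects cannot all share one pack file.
git -C "$repo" repack -a -d -q --max-pack-size=1m

pack_count=$(find "$repo/.git/objects/pack" -name '*.pack' | wc -l | tr -d ' ')
echo "pack files written: $pack_count"
```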
--filter=<filter-spec>::
Remove objects matching the filter specification from the
resulting packfile and put them into a separate packfile. Note
that objects used in the working directory are not filtered
out. So for the split to fully work, it's best to perform it
in a bare repo and to use the `-a` and `-d` options along with
this option. Also `--no-write-bitmap-index` (or the
`repack.writebitmaps` config option set to `false`) should be
used otherwise writing bitmap index will fail, as it supposes
a single packfile containing all the objects. See
linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
--filter-to=<dir>::
Write the pack containing filtered out objects to the
directory `<dir>`. Only useful with `--filter`. This can be
used for putting the pack on a separate object directory that
is accessed through the Git alternates mechanism. **WARNING:**
If the packfile containing the filtered out objects is not
accessible, the repo can become corrupt as it might not be
possible to access the objects in that packfile. See the
`objects` and `objects/info/alternates` sections of
linkgit:gitrepository-layout[5].
-b::
--write-bitmap-index::
Write a reachability bitmap index as part of the repack. This
only makes sense when used with `-a`, `-A` or `-m`, as the bitmaps
must be able to refer to all reachable objects. This option
overrides the setting of `repack.writeBitmaps`. This option
has no effect if multiple packfiles are created, unless writing a
MIDX (in which case a multi-pack bitmap is created).
--pack-kept-objects::
Include objects from packs that have a `.keep` file when repacking. Note that we
still do not delete `.keep` packs after `pack-objects` finishes.
This means that we may duplicate objects, but this makes the
option safe to use when there are concurrent pushes or fetches.
This option is generally only useful if you are writing bitmaps
with `-b` or `repack.writeBitmaps`, as it ensures that the
bitmapped packfile has the necessary objects.
--keep-pack=<pack-name>::
Exclude the given pack from repacking. This is the equivalent
of having a `.keep` file on the pack. `<pack-name>` is the
pack file name without leading directory (e.g. `pack-123.pack`).
The option can be specified multiple times to keep multiple
packs.
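+
A sketch of excluding one pack from a full repack is shown below. The
repository and file names are assumptions; the pack name is read from
the pack directory rather than hard-coded.
+
```shell
# Throwaway repository with one commit, packed into a single pack.
repo=$(mktemp -d)
git init -q "$repo"
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

echo "first" >"$repo/a.txt"
git -C "$repo" add a.txt
git -C "$repo" commit -q -m "first"
git -C "$repo" repack -a -d -q

# Pack file name without the leading directory, as --keep-pack expects.
kept=$(basename "$(find "$repo/.git/objects/pack" -name '*.pack')")

# New objects arrive after the first pack was written.
echo "second" >"$repo/b.txt"
git -C "$repo" add b.txt
git -C "$repo" commit -q -m "second"

# The kept pack is left alone; only the remaining objects are repacked.
git -C "$repo" repack -a -d -q --keep-pack="$kept"

pack_count=$(find "$repo/.git/objects/pack" -name '*.pack' | wc -l | tr -d ' ')
echo "kept: $kept, packs now: $pack_count"
```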
--unpack-unreachable=<when>::
When loosening unreachable objects, do not bother loosening any
objects older than `<when>`. This can be used to optimize out
the write of any objects that would be immediately pruned by
a follow-up `git prune`.
-k::
--keep-unreachable::
When used with `-ad`, any unreachable objects from existing
packs will be appended to the end of the packfile instead of
being removed. In addition, any unreachable loose objects will
be packed (and their loose counterparts removed).
-i::
--delta-islands::
Pass the `--delta-islands` option to `git-pack-objects`, see
linkgit:git-pack-objects[1].
-g<factor>::
--geometric=<factor>::
Arrange resulting pack structure so that each successive pack
contains at least `<factor>` times the number of objects as the
next-largest pack.
+
`git repack` ensures this by determining a "cut" of packfiles that need
to be repacked into one in order to ensure a geometric progression. It
picks the smallest set of packfiles such that as many of the larger
packfiles (by count of objects contained in that pack) may be left
intact.
+
Unlike other repack modes, the set of objects to pack is determined
uniquely by the set of packs being "rolled-up"; in other words, the
packs determined to need to be combined in order to restore a geometric
progression.
+
Loose objects are implicitly included in this "roll-up", without respect to
their reachability. This is subject to change in the future.
+
When writing a multi-pack bitmap, `git repack` selects the largest resulting
pack as the preferred pack for object selection by the MIDX (see
linkgit:git-multi-pack-index[1]).
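+
The geometric condition itself can be illustrated with a few lines of
shell. The `is_geometric` function below is a hypothetical helper, not
part of Git; it only checks the condition stated above for a list of
per-pack object counts given in ascending order, and does not reproduce
how `git repack` chooses which packs to roll up.
+
```shell
# Hypothetical helper: succeeds when every pack holds at least <factor>
# times as many objects as the pack before it (counts in ascending order).
is_geometric() {
    factor=$1
    shift
    prev=0
    for count in "$@"; do
        if [ "$prev" -gt 0 ] && [ "$count" -lt $((factor * prev)) ]; then
            return 1
        fi
        prev=$count
    done
    return 0
}

is_geometric 2 100 250 600 && echo "100 250 600: progression holds"
is_geometric 2 100 150 600 || echo "100 150 600: a roll-up would be needed"

# A typical invocation for a real repository would be:
#   git repack --geometric=2 -d
```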
-m::
--write-midx::
Write a multi-pack index (see linkgit:git-multi-pack-index[1])
containing the non-redundant packs.
--name-hash-version=<n>::
Provide this argument to the underlying `git pack-objects` process.
See linkgit:git-pack-objects[1] for full details.
--path-walk::
Pass the `--path-walk` option to the underlying `git pack-objects`
process. See linkgit:git-pack-objects[1] for full details.
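+
Because `--path-walk` is a recent addition, scripts may want to cope
with Git versions that do not know it. The sketch below assumes a
throwaway repository and falls back to a plain full repack when the
option is rejected; the fallback is an illustrative choice.
+
```shell
# Throwaway repository with a few revisions of the same path.
repo=$(mktemp -d)
git init -q "$repo"
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

for n in 1 2 3; do
    seq 1 $((n * 100)) >"$repo/data.txt"
    git -C "$repo" add data.txt
    git -C "$repo" commit -q -m "revision $n"
done

# Try the path-walk strategy first; an unknown option makes repack exit
# before it writes anything, so the fallback starts from the same state.
if git -C "$repo" repack -a -d -f -q --path-walk 2>/dev/null; then
    strategy="path-walk"
else
    git -C "$repo" repack -a -d -f -q
    strategy="default"
fi
echo "repacked using the $strategy strategy"
```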

CONFIGURATION
-------------

Various configuration variables affect packing, see
linkgit:git-config[1] (search for "pack" and "delta").

By default, the command passes the `--delta-base-offset` option to
'git pack-objects'; this typically results in slightly smaller packs,
but the generated packs are incompatible with versions of Git older than
version 1.4.4. If you need to share your repository with such ancient Git
versions, either directly or via the dumb http protocol, then you
need to set the configuration variable `repack.UseDeltaBaseOffset` to
"false" and repack. Access from old Git versions over the native protocol
is unaffected by this option as the conversion is performed on the fly
as needed in that case.
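
The two steps described above (set the variable, then repack) might look
like this in practice. The repository here is a throwaway one created
for the example.

```shell
# Throwaway repository with one commit.
repo=$(mktemp -d)
git init -q "$repo"
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

echo "hello" >"$repo/hello.txt"
git -C "$repo" add hello.txt
git -C "$repo" commit -q -m "hello"

# Stop passing --delta-base-offset to 'git pack-objects', then repack so
# the existing packs are rewritten under the new setting.
git -C "$repo" config repack.useDeltaBaseOffset false
git -C "$repo" repack -a -d -q

setting=$(git -C "$repo" config --get repack.useDeltaBaseOffset)
echo "repack.useDeltaBaseOffset=$setting"
```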

Delta compression is not used on objects larger than the
`core.bigFileThreshold` configuration variable and on files with the
attribute `delta` set to false.

SEE ALSO
--------
linkgit:git-pack-objects[1]
linkgit:git-prune-packed[1]

GIT
---
Part of the linkgit:git[1] suite