Files
git/Documentation/git-backfill.adoc
Derrick Stolee 2dc858e69e pack-objects: support sparse:oid filter with path-walk
The --filter=sparse:<oid> option to 'git pack-objects' allows focusing
an object set to a sparse-checkout definition. This reduces the set of
matching blobs while retaining all reachable trees. No server currently
supports fetching with this filter because it is expensive to compute
and reachability bitmaps do not help without a significant effort to
extend the bitmap feature to store bitmaps for each supported sparse-
checkout definition.

Without focusing on serving fetches and clones with these filters, there
are still benefits that could be realized by making this faster. With
the sparse index, it's more realistic now than ever to be able to
operate a local clone that was bootstrapped by a packfile created with
a sparse filter, because the missing trees are not needed to move a
sparse-checkout from one commit to another or to view the history of any
path in scope. Such clones could perhaps be bootstrapped by partial
bundles.

Previously, constructing these sparse packs has been incredibly
computationally inefficient. The revision walk that explores which
objects are in scope spends a lot of time checking each object to see if
it matches the sparse-checkout patterns, causing quadratic behavior
(number of objects times number of sparse-checkout patterns). This
improves somewhat when using cone-mode sparse-checkout patterns that can
use hashtables and prefix matches to determine containment. However, the
check per object is still too expensive for most cases.

This is where the path-walk feature comes in. We can proceed as normal
by placing objects in bins by path and _then_ check a group of objects
all at once. Since sparse:<oid> only restricts blobs, the path-walk must
include all reachable trees while using the cone-mode patterns to skip
blobs at paths outside the sparse scope. This establishes a baseline for
a potential future "treesparse:<oid>" filter that would also restrict
trees, but introducing such a new filter is deferred to a later change.

The implementation here is focused around loading the sparse-checkout
patterns from the provided object ID and checking that the patterns are
indeed cone-mode patterns. We can then load the correct pattern list
into the path walk context and use the logic that already exists from
bff4555767 (backfill: add --sparse option, 2025-02-03), though that
feature loads sparse-checkout patterns from the worktree's local
settings and also restricts tree objects. We use a combination of errors
and warnings to signal problems during this load. The difference is that
errors are likely fatal for the non-path-walk version while the warnings
are probably just implementation details for the path-walk version and
the 'git pack-objects' command can fall back to the revision walk
version.

Now that the SEEN flag is deferred until after pattern checks (from the
previous commit), handle the case where a tree with a shared OID appears
at both an out-of-cone and in-cone path. When trees are not being pruned
(pl_sparse_trees == 0), the path-walk re-walks the tree at the in-cone
path so that in-cone blobs within it are discovered. The new tests in
t5317 and t6601 demonstrate this behavior and would fail without these
changes.

The performance test p5315 shows the impact of this change when using
sparse filters:

Test                                              HEAD~1     HEAD
----------------------------------------------------------------------
5315.10: repack (sparse:oid)                      77.98    77.47  -0.7%
5315.11: repack size (sparse:oid)                187.5M   187.4M  -0.0%
5315.12: repack (sparse:oid, --path-walk)         77.91    31.41 -59.7%
5315.13: repack size (sparse:oid, --path-walk)   187.5M   161.1M -14.1%

These performance tests were run on the Git repository. The --path-walk
feature shows meaningful space savings (14% smaller for sparse packs)
and dramatic time savings (60% faster) by leveraging the path-walk's
ability to skip blobs outside the sparse scope.

Co-authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Taylor Blaue <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-24 18:41:06 +09:00

96 lines
3.6 KiB
Plaintext

git-backfill(1)
===============
NAME
----
git-backfill - Download missing objects in a partial clone
SYNOPSIS
--------
[synopsis]
git backfill [--min-batch-size=<n>] [--[no-]sparse] [--[no-]include-edges] [<revision-range>]
DESCRIPTION
-----------
Blobless partial clones are created using `git clone --filter=blob:none`
and then configure the local repository such that the Git client avoids
downloading blob objects unless they are required for a local operation.
This initially means that the clone and later fetches download reachable
commits and trees but no blobs. Later operations that change the `HEAD`
pointer, such as `git checkout` or `git merge`, may need to download
missing blobs in order to complete their operation.
In the worst cases, commands that compute blob diffs, such as `git blame`,
become very slow as they download the missing blobs in single-blob
requests to satisfy the missing object as the Git command needs it. This
leads to multiple download requests and no ability for the Git server to
provide delta compression across those objects.
The `git backfill` command provides a way for the user to request that
Git downloads the missing blobs (with optional filters) such that the
missing blobs representing historical versions of files can be downloaded
in batches. The `backfill` command attempts to optimize the request by
grouping blobs that appear at the same path, hopefully leading to good
delta compression in the packfile sent by the server.
In this way, `git backfill` provides a mechanism to break a large clone
into smaller chunks. Starting with a blobless partial clone with `git
clone --filter=blob:none` and then running `git backfill` in the local
repository provides a way to download all reachable objects in several
smaller network calls than downloading the entire repository at clone
time.
By default, `git backfill` downloads all blobs reachable from the `HEAD`
commit. This set can be restricted or expanded using various options below.
THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR MAY CHANGE IN THE FUTURE.
OPTIONS
-------
`--min-batch-size=<n>`::
Specify a minimum size for a batch of missing objects to request
from the server. This size may be exceeded by the last set of
blobs seen at a given path. The default minimum batch size is
50,000.
`--sparse`::
`--no-sparse`::
Only download objects if they appear at a path that matches the
current sparse-checkout. If the sparse-checkout feature is enabled,
then `--sparse` is assumed and can be disabled with `--no-sparse`.
`--include-edges`::
`--no-include-edges`::
Include blobs from boundary commits in the backfill. Useful in
preparation for commands like `git log -p A..B` or `git replay
--onto TARGET A..B`, where A..B normally excludes A but you need
the blobs from A as well. `--include-edges` is the default.
`<revision-range>`::
Backfill only blobs reachable from commits in the specified
revision range. When no _<revision-range>_ is specified, it
defaults to `HEAD` (i.e. the whole history leading to the
current commit). For a complete list of ways to spell
_<revision-range>_, see the "Specifying Ranges" section of
linkgit:gitrevisions[7].
+
You may also use commit-limiting options understood by
linkgit:git-rev-list[1] such as `--first-parent`, `--since`, or pathspecs.
+
Most `--filter=<spec>` options don't work with the purpose of
`git backfill`, but the `sparse:<oid>` filter is integrated to provide a
focused set of paths to download, distinct from the `--sparse` option.
SEE ALSO
--------
linkgit:git-clone[1],
linkgit:git-rev-list[1]
GIT
---
Part of the linkgit:git[1] suite