The --filter=sparse:<oid> option to 'git pack-objects' allows focusing
an object set to a sparse-checkout definition. This reduces the set of
matching blobs while retaining all reachable trees. No server currently
supports fetching with this filter because it is expensive to compute
and reachability bitmaps do not help without a significant effort to
extend the bitmap feature to store bitmaps for each supported sparse-
checkout definition.
Without focusing on serving fetches and clones with these filters, there
are still benefits that could be realized by making this faster. With
the sparse index, it's more realistic now than ever to be able to
operate a local clone that was bootstrapped by a packfile created with
a sparse filter, because the missing trees are not needed to move a
sparse-checkout from one commit to another or to view the history of any
path in scope. Such clones could perhaps be bootstrapped by partial
bundles.
Previously, constructing these sparse packs has been incredibly
computationally inefficient. The revision walk that explores which
objects are in scope spends a lot of time checking each object to see if
it matches the sparse-checkout patterns, causing quadratic behavior
(number of objects times number of sparse-checkout patterns). This
improves somewhat when using cone-mode sparse-checkout patterns that can
use hashtables and prefix matches to determine containment. However, the
check per object is still too expensive for most cases.
This is where the path-walk feature comes in. We can proceed as normal
by placing objects in bins by path and _then_ check a group of objects
all at once. Since sparse:<oid> only restricts blobs, the path-walk must
include all reachable trees while using the cone-mode patterns to skip
blobs at paths outside the sparse scope. This establishes a baseline for
a potential future "treesparse:<oid>" filter that would also restrict
trees, but introducing such a new filter is deferred to a later change.
The implementation here is focused around loading the sparse-checkout
patterns from the provided object ID and checking that the patterns are
indeed cone-mode patterns. We can then load the correct pattern list
into the path walk context and use the logic that already exists from
bff4555767 (backfill: add --sparse option, 2025-02-03), though that
feature loads sparse-checkout patterns from the worktree's local
settings and also restricts tree objects. We use a combination of errors
and warnings to signal problems during this load. The difference is that
errors are likely fatal for the non-path-walk version while the warnings
are probably just implementation details for the path-walk version and
the 'git pack-objects' command can fall back to the revision walk
version.
Now that the SEEN flag is deferred until after pattern checks (from the
previous commit), handle the case where a tree with a shared OID appears
at both an out-of-cone and in-cone path. When trees are not being pruned
(pl_sparse_trees == 0), the path-walk re-walks the tree at the in-cone
path so that in-cone blobs within it are discovered. The new tests in
t5317 and t6601 demonstrate this behavior and would fail without these
changes.
The performance test p5315 shows the impact of this change when using
sparse filters:
Test HEAD~1 HEAD
----------------------------------------------------------------------
5315.10: repack (sparse:oid) 77.98 77.47 -0.7%
5315.11: repack size (sparse:oid) 187.5M 187.4M -0.0%
5315.12: repack (sparse:oid, --path-walk) 77.91 31.41 -59.7%
5315.13: repack size (sparse:oid, --path-walk) 187.5M 161.1M -14.1%
These performance tests were run on the Git repository. The --path-walk
feature shows meaningful space savings (14% smaller for sparse packs)
and dramatic time savings (60% faster) by leveraging the path-walk's
ability to skip blobs outside the sparse scope.
Co-authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Taylor Blaue <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit 302aff0922 (backfill: accept revision arguments, 2026-03-26) added
support for accepting revision arguments to backfill. This allows users
to do things like
git backfill --remotes ^v2.3.0
and then run many commands without triggering on-demand downloads of
blobs. However, if they have topics based on v2.3.0, they will likely
still trigger on-demand downloads. Consider, for example, the command
git log -p v2.3.0..topic
This would still trigger on-demand blob loadings after the backfill
command above, because the commit(s) with A as a parent will need to
diff against the blobs in A. In fact, multiple commands need blobs from
the lower boundary of the revision range:
* git log -p A..B # After backfill A..B
* git replay --onto TARGET A..B # After backfill TARGET^! A..B
* git checkout A && git merge B # After backfill A...B
Add an extra --[no-]include-edges flag to allow grabbing blobs from
edge commits. Since the point of backfill is to prevent on-demand blob
loading and these are common commands, default to --include-edges.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
302aff0922 (backfill: accept revision arguments, 2026-03-26) added
support for passing revision arguments to 'git backfill' but documented
them only with a prose sentence:
You may also specify the commit limiting options from
git-rev-list(1).
No other command that accepts revision arguments documents them this
way. Commands like log, shortlog, and replay define a formal
<revision-range> entry and include rev-list-options.adoc. Commands like
bundle, fast-export, and filter-branch, which pass arguments through to
the revision machinery without including the full options file, still
define a formal <git-rev-list-args> entry explaining what is accepted.
Add a formal <revision-range> entry in the synopsis and OPTIONS section,
following the convention used by other commands, and mention that
commit-limiting options from git-rev-list(1) are also accepted.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The existing implementation of 'git backfill' only includes downloading
missing blobs reachable from HEAD. Advanced uses may desire more general
commit limiting options, such as '--all' for all references, specifying a
commit range via negative references, or specifying a recency of use such as
with '--since=<date>'.
All of these options are available if we use setup_revisions() to parse the
unknown arguments with the revision machinery. This opens up a large number
of possibilities, only a small set of which are tested here.
For documentation, we avoid duplicating the option documentation and instead
link to the documentation of 'git rev-list'.
Note that these arguments currently allow specifying a pathspec, which
modifies the commit history checks but does not limit the paths used in the
backfill logic. This will be updated in a future change.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
For better searchability, this commit adds a check to ensure that parameters
expressed in the form of `--[no-]parameter` are not used in the
documentation. In the place of such parameters, the documentation should
list two separate parameters: `--parameter` and `--no-parameter`.
Signed-off-by: Jean-Noël Avila <jn.avila@free.fr>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Lazy-loading missing files in a blobless clone on demand is costly
as it tends to be one-blob-at-a-time. "git backfill" is introduced
to help bulk-download necessary files beforehand.
* ds/backfill:
backfill: assume --sparse when sparse-checkout is enabled
backfill: add --sparse option
backfill: add --min-batch-size=<n> option
backfill: basic functionality and tests
backfill: add builtin boilerplate