mirror of
https://github.com/git-for-windows/git.git
synced 2026-06-15 09:41:09 -05:00
e9ae852cd992cc34abc082d4eef8aeae48b25feb
23 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
456efac53b |
path-walk: support combine filter
The `combine` filter takes the intersection of its children, that is:
objects are shown only when all child filters would admit the object.
The preceding patches added support for many individual filter types.
Enable users to compose these filters by implementing support for the
`combine` filter type.
Mapping intersection onto path_walk_info works because every supported
child filter is a monotonic restriction:
- `blob:none`, `tree:0` unconditionally clear `info->blobs` and (for
`tree:0`) `info->trees`; clearing an already-cleared flag is a
no-op.
- `object:type=X` is now expressed as an AND of each type flag with the
filtered type, so applying multiple such filters only refines the
existing set rather than overwrites it.
- `blob:limit=N` has to compose too: the intersection of "size < L1"
and "size < L2" is "size < min(L1, L2)".
Update the `LOFC_BLOB_LIMIT` handler to take the running minimum when
`info->blob_limit` is already set, so a combined filter with, e.g.,
both "blob:limit=10" and "blob:limit=5" produces a limit of 5
regardless of ordering.
- `sparse:oid` is left unchanged. A `combine` filter that includes a
`sparse:oid` is allowed at most once, since the existing handler
refuses to overwrite `info->pl`. Two `sparse:oid` filters in a single
`combine` would be unusual and are rejected with a warning, matching
the standalone `sparse:oid` behavior.
Implementation-wise, the existing `prepare_filters()` called
`list_objects_filter_release()` inside each case branch. That works fine for
top-level filters, but `combine` filters need to recurse over its child
filters without releasing each one in turn (since the parent's release
iterates the sub array). Split `prepare_filters()` into a recursive helper
that performs only the mutation, plus a thin wrapper that calls the helper
and then releases the top-level filter once.
The `LOFC_COMBINE` case in the helper just walks `sub_nr` and recurses;
child filters are released by the wrapper's single
`list_objects_filter_release()` call on the parent (which itself recursively
releases each sub-filter, the same way it always has).
If any sub-filter is unsupported (e.g. "tree:1", "sparse:<path>", or a
not-yet-supported choice), the recursion bubbles a failure up and the
existing pack-objects/backfill fallback paths kick in.
Add coverage in t6601:
- "combine:blob:none+tree:0" collapses to "tree:0"
- "combine:object:type=blob+blob:limit=3" yields only the blobs
smaller than three bytes
- "combine:object:type=blob+object:type=tree" intersects to empty
- "combine:tree:1+blob:none" reports the "tree:1" error.
Update Documentation/git-pack-objects.adoc to add combine to the
list of supported --filter forms.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
||
|
|
2b8d07ef91 |
path-walk: support object:type filter
The `object:type` filter accepts only objects of a single type; it is
the second member of the object-info-only filter family that bitmap
traversal already supports.
Like `blob:none` and `tree:0`, it can be evaluated with nothing more
than the object's type, which is exactly the granularity path-walk's
existing info->{commits,trees,blobs,tags} flags already control.
Map `LOFC_OBJECT_TYPE` in `prepare_filters()` by AND-ing each flag
against the filtered type. A single `object:type=X` filter
applied to the default info (all flags = 1) leaves `info->X = 1` and
all the others 0, which is what we want.
Using an AND rather than straight assignment prepares us for a
subsequent change to implement combined object filters.
The path-walk machinery is mostly already wired for the per-type
distinction:
- `walk_path()` calls `path_fn` for a batch only when the corresponding
`info->X` flag is set, so unwanted types are silently not reported.
- `add_tree_entries()` skips tree entries of type `OBJ_BLOB` when
`info->blobs` is unset, so we don't even allocate paths for them.
- The commit-walk loop short-circuits the root-tree fetch when
`!info->trees && !info->blobs`, so commit-only filters don't descend
into trees at all.
But there are a couple of side effects of the "trees off, blobs on" case
that need fixing:
1. 'setup_pending_objects()' previously skipped pending trees as soon
as `info->trees` was zero. For 'object:type=blob' the call site
needs those pending trees: a lightweight tag pointing to a tree, or
an annotated tag whose peeled target is a tree, can both reach
blobs that are otherwise unreachable from any commit's root tree.
Loosen the gate to "if (!info->trees && !info->blobs) continue" and
similarly retrieve the root_tree_list whenever either trees or
blobs are wanted.
2. The revision machinery's `handle_commit()` drops pending trees when
`revs->tree_objects` is zero (see the 'OBJ_TREE' handler in
revision.c), so by the time path-walk sees the pending list
after `prepare_revision_walk()` the tree-bearing pendings would
already be gone. Fix this by setting
revs->tree_objects = info->trees || info->blobs
so pending trees survive `prepare_revision_walk()` whenever we
need to walk into them. Path-walk still resets tree_objects to
zero immediately after `prepare_revision_walk()` returns, so the
rev-walk itself never enumerates trees redundantly with
path-walk's own descent.
Add coverage in t6601 for each of the four `object:type` values. The
'object:type=blob' test in particular asserts that file2 and child/file
(both reachable only through tag-pointed trees) show up in the output,
exercising the pending-tree fix.
Update Documentation/git-pack-objects.adoc to add object:type to
the list of supported --filter forms.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
||
|
|
5111520e2a |
path-walk: support tree:0 filter
The `tree:0` object filter omits all trees and blobs from the result, keeping only commits and tags. Consequently, this filter type should has a fairly straightforward integration with path-walk, as the decision to include an object depends only on its type and does not depend on any path-sensitive state. Mapping it onto `path_walk_info` is direct: set `info->trees = 0` and `info->blobs = 0` in `prepare_filters()` when the `LOFC_TREE_DEPTH` choice is requested with depth zero. The existing code already plumbs those flags through the rest of the walk: - 'walk_objects_by_path()' sets `revs->blob_objects = info->blobs` and `revs->tree_objects = info->trees` before `prepare_revision_walk()`, so the revision walk doesn't try to enumerate trees or blobs itself. - The commit-walk loop short-circuits the root-tree fetch with "if (!info->trees && !info->blobs) continue;", so we never even look up the root tree, let alone descend into it. - `setup_pending_objects()` skips pending trees and blobs based on the same flags. This means the path-walk doesn't allocate or expand any tree structures at all under `tree:0`, which matches the intended behavior of the filter. However, this requires first fixing some issues with how the path-walk API handles directly-requested trees _and_ trees requested through lightweight tags. These changes create substantial updates to t6601-path-walk.sh, which the previous change highlighted as a problem by tagging otherwise-unreachable trees and having them not appear in the output. Non-zero tree-depth filters are not supported. Those depend on the depth at which a tree is visited, which is a path-walk concept the filter machinery doesn't currently share with the path-walk API. Reject them in `prepare_filters()` with a helpful error and let pack-objects fall back to the regular traversal, the same way it already does for unsupported filters. Add coverage in t6601 for both `--all` and a single-branch case to confirm that no trees or blobs are emitted, and a separate test that `tree:1` is rejected with the expected error message. Place the new tests before "setup sparse filter blob" so they run on the original set of refs, before the orphan branch that the sparse-tree tests create. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
2dc858e69e |
pack-objects: support sparse:oid filter with path-walk
The --filter=sparse:<oid> option to 'git pack-objects' allows focusing
an object set to a sparse-checkout definition. This reduces the set of
matching blobs while retaining all reachable trees. No server currently
supports fetching with this filter because it is expensive to compute
and reachability bitmaps do not help without a significant effort to
extend the bitmap feature to store bitmaps for each supported sparse-
checkout definition.
Without focusing on serving fetches and clones with these filters, there
are still benefits that could be realized by making this faster. With
the sparse index, it's more realistic now than ever to be able to
operate a local clone that was bootstrapped by a packfile created with
a sparse filter, because the missing trees are not needed to move a
sparse-checkout from one commit to another or to view the history of any
path in scope. Such clones could perhaps be bootstrapped by partial
bundles.
Previously, constructing these sparse packs has been incredibly
computationally inefficient. The revision walk that explores which
objects are in scope spends a lot of time checking each object to see if
it matches the sparse-checkout patterns, causing quadratic behavior
(number of objects times number of sparse-checkout patterns). This
improves somewhat when using cone-mode sparse-checkout patterns that can
use hashtables and prefix matches to determine containment. However, the
check per object is still too expensive for most cases.
This is where the path-walk feature comes in. We can proceed as normal
by placing objects in bins by path and _then_ check a group of objects
all at once. Since sparse:<oid> only restricts blobs, the path-walk must
include all reachable trees while using the cone-mode patterns to skip
blobs at paths outside the sparse scope. This establishes a baseline for
a potential future "treesparse:<oid>" filter that would also restrict
trees, but introducing such a new filter is deferred to a later change.
The implementation here is focused around loading the sparse-checkout
patterns from the provided object ID and checking that the patterns are
indeed cone-mode patterns. We can then load the correct pattern list
into the path walk context and use the logic that already exists from
|
||
|
|
8ff8de7616 |
path-walk: add pl_sparse_trees to control tree pruning
The path-walk API prunes trees and blobs when a sparse-checkout pattern list is provided, which is the correct behavior for 'git backfill --sparse' since it only needs to fill in objects at paths within the sparse cone. However, a future change will use the path-walk API with a sparse:<oid> filter that restricts only blobs while retaining all reachable trees. To support both behaviors, add a 'pl_sparse_trees' flag to path_walk_info. When set (as in 'git backfill --sparse' and the --stdin-pl test helper mode), the sparse patterns prune both trees and blobs. When unset, only blobs are filtered and all trees are walked and reported. Additionally, move the SEEN flag assignment in add_tree_entries() to after the sparse pattern and pathspec checks. Previously, SEEN was set immediately upon discovering an object, before checking whether its path matched the sparse patterns. When the same object ID appeared at multiple paths (e.g. sibling directories with identical contents), the first path to be visited would mark the object as SEEN. If that path was outside the sparse cone, the object would be skipped there but also never discovered at its in-cone path. By deferring the SEEN flag until after the checks pass, objects that are skipped due to sparse filtering remain discoverable at other paths where they may be in scope. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
f1b5d3da16 |
path-walk: support blob size limit filter
Extend the path-walk API to handle the 'blob:limit=<size>' object filter natively. This filter omits blobs whose size is equal to or greater than the given limit, matching the semantics used by the list-objects-filter machinery. When revs->filter.choice is LOFC_BLOB_LIMIT, the prepare_filters() method stores the limit value in info->blob_limit and clears the filter from revs. If the limit is zero, this degenerates to blob:none (all blobs excluded), so info->blobs is set to 0 instead. During walk_path(), blob batches are filtered before being delivered to the callback: each blob's size is checked via odb_read_object_info(), and only blobs strictly smaller than the limit are included. Blobs whose size cannot be determined (e.g. missing in a partial clone) are conservatively included, matching the existing filter behavior. Empty batches after filtering are skipped entirely. The check for inclusion in the path batch looks a little strange at first glance. We use odb_read_object_info() to read the object's size. Based on all of the assumptions to this point, this _should_ return OBJ_BLOB. Since we are focused on the size filter, we use a short-circuited OR (||) to skip the size check if that method returns a different object type. Notice that this inspection of object sizes requires the content to be present in the repository. The odb_read_object_info() call will download a missing blob on-demand. This means that the use of the path-walk API within 'git backfill' would not operate nicely with this filter type. The intention of that command is to download missing blobs in batches. Downloading objects one-by-one would go against the point. Update the validation in 'git backfill' to add its own compatibility check on top of path_walk_filter_compatible(). Add tests for blob:limit=0 (equivalent to blob:none) and blob:limit=3 (which exercises partial filtering within a batch where some blobs are kept and others are excluded). Co-authored-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
6d87f0e8a3 |
path-walk: support blobless filter
The 'git pack-objects' command can opt-in to using the path-walk API for scanning the objects. Currently, this option is dynamically disabled if combined with '--filter=<X>', even when using a simple filter such as 'blob:none' to signal a blobless packfile. This is a common scenario for repos at scale, so is worth integrating. Also, users can opt-in to the '--path-walk' option by default through the pack.usePathWalk=true config option. When using that in a blobless partial clone, the following warning can appear even though the user did not specify either option directly: warning: cannot use --filter with --path-walk Teach the path-walk API to handle the 'blob:none' object filter natively. When revs->filter.choice is LOFC_BLOB_NONE, the path-walk sets info->blobs to 0 (skipping all blob objects) and clears the filter from revs so that prepare_revision_walk() does not reject the configuration. This check is implemented in the static prepare_filters() method, which will simultaneously check if the input filters are compatible and will make the appropriate mutations to the path_walk_info and filters if the path_walk_info is non-NULL. This allows us to use this logic both in the API method path_walk_filter_compatible() for use in builtin/pack-objects.c and as a prep step in walk_objects_by_path(). Update the test helper (test-path-walk) to accept --filter=<spec> as a test-tool option (before '--'), applying it to revs after setup_revisions() to avoid the --objects requirement check. We can also revert recent GIT_TEST_PACK_PATH_WALK overrides in t5620. Also switch test-path-walk from REV_INFO_INIT with manual repo assignment to repo_init_revisions(), which properly initializes the filter_spec strbuf needed for filter parsing. Add tests for blob:none with --all and with a single branch. The performance test p5315 shows the impact of this change when using blobless filters: Test HEAD~1 HEAD --------------------------------------------------------------------- 5315.6: repack (blob:none) 13.53 13.87 +2.5% 5315.7: repack size (blob:none) 137.7M 137.8M +0.1% 5315.8: repack (blob:none, --path-walk) 13.51 23.43 +73.4% 5315.9: repack size (blob:none, --path-walk) 137.7M 115.2M -16.3% These performance tests were run on the Git repository. The --path-walk feature shows meaningful space savings (16% smaller for blobless packs) at the cost of increased computation time due to the two compression passes. This data demonstrates that the feature is engaged and provides real compression benefits when --no-reuse-delta forces fresh deltas. Co-Authored-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
7a7070eebc |
path-walk: always emit directly-requested objects
We are preparing to integrate the path-walk API with some --filter options in 'git pack-objects', but there is a subtle issue that is revealed when those are put together and the test suite is run with GIT_TEST_PACK_PATH_WALK=1. When a filter reduces the set of requested objects, this results in filtering out directly-requested objects, such as in the download of needed blobs in a blobless partial clone. The root cause is that the scan of pending objects in the path-walk API respects the filters set in the path_walk_info instead of overriding them for pending objects. We can tell that a path is part of the directly-referenced objects if its path name starts with '/' (other paths, including root trees never have this starting character). Create a path_is_for_direct_objects() to make this meaning clear, especially as we add more references in the future as we integrate the path-walk API with partial clone filter options. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
59063fe8b4 |
Merge branch 'yc/path-walk-fix-error-reporting'
The value of a wrong pointer variable was referenced in an error message that reported that it shouldn't be NULL. * yc/path-walk-fix-error-reporting: path-walk: fix NULL pointer dereference in error message |
||
|
|
3f20c21a1c |
path-walk: support wildcard pathspecs for blob filtering
Previously, walk_objects_by_path() silently ignored pathspecs containing
wildcards or magic by clearing them. This caused all blobs to be
downloaded regardless of the given pathspec. Wildcard pathspecs like
"d/file.*.txt" are useful for narrowing which blobs to process (e.g.,
during 'git backfill').
Support wildcard pathspecs by making two changes:
1. Add an 'exact_pathspecs' flag to path_walk_context. When the
pathspec has no wildcards or magic, set this flag and use the
existing fast-path prefix matching in add_tree_entries(). When
wildcards are present, skip that block since prefix matching
cannot handle glob patterns.
2. Add a match_pathspec() check in walk_path() to filter out blobs
whose full path does not match the pathspec. This provides the
actual blob-level filtering for wildcard pathspecs.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
||
|
|
7be182045a |
backfill: work with prefix pathspecs
The previous change allowed specifying revision arguments over the 'git backfill' command-line. This created the opportunity for restricting the initial commit set by filtering the revision walk through a pathspec. Other than filtering the commit set (and thereby the root trees), this did not restrict the path-walk implementation of 'git backfill' and did not restrict the blobs that were downloaded to only those matching the pathspec. Update the path-walk API to accept certain kinds of pathspecs and to silently ignore anything too complex, for now. We will update this in the next change to properly restrict to even complex pathspecs. The current behavior focuses on pathspecs that match paths exactly. This includes exact filenames, including directory names as prefixes. Pathspecs containing wildcards or magic are cleared so the path walk downloads all blobs, as before. The reason for this restriction is to allow for a faster execution by pruning the path walk to only trees that could contribute towards one of those paths as a parent directory. The test directory 'd/f/' (next to 'd/file*.txt') was prepared in a previous commit to exercise the subtlety in prefix matching. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
753ecf4205 |
path-walk: fix NULL pointer dereference in error message
When lookup_tree() or lookup_blob() cannot find a tree entry's object,
'o' is set to NULL via:
o = child ? &child->object : NULL;
The subsequent null-check catches this correctly, but then dereferences
'o' to format the error message:
error(_("failed to find object %s"), oid_to_hex(&o->oid));
This causes a segfault instead of the intended diagnostic output.
Fix this by using &entry.oid instead. 'entry' is the struct name_entry
populated by tree_entry() on each loop iteration and holds the OID of
the failing lookup -- which is exactly what the error should report.
This crash is reachable via git-backfill(1) when a tree entry's object
is absent from the local object database.
Signed-off-by: Yuvraj Singh Chauhan <ysinghcin@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
||
|
|
d473865154 |
path-walk: use repo_parse_tree_gently()
Use the passed in repository instead of the implicit the_repository when parsing the tree. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
93afe9b060 |
path-walk: create initializer for path lists
The previous change fixed a bug in 'git repack -adf --path-walk' that was due to an update to how path lists are initialized and missing some important cases when processing the pending objects. This change takes the three critical places where path lists are initialized and combines them into a static method. This simplifies the callers somewhat while also helping to avoid a missed update in the future. The other places where a path list (struct type_and_oid_list) is initialized is for the following "fixed" lists: * Tag objects. * Commit objects. * Root trees. * Tagged trees. * Tagged blobs. These lists are created and consumed in different ways, with only the root trees being passed into the logic that cares about the "maybe_interesting" bit. It is appropriate to keep these uses separate. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
febb9d87df |
path-walk: fix setup of pending objects
Users reported an issue where objects were missing from their local
repositories after a full repack using 'git repack -adf --path-walk'.
This was alarming and took a while to create a reproducer. Here, we fix
the bug and include a test case that would fail without this fix.
The root cause is that certain objects existed in the index and had no
second versions. These objects are usually blobs, though trees can be
included if a cache-tree exists. The issue is that the revision walk
adds these objects to the "pending" list and the path-walk API forgets
to mark the lists it creates at this point as "maybe_interesting". If
these paths only ever have a single version in the history of the repo
(including the current staged version) then the parent directory never
tries to add a new object to the list and mark the list as
"maybe_interesting". Thus, when walking the list later, the group is
skipped as it is expected that no objects are interesting. This happens
even when there are actually no UNINTERESTING objects at all! This is
based on the optimization enabled by the pack.useSparse=true config
option, which is the default.
Thus, we create a test case that demonstrates the many cases of this
issue for reproducibility:
1. File a/b/c has only one committed version.
2. Files a/i and x/y only exist as staged changes.
3. Tree x/ only exists in the cache-tree.
After performing a non-path-walk repack to force all loose objects into
packfiles, run a --path-walk repack followed by 'git fsck'. This fsck is
what fails with the following errors:
error: invalid object 100644 f2e41136... for 'a/b/c'
This is the dropped instance of the single-versioned a/b/c file.
broken link from tree cfda31d8...
to tree 3f725fcd...
This is the missing tree for the single-versioned a/b/ directory.
missing blob 0ddf2bae... (a/i)
missing blob 975fbec8... (x/y)
missing blob a60d869d... (file)
missing blob f2e41136... (a/b/c)
missing tree 3f725fcd... (a/b/)
dangling tree 5896d7e... (staged root tree)
Note that since the staged root tree is missing, the fsck output cannot
even report that the staged x/ tree is missing as well.
The core problem here is that the "maybe_interesting" member of 'struct
type_and_oid_list' is not initialized to '1'. This member was added in
|
||
|
|
4705889c3d |
path-walk: add new 'edge_aggressive' option
In preparation for allowing both the --shallow and --path-walk options in the 'git pack-objects' builtin, create a new 'edge_aggressive' option in the path-walk API. This option will help walk the boundary more thoroughly and help avoid sending extra objects during fetches and pushes. The only use of the 'edge_hint_aggressive' option in the revision API is within mark_edges_uninteresting(), which is usually called before between prepare_revision_walk() and before visiting commits with get_revision(). In prepare_revision_walk(), the UNINTERESTING commits are walked until a boundary is found. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
bff4555767 |
backfill: add --sparse option
One way to significantly reduce the cost of a Git clone and later fetches is to use a blobless partial clone and combine that with a sparse-checkout that reduces the paths that need to be populated in the working directory. Not only does this reduce the cost of clones and fetches, the sparse-checkout reduces the number of objects needed to download from a promisor remote. However, history investigations can be expensive as computing blob diffs will trigger promisor remote requests for one object at a time. This can be avoided by downloading the blobs needed for the given sparse-checkout using 'git backfill' and its new '--sparse' mode, at a time that the user is willing to pay that extra cost. Note that this is distinctly different from the '--filter=sparse:<oid>' option, as this assumes that the partial clone has all reachable trees and we are using client-side logic to avoid downloading blobs outside of the sparse-checkout cone. This avoids the server-side cost of walking trees while also achieving a similar goal. It also downloads in batches based on similar path names, presenting a resumable download if things are interrupted. This augments the path-walk API to have a possibly-NULL 'pl' member that may point to a 'struct pattern_list'. This could be more general than the sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently the only consumer. Be sure to test this in both cone mode and not cone mode. Cone mode has the benefit that the path-walk can skip certain paths once they would expand beyond the sparse-checkout. Non-cone mode can describe the included files using both positive and negative patterns, which changes the possible return values of path_matches_pattern_list(). Test both kinds of matches for increased coverage. To test this, we can create a blobless sparse clone, expand the sparse-checkout slightly, and then run 'git backfill --sparse' to see how much data is downloaded. The general steps are 1. git clone --filter=blob:none --sparse <url> 2. git sparse-checkout set <dir1> ... <dirN> 3. git backfill --sparse For the Git repository with the 'builtin' directory in the sparse-checkout, we get these results for various batch sizes: | Batch Size | Pack Count | Pack Size | Time | |-----------------|------------|-----------|-------| | (Initial clone) | 3 | 110 MB | | | 10K | 12 | 192 MB | 17.2s | | 15K | 9 | 192 MB | 15.5s | | 20K | 8 | 192 MB | 15.5s | | 25K | 7 | 192 MB | 14.7s | This case matters less because a full clone of the Git repository from GitHub is currently at 277 MB. Using a copy of the Linux repository with the 'kernel/' directory in the sparse-checkout, we get these results: | Batch Size | Pack Count | Pack Size | Time | |-----------------|------------|-----------|------| | (Initial clone) | 2 | 1,876 MB | | | 10K | 11 | 2,187 MB | 46s | | 25K | 7 | 2,188 MB | 43s | | 50K | 5 | 2,194 MB | 44s | | 100K | 4 | 2,194 MB | 48s | This case is more meaningful because a full clone of the Linux repository is currently over 6 GB, so this is a valuable way to download a fraction of the repository and no longer need network access for all reachable objects within the sparse-checkout. Choosing a batch size will depend on a lot of factors, including the user's network speed or reliability, the repository's file structure, and how many versions there are of the file within the sparse-checkout scope. There will not be a one-size-fits-all solution. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
b224e8e36c |
path-walk: drop redundant parse_tree() call
This call to parse_tree() was flagged by Coverity for ignoring the return value. But if we look a little further up the function, we can see that there is already a call to parse_tree_gently(), and we'll return early if that fails. So by this point the tree will always be parsed, and the call is redundant. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
71edf6c3c8 |
path-walk: reorder object visits
The path-walk API currently uses a stack-based approach to recursing through the list of paths within the repository. This guarantees that after a tree path is explored, all paths contained within that tree path will be explored before continuing to explore siblings of that tree path. The initial motivation of this depth-first approach was to minimize memory pressure while exploring the repository. A breadth-first approach would have too many "active" paths being stored in the paths_to_lists map. We can take this approach one step further by making sure that blob paths are visited before tree paths. This allows the API to free the memory for these blob objects before continuing to perform the depth-first search. This modifies the order in which we visit siblings, but does not change the fact that we are performing depth-first search. To achieve this goal, use a priority queue with a custom sorting method. The sort needs to handle tags, blobs, and trees (commits are handled slightly differently). When objects share a type, we can sort by path name. This will keep children of the latest path to leave the stack be preferred over the rest of the paths in the stack, since they agree in prefix up to and including a directory separator. When the types are different, we can prefer tags over other types and blobs over trees. This causes significant adjustments to t6601-path-walk.sh to rearrange the order of the visited paths. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
6333e7ae0b |
path-walk: mark trees and blobs as UNINTERESTING
When the input rev_info has UNINTERESTING starting points, we want to be sure that the UNINTERESTING flag is passed appropriately through the objects. To match how this is done in places such as 'git pack-objects', we use the mark_edges_uninteresting() method. This method has an option for using the "sparse" walk, which is similar in spirit to the path-walk API's walk. To be sure to keep it independent, add a new 'prune_all_uninteresting' option to the path_walk_info struct. To check how the UNINTERSTING flag is spread through our objects, extend the 'test-tool path-walk' command to output whether or not an object has that flag. This changes our tests significantly, including the removal of some objects that were previously visited due to the incomplete implementation. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
9145660979 |
path-walk: visit tags and cached objects
The rev_info that is specified for a path-walk traversal may specify visiting tag refs (both lightweight and annotated) and also may specify indexed objects (blobs and trees). Update the path-walk API to walk these objects as well. When walking tags, we need to peel the annotated objects until reaching a non-tag object. If we reach a commit, then we can add it to the pending objects to make sure we visit in the commit walk portion. If we reach a tree, then we will assume that it is a root tree. If we reach a blob, then we have no good path name and so add it to a new list of "tagged blobs". When the rev_info includes the "--indexed-objects" flag, then the pending set includes blobs and trees found in the cache entries and cache-tree. The cache entries are usually blobs, though they could be trees in the case of a sparse index. The cache-tree stores previously-hashed tree objects but these are cleared out when staging objects below those paths. We add tests that demonstrate this. The indexed objects come with a non-NULL 'path' value in the pending item. This allows us to prepopulate the 'path_to_lists' strmap with lists for these paths. The tricky thing about this walk is that we will want to combine the indexed objects walk with the commit walk, especially in the future case of walking objects during a command like 'git repack'. Whenever possible, we want the objects from the index to be grouped with similar objects in history. We don't want to miss any paths that appear only in the index and not in the commit history. Thus, we need to be careful to let the path stack be populated initially with only the root tree path (and possibly tags and tagged blobs) and go through the normal depth-first search. Afterwards, if there are other paths that are remaining in the paths_to_lists strmap, we should then iterate through the stack and visit those objects recursively. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
c8dba310d7 |
path-walk: allow consumer to specify object types
We add the ability to filter the object types in the path-walk API so the callback function is called fewer times. This adds the ability to ask for the commits in a list, as well. We re-use the empty string for this set of objects because these are passed directly to the callback function instead of being part of the 'path_stack'. Future changes will add the ability to visit annotated tags. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
9d46bc791b |
path-walk: introduce an object walk by path
In anticipation of a few planned applications, introduce the most basic form
of a path-walk API. It currently assumes that there are no UNINTERESTING
objects, and does not include any complicated filters. It calls a function
pointer on groups of tree and blob objects as grouped by path. This only
includes objects the first time they are discovered, so an object that
appears at multiple paths will not be included in two batches.
These batches are collected in 'struct type_and_oid_list' objects, which
store an object type and an oid_array of objects.
The data structures are documented in 'struct path_walk_context', but in
summary the most important are:
* 'paths_to_lists' is a strmap that connects a path to a
type_and_oid_list for that path. To avoid conflicts in path names,
we make sure that tree paths end in "/" (except the root path with
is an empty string) and blob paths do not end in "/".
* 'path_stack' is a string list that is added to in an append-only
way. This stores the stack of our depth-first search on the heap
instead of using recursion.
* 'path_stack_pushed' is a strmap that stores path names that were
already added to 'path_stack', to avoid repeating paths in the
stack. Mostly, this saves us from quadratic lookups from doing
unsorted checks into the string_list.
The coupling of 'path_stack' and 'path_stack_pushed' is protected by the
push_to_stack() method. Call this instead of inserting into these
structures directly.
The walk_objects_by_path() method initializes these structures and
starts walking commits from the given rev_info struct. The commits are
used to find the list of root trees which populate the start of our
depth-first search.
The core of our depth-first search is in a while loop that continues
while we have not indicated an early exit and our 'path_stack' still has
entries in it. The loop body pops a path off of the stack and "visits"
the path via the walk_path() method.
The walk_path() method gets the list of OIDs from the 'path_to_lists'
strmap and executes the callback method on that list with the given path
and type. If the OIDs correspond to tree objects, then iterate over all
trees in the list and run add_children() to add the child objects to
their own lists, adding new entries to the stack if necessary.
In testing, this depth-first search approach was the one that used the
least memory while iterating over the object lists. There is still a
chance that repositories with too-wide path patterns could cause memory
pressure issues. Limiting the stack size could be done in the future by
limiting how many objects are being considered in-progress, or by
visiting blob paths earlier than trees.
There are many future adaptations that could be made, but they are left for
future updates when consumers are ready to take advantage of those features.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|