git-for-windows/git - git - Gitea: Self-hosted GitHub

mirror of https://github.com/git-for-windows/git.git synced 2026-06-15 09:41:09 -05:00

Author	SHA1	Message	Date
Taylor Blau	456efac53b	path-walk: support `combine` filter The `combine` filter takes the intersection of its children, that is: objects are shown only when all child filters would admit the object. The preceding patches added support for many individual filter types. Enable users to compose these filters by implementing support for the `combine` filter type. Mapping intersection onto path_walk_info works because every supported child filter is a monotonic restriction: - `blob:none`, `tree:0` unconditionally clear `info->blobs` and (for `tree:0`) `info->trees`; clearing an already-cleared flag is a no-op. - `object:type=X` is now expressed as an AND of each type flag with the filtered type, so applying multiple such filters only refines the existing set rather than overwrites it. - `blob:limit=N` has to compose too: the intersection of "size < L1" and "size < L2" is "size < min(L1, L2)". Update the `LOFC_BLOB_LIMIT` handler to take the running minimum when `info->blob_limit` is already set, so a combined filter with, e.g., both "blob:limit=10" and "blob:limit=5" produces a limit of 5 regardless of ordering. - `sparse:oid` is left unchanged. A `combine` filter that includes a `sparse:oid` is allowed at most once, since the existing handler refuses to overwrite `info->pl`. Two `sparse:oid` filters in a single `combine` would be unusual and are rejected with a warning, matching the standalone `sparse:oid` behavior. Implementation-wise, the existing `prepare_filters()` called `list_objects_filter_release()` inside each case branch. That works fine for top-level filters, but `combine` filters need to recurse over its child filters without releasing each one in turn (since the parent's release iterates the sub array). Split `prepare_filters()` into a recursive helper that performs only the mutation, plus a thin wrapper that calls the helper and then releases the top-level filter once. The `LOFC_COMBINE` case in the helper just walks `sub_nr` and recurses; child filters are released by the wrapper's single `list_objects_filter_release()` call on the parent (which itself recursively releases each sub-filter, the same way it always has). If any sub-filter is unsupported (e.g. "tree:1", "sparse:<path>", or a not-yet-supported choice), the recursion bubbles a failure up and the existing pack-objects/backfill fallback paths kick in. Add coverage in t6601: - "combine:blob:none+tree:0" collapses to "tree:0" - "combine:object:type=blob+blob:limit=3" yields only the blobs smaller than three bytes - "combine:object:type=blob+object:type=tree" intersects to empty - "combine:tree:1+blob:none" reports the "tree:1" error. Update Documentation/git-pack-objects.adoc to add combine to the list of supported --filter forms. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2026-05-24 18:41:07 +09:00
Taylor Blau	2b8d07ef91	path-walk: support `object:type` filter The `object:type` filter accepts only objects of a single type; it is the second member of the object-info-only filter family that bitmap traversal already supports. Like `blob:none` and `tree:0`, it can be evaluated with nothing more than the object's type, which is exactly the granularity path-walk's existing info->{commits,trees,blobs,tags} flags already control. Map `LOFC_OBJECT_TYPE` in `prepare_filters()` by AND-ing each flag against the filtered type. A single `object:type=X` filter applied to the default info (all flags = 1) leaves `info->X = 1` and all the others 0, which is what we want. Using an AND rather than straight assignment prepares us for a subsequent change to implement combined object filters. The path-walk machinery is mostly already wired for the per-type distinction: - `walk_path()` calls `path_fn` for a batch only when the corresponding `info->X` flag is set, so unwanted types are silently not reported. - `add_tree_entries()` skips tree entries of type `OBJ_BLOB` when `info->blobs` is unset, so we don't even allocate paths for them. - The commit-walk loop short-circuits the root-tree fetch when `!info->trees && !info->blobs`, so commit-only filters don't descend into trees at all. But there are a couple of side effects of the "trees off, blobs on" case that need fixing: 1. 'setup_pending_objects()' previously skipped pending trees as soon as `info->trees` was zero. For 'object:type=blob' the call site needs those pending trees: a lightweight tag pointing to a tree, or an annotated tag whose peeled target is a tree, can both reach blobs that are otherwise unreachable from any commit's root tree. Loosen the gate to "if (!info->trees && !info->blobs) continue" and similarly retrieve the root_tree_list whenever either trees or blobs are wanted. 2. The revision machinery's `handle_commit()` drops pending trees when `revs->tree_objects` is zero (see the 'OBJ_TREE' handler in revision.c), so by the time path-walk sees the pending list after `prepare_revision_walk()` the tree-bearing pendings would already be gone. Fix this by setting revs->tree_objects = info->trees \|\| info->blobs so pending trees survive `prepare_revision_walk()` whenever we need to walk into them. Path-walk still resets tree_objects to zero immediately after `prepare_revision_walk()` returns, so the rev-walk itself never enumerates trees redundantly with path-walk's own descent. Add coverage in t6601 for each of the four `object:type` values. The 'object:type=blob' test in particular asserts that file2 and child/file (both reachable only through tag-pointed trees) show up in the output, exercising the pending-tree fix. Update Documentation/git-pack-objects.adoc to add object:type to the list of supported --filter forms. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2026-05-24 18:41:07 +09:00
Taylor Blau	5111520e2a	path-walk: support `tree:0` filter The `tree:0` object filter omits all trees and blobs from the result, keeping only commits and tags. Consequently, this filter type should has a fairly straightforward integration with path-walk, as the decision to include an object depends only on its type and does not depend on any path-sensitive state. Mapping it onto `path_walk_info` is direct: set `info->trees = 0` and `info->blobs = 0` in `prepare_filters()` when the `LOFC_TREE_DEPTH` choice is requested with depth zero. The existing code already plumbs those flags through the rest of the walk: - 'walk_objects_by_path()' sets `revs->blob_objects = info->blobs` and `revs->tree_objects = info->trees` before `prepare_revision_walk()`, so the revision walk doesn't try to enumerate trees or blobs itself. - The commit-walk loop short-circuits the root-tree fetch with "if (!info->trees && !info->blobs) continue;", so we never even look up the root tree, let alone descend into it. - `setup_pending_objects()` skips pending trees and blobs based on the same flags. This means the path-walk doesn't allocate or expand any tree structures at all under `tree:0`, which matches the intended behavior of the filter. However, this requires first fixing some issues with how the path-walk API handles directly-requested trees _and_ trees requested through lightweight tags. These changes create substantial updates to t6601-path-walk.sh, which the previous change highlighted as a problem by tagging otherwise-unreachable trees and having them not appear in the output. Non-zero tree-depth filters are not supported. Those depend on the depth at which a tree is visited, which is a path-walk concept the filter machinery doesn't currently share with the path-walk API. Reject them in `prepare_filters()` with a helpful error and let pack-objects fall back to the regular traversal, the same way it already does for unsupported filters. Add coverage in t6601 for both `--all` and a single-branch case to confirm that no trees or blobs are emitted, and a separate test that `tree:1` is rejected with the expected error message. Place the new tests before "setup sparse filter blob" so they run on the original set of refs, before the orphan branch that the sparse-tree tests create. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2026-05-24 18:41:07 +09:00
Derrick Stolee	2dc858e69e	pack-objects: support sparse:oid filter with path-walk The --filter=sparse:<oid> option to 'git pack-objects' allows focusing an object set to a sparse-checkout definition. This reduces the set of matching blobs while retaining all reachable trees. No server currently supports fetching with this filter because it is expensive to compute and reachability bitmaps do not help without a significant effort to extend the bitmap feature to store bitmaps for each supported sparse- checkout definition. Without focusing on serving fetches and clones with these filters, there are still benefits that could be realized by making this faster. With the sparse index, it's more realistic now than ever to be able to operate a local clone that was bootstrapped by a packfile created with a sparse filter, because the missing trees are not needed to move a sparse-checkout from one commit to another or to view the history of any path in scope. Such clones could perhaps be bootstrapped by partial bundles. Previously, constructing these sparse packs has been incredibly computationally inefficient. The revision walk that explores which objects are in scope spends a lot of time checking each object to see if it matches the sparse-checkout patterns, causing quadratic behavior (number of objects times number of sparse-checkout patterns). This improves somewhat when using cone-mode sparse-checkout patterns that can use hashtables and prefix matches to determine containment. However, the check per object is still too expensive for most cases. This is where the path-walk feature comes in. We can proceed as normal by placing objects in bins by path and _then_ check a group of objects all at once. Since sparse:<oid> only restricts blobs, the path-walk must include all reachable trees while using the cone-mode patterns to skip blobs at paths outside the sparse scope. This establishes a baseline for a potential future "treesparse:<oid>" filter that would also restrict trees, but introducing such a new filter is deferred to a later change. The implementation here is focused around loading the sparse-checkout patterns from the provided object ID and checking that the patterns are indeed cone-mode patterns. We can then load the correct pattern list into the path walk context and use the logic that already exists from `bff4555767` (backfill: add --sparse option, 2025-02-03), though that feature loads sparse-checkout patterns from the worktree's local settings and also restricts tree objects. We use a combination of errors and warnings to signal problems during this load. The difference is that errors are likely fatal for the non-path-walk version while the warnings are probably just implementation details for the path-walk version and the 'git pack-objects' command can fall back to the revision walk version. Now that the SEEN flag is deferred until after pattern checks (from the previous commit), handle the case where a tree with a shared OID appears at both an out-of-cone and in-cone path. When trees are not being pruned (pl_sparse_trees == 0), the path-walk re-walks the tree at the in-cone path so that in-cone blobs within it are discovered. The new tests in t5317 and t6601 demonstrate this behavior and would fail without these changes. The performance test p5315 shows the impact of this change when using sparse filters: Test HEAD~1 HEAD ---------------------------------------------------------------------- 5315.10: repack (sparse:oid) 77.98 77.47 -0.7% 5315.11: repack size (sparse:oid) 187.5M 187.4M -0.0% 5315.12: repack (sparse:oid, --path-walk) 77.91 31.41 -59.7% 5315.13: repack size (sparse:oid, --path-walk) 187.5M 161.1M -14.1% These performance tests were run on the Git repository. The --path-walk feature shows meaningful space savings (14% smaller for sparse packs) and dramatic time savings (60% faster) by leveraging the path-walk's ability to skip blobs outside the sparse scope. Co-authored-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Taylor Blaue <me@ttaylorr.com> Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2026-05-24 18:41:06 +09:00
Derrick Stolee	8ff8de7616	path-walk: add pl_sparse_trees to control tree pruning The path-walk API prunes trees and blobs when a sparse-checkout pattern list is provided, which is the correct behavior for 'git backfill --sparse' since it only needs to fill in objects at paths within the sparse cone. However, a future change will use the path-walk API with a sparse:<oid> filter that restricts only blobs while retaining all reachable trees. To support both behaviors, add a 'pl_sparse_trees' flag to path_walk_info. When set (as in 'git backfill --sparse' and the --stdin-pl test helper mode), the sparse patterns prune both trees and blobs. When unset, only blobs are filtered and all trees are walked and reported. Additionally, move the SEEN flag assignment in add_tree_entries() to after the sparse pattern and pathspec checks. Previously, SEEN was set immediately upon discovering an object, before checking whether its path matched the sparse patterns. When the same object ID appeared at multiple paths (e.g. sibling directories with identical contents), the first path to be visited would mark the object as SEEN. If that path was outside the sparse cone, the object would be skipped there but also never discovered at its in-cone path. By deferring the SEEN flag until after the checks pass, objects that are skipped due to sparse filtering remain discoverable at other paths where they may be in scope. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2026-05-24 18:41:06 +09:00
Derrick Stolee	f1b5d3da16	path-walk: support blob size limit filter Extend the path-walk API to handle the 'blob:limit=<size>' object filter natively. This filter omits blobs whose size is equal to or greater than the given limit, matching the semantics used by the list-objects-filter machinery. When revs->filter.choice is LOFC_BLOB_LIMIT, the prepare_filters() method stores the limit value in info->blob_limit and clears the filter from revs. If the limit is zero, this degenerates to blob:none (all blobs excluded), so info->blobs is set to 0 instead. During walk_path(), blob batches are filtered before being delivered to the callback: each blob's size is checked via odb_read_object_info(), and only blobs strictly smaller than the limit are included. Blobs whose size cannot be determined (e.g. missing in a partial clone) are conservatively included, matching the existing filter behavior. Empty batches after filtering are skipped entirely. The check for inclusion in the path batch looks a little strange at first glance. We use odb_read_object_info() to read the object's size. Based on all of the assumptions to this point, this _should_ return OBJ_BLOB. Since we are focused on the size filter, we use a short-circuited OR (\|\|) to skip the size check if that method returns a different object type. Notice that this inspection of object sizes requires the content to be present in the repository. The odb_read_object_info() call will download a missing blob on-demand. This means that the use of the path-walk API within 'git backfill' would not operate nicely with this filter type. The intention of that command is to download missing blobs in batches. Downloading objects one-by-one would go against the point. Update the validation in 'git backfill' to add its own compatibility check on top of path_walk_filter_compatible(). Add tests for blob:limit=0 (equivalent to blob:none) and blob:limit=3 (which exercises partial filtering within a batch where some blobs are kept and others are excluded). Co-authored-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2026-05-24 18:41:06 +09:00
Derrick Stolee	6d87f0e8a3	path-walk: support blobless filter The 'git pack-objects' command can opt-in to using the path-walk API for scanning the objects. Currently, this option is dynamically disabled if combined with '--filter=<X>', even when using a simple filter such as 'blob:none' to signal a blobless packfile. This is a common scenario for repos at scale, so is worth integrating. Also, users can opt-in to the '--path-walk' option by default through the pack.usePathWalk=true config option. When using that in a blobless partial clone, the following warning can appear even though the user did not specify either option directly: warning: cannot use --filter with --path-walk Teach the path-walk API to handle the 'blob:none' object filter natively. When revs->filter.choice is LOFC_BLOB_NONE, the path-walk sets info->blobs to 0 (skipping all blob objects) and clears the filter from revs so that prepare_revision_walk() does not reject the configuration. This check is implemented in the static prepare_filters() method, which will simultaneously check if the input filters are compatible and will make the appropriate mutations to the path_walk_info and filters if the path_walk_info is non-NULL. This allows us to use this logic both in the API method path_walk_filter_compatible() for use in builtin/pack-objects.c and as a prep step in walk_objects_by_path(). Update the test helper (test-path-walk) to accept --filter=<spec> as a test-tool option (before '--'), applying it to revs after setup_revisions() to avoid the --objects requirement check. We can also revert recent GIT_TEST_PACK_PATH_WALK overrides in t5620. Also switch test-path-walk from REV_INFO_INIT with manual repo assignment to repo_init_revisions(), which properly initializes the filter_spec strbuf needed for filter parsing. Add tests for blob:none with --all and with a single branch. The performance test p5315 shows the impact of this change when using blobless filters: Test HEAD~1 HEAD --------------------------------------------------------------------- 5315.6: repack (blob:none) 13.53 13.87 +2.5% 5315.7: repack size (blob:none) 137.7M 137.8M +0.1% 5315.8: repack (blob:none, --path-walk) 13.51 23.43 +73.4% 5315.9: repack size (blob:none, --path-walk) 137.7M 115.2M -16.3% These performance tests were run on the Git repository. The --path-walk feature shows meaningful space savings (16% smaller for blobless packs) at the cost of increased computation time due to the two compression passes. This data demonstrates that the feature is engaged and provides real compression benefits when --no-reuse-delta forces fresh deltas. Co-Authored-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2026-05-24 18:41:06 +09:00
Derrick Stolee	7a7070eebc	path-walk: always emit directly-requested objects We are preparing to integrate the path-walk API with some --filter options in 'git pack-objects', but there is a subtle issue that is revealed when those are put together and the test suite is run with GIT_TEST_PACK_PATH_WALK=1. When a filter reduces the set of requested objects, this results in filtering out directly-requested objects, such as in the download of needed blobs in a blobless partial clone. The root cause is that the scan of pending objects in the path-walk API respects the filters set in the path_walk_info instead of overriding them for pending objects. We can tell that a path is part of the directly-referenced objects if its path name starts with '/' (other paths, including root trees never have this starting character). Create a path_is_for_direct_objects() to make this meaning clear, especially as we add more references in the future as we integrate the path-walk API with partial clone filter options. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2026-05-24 18:41:06 +09:00
Junio C Hamano	59063fe8b4	Merge branch 'yc/path-walk-fix-error-reporting' The value of a wrong pointer variable was referenced in an error message that reported that it shouldn't be NULL. * yc/path-walk-fix-error-reporting: path-walk: fix NULL pointer dereference in error message	2026-04-07 14:59:26 -07:00
Derrick Stolee	3f20c21a1c	path-walk: support wildcard pathspecs for blob filtering Previously, walk_objects_by_path() silently ignored pathspecs containing wildcards or magic by clearing them. This caused all blobs to be downloaded regardless of the given pathspec. Wildcard pathspecs like "d/file.*.txt" are useful for narrowing which blobs to process (e.g., during 'git backfill'). Support wildcard pathspecs by making two changes: 1. Add an 'exact_pathspecs' flag to path_walk_context. When the pathspec has no wildcards or magic, set this flag and use the existing fast-path prefix matching in add_tree_entries(). When wildcards are present, skip that block since prefix matching cannot handle glob patterns. 2. Add a match_pathspec() check in walk_path() to filter out blobs whose full path does not match the pathspec. This provides the actual blob-level filtering for wildcard pathspecs. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2026-03-26 09:38:07 -07:00
Derrick Stolee	7be182045a	backfill: work with prefix pathspecs The previous change allowed specifying revision arguments over the 'git backfill' command-line. This created the opportunity for restricting the initial commit set by filtering the revision walk through a pathspec. Other than filtering the commit set (and thereby the root trees), this did not restrict the path-walk implementation of 'git backfill' and did not restrict the blobs that were downloaded to only those matching the pathspec. Update the path-walk API to accept certain kinds of pathspecs and to silently ignore anything too complex, for now. We will update this in the next change to properly restrict to even complex pathspecs. The current behavior focuses on pathspecs that match paths exactly. This includes exact filenames, including directory names as prefixes. Pathspecs containing wildcards or magic are cleared so the path walk downloads all blobs, as before. The reason for this restriction is to allow for a faster execution by pruning the path walk to only trees that could contribute towards one of those paths as a parent directory. The test directory 'd/f/' (next to 'd/file*.txt') was prepared in a previous commit to exercise the subtlety in prefix matching. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2026-03-26 09:38:06 -07:00
Yuvraj Singh Chauhan	753ecf4205	path-walk: fix NULL pointer dereference in error message When lookup_tree() or lookup_blob() cannot find a tree entry's object, 'o' is set to NULL via: o = child ? &child->object : NULL; The subsequent null-check catches this correctly, but then dereferences 'o' to format the error message: error(_("failed to find object %s"), oid_to_hex(&o->oid)); This causes a segfault instead of the intended diagnostic output. Fix this by using &entry.oid instead. 'entry' is the struct name_entry populated by tree_entry() on each loop iteration and holds the OID of the failing lookup -- which is exactly what the error should report. This crash is reachable via git-backfill(1) when a tree entry's object is absent from the local object database. Signed-off-by: Yuvraj Singh Chauhan <ysinghcin@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2026-03-20 10:48:34 -07:00
René Scharfe	d473865154	path-walk: use repo_parse_tree_gently() Use the passed in repository instead of the implicit the_repository when parsing the tree. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2026-01-09 18:36:17 -08:00
Derrick Stolee	93afe9b060	path-walk: create initializer for path lists The previous change fixed a bug in 'git repack -adf --path-walk' that was due to an update to how path lists are initialized and missing some important cases when processing the pending objects. This change takes the three critical places where path lists are initialized and combines them into a static method. This simplifies the callers somewhat while also helping to avoid a missed update in the future. The other places where a path list (struct type_and_oid_list) is initialized is for the following "fixed" lists: * Tag objects. * Commit objects. * Root trees. * Tagged trees. * Tagged blobs. These lists are created and consumed in different ways, with only the root trees being passed into the logic that cares about the "maybe_interesting" bit. It is appropriate to keep these uses separate. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2025-08-25 09:01:17 -07:00
Derrick Stolee	febb9d87df	path-walk: fix setup of pending objects Users reported an issue where objects were missing from their local repositories after a full repack using 'git repack -adf --path-walk'. This was alarming and took a while to create a reproducer. Here, we fix the bug and include a test case that would fail without this fix. The root cause is that certain objects existed in the index and had no second versions. These objects are usually blobs, though trees can be included if a cache-tree exists. The issue is that the revision walk adds these objects to the "pending" list and the path-walk API forgets to mark the lists it creates at this point as "maybe_interesting". If these paths only ever have a single version in the history of the repo (including the current staged version) then the parent directory never tries to add a new object to the list and mark the list as "maybe_interesting". Thus, when walking the list later, the group is skipped as it is expected that no objects are interesting. This happens even when there are actually no UNINTERESTING objects at all! This is based on the optimization enabled by the pack.useSparse=true config option, which is the default. Thus, we create a test case that demonstrates the many cases of this issue for reproducibility: 1. File a/b/c has only one committed version. 2. Files a/i and x/y only exist as staged changes. 3. Tree x/ only exists in the cache-tree. After performing a non-path-walk repack to force all loose objects into packfiles, run a --path-walk repack followed by 'git fsck'. This fsck is what fails with the following errors: error: invalid object 100644 f2e41136... for 'a/b/c' This is the dropped instance of the single-versioned a/b/c file. broken link from tree cfda31d8... to tree 3f725fcd... This is the missing tree for the single-versioned a/b/ directory. missing blob 0ddf2bae... (a/i) missing blob 975fbec8... (x/y) missing blob a60d869d... (file) missing blob f2e41136... (a/b/c) missing tree 3f725fcd... (a/b/) dangling tree 5896d7e... (staged root tree) Note that since the staged root tree is missing, the fsck output cannot even report that the staged x/ tree is missing as well. The core problem here is that the "maybe_interesting" member of 'struct type_and_oid_list' is not initialized to '1'. This member was added in `6333e7ae0b` (path-walk: mark trees and blobs as UNINTERESTING, 2024-12-20) in a way to help when creating packfiles for a small commit range using the sparse path algorithm (enabled by pack.useSparse=true). The idea here is that the list is marked as "maybe_interesting" if an object is added that does not have the UNINTERESTING flag on it. Later, this is checked again in case all objects in the list were marked UNINTERESTING after that point in time. In this case, the algorithm skips the list as there is no reason to visit it. This leads to the problem where the "maybe_interesting" member was not appropriately initialized when the list is created from pending objects. Initializing this in the correct places fixes the bug. To reduce risk of similar bugs around initializing this structure, a follow-up change will make initializing lists use a shared method. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2025-08-25 09:01:17 -07:00
Derrick Stolee	4705889c3d	path-walk: add new 'edge_aggressive' option In preparation for allowing both the --shallow and --path-walk options in the 'git pack-objects' builtin, create a new 'edge_aggressive' option in the path-walk API. This option will help walk the boundary more thoroughly and help avoid sending extra objects during fetches and pushes. The only use of the 'edge_hint_aggressive' option in the revision API is within mark_edges_uninteresting(), which is usually called before between prepare_revision_walk() and before visiting commits with get_revision(). In prepare_revision_walk(), the UNINTERESTING commits are walked until a boundary is found. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2025-05-16 12:15:40 -07:00
Derrick Stolee	bff4555767	backfill: add --sparse option One way to significantly reduce the cost of a Git clone and later fetches is to use a blobless partial clone and combine that with a sparse-checkout that reduces the paths that need to be populated in the working directory. Not only does this reduce the cost of clones and fetches, the sparse-checkout reduces the number of objects needed to download from a promisor remote. However, history investigations can be expensive as computing blob diffs will trigger promisor remote requests for one object at a time. This can be avoided by downloading the blobs needed for the given sparse-checkout using 'git backfill' and its new '--sparse' mode, at a time that the user is willing to pay that extra cost. Note that this is distinctly different from the '--filter=sparse:<oid>' option, as this assumes that the partial clone has all reachable trees and we are using client-side logic to avoid downloading blobs outside of the sparse-checkout cone. This avoids the server-side cost of walking trees while also achieving a similar goal. It also downloads in batches based on similar path names, presenting a resumable download if things are interrupted. This augments the path-walk API to have a possibly-NULL 'pl' member that may point to a 'struct pattern_list'. This could be more general than the sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently the only consumer. Be sure to test this in both cone mode and not cone mode. Cone mode has the benefit that the path-walk can skip certain paths once they would expand beyond the sparse-checkout. Non-cone mode can describe the included files using both positive and negative patterns, which changes the possible return values of path_matches_pattern_list(). Test both kinds of matches for increased coverage. To test this, we can create a blobless sparse clone, expand the sparse-checkout slightly, and then run 'git backfill --sparse' to see how much data is downloaded. The general steps are 1. git clone --filter=blob:none --sparse <url> 2. git sparse-checkout set <dir1> ... <dirN> 3. git backfill --sparse For the Git repository with the 'builtin' directory in the sparse-checkout, we get these results for various batch sizes: \| Batch Size \| Pack Count \| Pack Size \| Time \| \|-----------------\|------------\|-----------\|-------\| \| (Initial clone) \| 3 \| 110 MB \| \| \| 10K \| 12 \| 192 MB \| 17.2s \| \| 15K \| 9 \| 192 MB \| 15.5s \| \| 20K \| 8 \| 192 MB \| 15.5s \| \| 25K \| 7 \| 192 MB \| 14.7s \| This case matters less because a full clone of the Git repository from GitHub is currently at 277 MB. Using a copy of the Linux repository with the 'kernel/' directory in the sparse-checkout, we get these results: \| Batch Size \| Pack Count \| Pack Size \| Time \| \|-----------------\|------------\|-----------\|------\| \| (Initial clone) \| 2 \| 1,876 MB \| \| \| 10K \| 11 \| 2,187 MB \| 46s \| \| 25K \| 7 \| 2,188 MB \| 43s \| \| 50K \| 5 \| 2,194 MB \| 44s \| \| 100K \| 4 \| 2,194 MB \| 48s \| This case is more meaningful because a full clone of the Linux repository is currently over 6 GB, so this is a valuable way to download a fraction of the repository and no longer need network access for all reachable objects within the sparse-checkout. Choosing a batch size will depend on a lot of factors, including the user's network speed or reliability, the repository's file structure, and how many versions there are of the file within the sparse-checkout scope. There will not be a one-size-fits-all solution. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2025-02-03 16:12:42 -08:00
Jeff King	b224e8e36c	path-walk: drop redundant parse_tree() call This call to parse_tree() was flagged by Coverity for ignoring the return value. But if we look a little further up the function, we can see that there is already a call to parse_tree_gently(), and we'll return early if that fails. So by this point the tree will always be parsed, and the call is redundant. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2025-01-22 17:52:44 -08:00
Derrick Stolee	71edf6c3c8	path-walk: reorder object visits The path-walk API currently uses a stack-based approach to recursing through the list of paths within the repository. This guarantees that after a tree path is explored, all paths contained within that tree path will be explored before continuing to explore siblings of that tree path. The initial motivation of this depth-first approach was to minimize memory pressure while exploring the repository. A breadth-first approach would have too many "active" paths being stored in the paths_to_lists map. We can take this approach one step further by making sure that blob paths are visited before tree paths. This allows the API to free the memory for these blob objects before continuing to perform the depth-first search. This modifies the order in which we visit siblings, but does not change the fact that we are performing depth-first search. To achieve this goal, use a priority queue with a custom sorting method. The sort needs to handle tags, blobs, and trees (commits are handled slightly differently). When objects share a type, we can sort by path name. This will keep children of the latest path to leave the stack be preferred over the rest of the paths in the stack, since they agree in prefix up to and including a directory separator. When the types are different, we can prefer tags over other types and blobs over trees. This causes significant adjustments to t6601-path-walk.sh to rearrange the order of the visited paths. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2024-12-20 08:37:05 -08:00
Derrick Stolee	6333e7ae0b	path-walk: mark trees and blobs as UNINTERESTING When the input rev_info has UNINTERESTING starting points, we want to be sure that the UNINTERESTING flag is passed appropriately through the objects. To match how this is done in places such as 'git pack-objects', we use the mark_edges_uninteresting() method. This method has an option for using the "sparse" walk, which is similar in spirit to the path-walk API's walk. To be sure to keep it independent, add a new 'prune_all_uninteresting' option to the path_walk_info struct. To check how the UNINTERSTING flag is spread through our objects, extend the 'test-tool path-walk' command to output whether or not an object has that flag. This changes our tests significantly, including the removal of some objects that were previously visited due to the incomplete implementation. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2024-12-20 08:37:05 -08:00
Derrick Stolee	9145660979	path-walk: visit tags and cached objects The rev_info that is specified for a path-walk traversal may specify visiting tag refs (both lightweight and annotated) and also may specify indexed objects (blobs and trees). Update the path-walk API to walk these objects as well. When walking tags, we need to peel the annotated objects until reaching a non-tag object. If we reach a commit, then we can add it to the pending objects to make sure we visit in the commit walk portion. If we reach a tree, then we will assume that it is a root tree. If we reach a blob, then we have no good path name and so add it to a new list of "tagged blobs". When the rev_info includes the "--indexed-objects" flag, then the pending set includes blobs and trees found in the cache entries and cache-tree. The cache entries are usually blobs, though they could be trees in the case of a sparse index. The cache-tree stores previously-hashed tree objects but these are cleared out when staging objects below those paths. We add tests that demonstrate this. The indexed objects come with a non-NULL 'path' value in the pending item. This allows us to prepopulate the 'path_to_lists' strmap with lists for these paths. The tricky thing about this walk is that we will want to combine the indexed objects walk with the commit walk, especially in the future case of walking objects during a command like 'git repack'. Whenever possible, we want the objects from the index to be grouped with similar objects in history. We don't want to miss any paths that appear only in the index and not in the commit history. Thus, we need to be careful to let the path stack be populated initially with only the root tree path (and possibly tags and tagged blobs) and go through the normal depth-first search. Afterwards, if there are other paths that are remaining in the paths_to_lists strmap, we should then iterate through the stack and visit those objects recursively. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2024-12-20 08:37:05 -08:00
Derrick Stolee	c8dba310d7	path-walk: allow consumer to specify object types We add the ability to filter the object types in the path-walk API so the callback function is called fewer times. This adds the ability to ask for the commits in a list, as well. We re-use the empty string for this set of objects because these are passed directly to the callback function instead of being part of the 'path_stack'. Future changes will add the ability to visit annotated tags. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2024-12-20 08:37:05 -08:00
Derrick Stolee	9d46bc791b	path-walk: introduce an object walk by path In anticipation of a few planned applications, introduce the most basic form of a path-walk API. It currently assumes that there are no UNINTERESTING objects, and does not include any complicated filters. It calls a function pointer on groups of tree and blob objects as grouped by path. This only includes objects the first time they are discovered, so an object that appears at multiple paths will not be included in two batches. These batches are collected in 'struct type_and_oid_list' objects, which store an object type and an oid_array of objects. The data structures are documented in 'struct path_walk_context', but in summary the most important are: * 'paths_to_lists' is a strmap that connects a path to a type_and_oid_list for that path. To avoid conflicts in path names, we make sure that tree paths end in "/" (except the root path with is an empty string) and blob paths do not end in "/". * 'path_stack' is a string list that is added to in an append-only way. This stores the stack of our depth-first search on the heap instead of using recursion. * 'path_stack_pushed' is a strmap that stores path names that were already added to 'path_stack', to avoid repeating paths in the stack. Mostly, this saves us from quadratic lookups from doing unsorted checks into the string_list. The coupling of 'path_stack' and 'path_stack_pushed' is protected by the push_to_stack() method. Call this instead of inserting into these structures directly. The walk_objects_by_path() method initializes these structures and starts walking commits from the given rev_info struct. The commits are used to find the list of root trees which populate the start of our depth-first search. The core of our depth-first search is in a while loop that continues while we have not indicated an early exit and our 'path_stack' still has entries in it. The loop body pops a path off of the stack and "visits" the path via the walk_path() method. The walk_path() method gets the list of OIDs from the 'path_to_lists' strmap and executes the callback method on that list with the given path and type. If the OIDs correspond to tree objects, then iterate over all trees in the list and run add_children() to add the child objects to their own lists, adding new entries to the stack if necessary. In testing, this depth-first search approach was the one that used the least memory while iterating over the object lists. There is still a chance that repositories with too-wide path patterns could cause memory pressure issues. Limiting the stack size could be done in the future by limiting how many objects are being considered in-progress, or by visiting blob paths earlier than trees. There are many future adaptations that could be made, but they are left for future updates when consumers are ready to take advantage of those features. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2024-12-20 08:37:04 -08:00

23 Commits