mirror of
https://github.com/git-for-windows/git.git
synced 2026-06-18 01:39:20 -05:00
e9ae852cd992cc34abc082d4eef8aeae48b25feb
11 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
2b8d07ef91 |
path-walk: support object:type filter
The `object:type` filter accepts only objects of a single type; it is
the second member of the object-info-only filter family that bitmap
traversal already supports.
Like `blob:none` and `tree:0`, it can be evaluated with nothing more
than the object's type, which is exactly the granularity path-walk's
existing info->{commits,trees,blobs,tags} flags already control.
Map `LOFC_OBJECT_TYPE` in `prepare_filters()` by AND-ing each flag
against the filtered type. A single `object:type=X` filter
applied to the default info (all flags = 1) leaves `info->X = 1` and
all the others 0, which is what we want.
Using an AND rather than straight assignment prepares us for a
subsequent change to implement combined object filters.
The path-walk machinery is mostly already wired for the per-type
distinction:
- `walk_path()` calls `path_fn` for a batch only when the corresponding
`info->X` flag is set, so unwanted types are silently not reported.
- `add_tree_entries()` skips tree entries of type `OBJ_BLOB` when
`info->blobs` is unset, so we don't even allocate paths for them.
- The commit-walk loop short-circuits the root-tree fetch when
`!info->trees && !info->blobs`, so commit-only filters don't descend
into trees at all.
But there are a couple of side effects of the "trees off, blobs on" case
that need fixing:
1. 'setup_pending_objects()' previously skipped pending trees as soon
as `info->trees` was zero. For 'object:type=blob' the call site
needs those pending trees: a lightweight tag pointing to a tree, or
an annotated tag whose peeled target is a tree, can both reach
blobs that are otherwise unreachable from any commit's root tree.
Loosen the gate to "if (!info->trees && !info->blobs) continue" and
similarly retrieve the root_tree_list whenever either trees or
blobs are wanted.
2. The revision machinery's `handle_commit()` drops pending trees when
`revs->tree_objects` is zero (see the 'OBJ_TREE' handler in
revision.c), so by the time path-walk sees the pending list
after `prepare_revision_walk()` the tree-bearing pendings would
already be gone. Fix this by setting
revs->tree_objects = info->trees || info->blobs
so pending trees survive `prepare_revision_walk()` whenever we
need to walk into them. Path-walk still resets tree_objects to
zero immediately after `prepare_revision_walk()` returns, so the
rev-walk itself never enumerates trees redundantly with
path-walk's own descent.
Add coverage in t6601 for each of the four `object:type` values. The
'object:type=blob' test in particular asserts that file2 and child/file
(both reachable only through tag-pointed trees) show up in the output,
exercising the pending-tree fix.
Update Documentation/git-pack-objects.adoc to add object:type to
the list of supported --filter forms.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
||
|
|
8ff8de7616 |
path-walk: add pl_sparse_trees to control tree pruning
The path-walk API prunes trees and blobs when a sparse-checkout pattern list is provided, which is the correct behavior for 'git backfill --sparse' since it only needs to fill in objects at paths within the sparse cone. However, a future change will use the path-walk API with a sparse:<oid> filter that restricts only blobs while retaining all reachable trees. To support both behaviors, add a 'pl_sparse_trees' flag to path_walk_info. When set (as in 'git backfill --sparse' and the --stdin-pl test helper mode), the sparse patterns prune both trees and blobs. When unset, only blobs are filtered and all trees are walked and reported. Additionally, move the SEEN flag assignment in add_tree_entries() to after the sparse pattern and pathspec checks. Previously, SEEN was set immediately upon discovering an object, before checking whether its path matched the sparse patterns. When the same object ID appeared at multiple paths (e.g. sibling directories with identical contents), the first path to be visited would mark the object as SEEN. If that path was outside the sparse cone, the object would be skipped there but also never discovered at its in-cone path. By deferring the SEEN flag until after the checks pass, objects that are skipped due to sparse filtering remain discoverable at other paths where they may be in scope. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
f1b5d3da16 |
path-walk: support blob size limit filter
Extend the path-walk API to handle the 'blob:limit=<size>' object filter natively. This filter omits blobs whose size is equal to or greater than the given limit, matching the semantics used by the list-objects-filter machinery. When revs->filter.choice is LOFC_BLOB_LIMIT, the prepare_filters() method stores the limit value in info->blob_limit and clears the filter from revs. If the limit is zero, this degenerates to blob:none (all blobs excluded), so info->blobs is set to 0 instead. During walk_path(), blob batches are filtered before being delivered to the callback: each blob's size is checked via odb_read_object_info(), and only blobs strictly smaller than the limit are included. Blobs whose size cannot be determined (e.g. missing in a partial clone) are conservatively included, matching the existing filter behavior. Empty batches after filtering are skipped entirely. The check for inclusion in the path batch looks a little strange at first glance. We use odb_read_object_info() to read the object's size. Based on all of the assumptions to this point, this _should_ return OBJ_BLOB. Since we are focused on the size filter, we use a short-circuited OR (||) to skip the size check if that method returns a different object type. Notice that this inspection of object sizes requires the content to be present in the repository. The odb_read_object_info() call will download a missing blob on-demand. This means that the use of the path-walk API within 'git backfill' would not operate nicely with this filter type. The intention of that command is to download missing blobs in batches. Downloading objects one-by-one would go against the point. Update the validation in 'git backfill' to add its own compatibility check on top of path_walk_filter_compatible(). Add tests for blob:limit=0 (equivalent to blob:none) and blob:limit=3 (which exercises partial filtering within a batch where some blobs are kept and others are excluded). Co-authored-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
6d87f0e8a3 |
path-walk: support blobless filter
The 'git pack-objects' command can opt-in to using the path-walk API for scanning the objects. Currently, this option is dynamically disabled if combined with '--filter=<X>', even when using a simple filter such as 'blob:none' to signal a blobless packfile. This is a common scenario for repos at scale, so is worth integrating. Also, users can opt-in to the '--path-walk' option by default through the pack.usePathWalk=true config option. When using that in a blobless partial clone, the following warning can appear even though the user did not specify either option directly: warning: cannot use --filter with --path-walk Teach the path-walk API to handle the 'blob:none' object filter natively. When revs->filter.choice is LOFC_BLOB_NONE, the path-walk sets info->blobs to 0 (skipping all blob objects) and clears the filter from revs so that prepare_revision_walk() does not reject the configuration. This check is implemented in the static prepare_filters() method, which will simultaneously check if the input filters are compatible and will make the appropriate mutations to the path_walk_info and filters if the path_walk_info is non-NULL. This allows us to use this logic both in the API method path_walk_filter_compatible() for use in builtin/pack-objects.c and as a prep step in walk_objects_by_path(). Update the test helper (test-path-walk) to accept --filter=<spec> as a test-tool option (before '--'), applying it to revs after setup_revisions() to avoid the --objects requirement check. We can also revert recent GIT_TEST_PACK_PATH_WALK overrides in t5620. Also switch test-path-walk from REV_INFO_INIT with manual repo assignment to repo_init_revisions(), which properly initializes the filter_spec strbuf needed for filter parsing. Add tests for blob:none with --all and with a single branch. The performance test p5315 shows the impact of this change when using blobless filters: Test HEAD~1 HEAD --------------------------------------------------------------------- 5315.6: repack (blob:none) 13.53 13.87 +2.5% 5315.7: repack size (blob:none) 137.7M 137.8M +0.1% 5315.8: repack (blob:none, --path-walk) 13.51 23.43 +73.4% 5315.9: repack size (blob:none, --path-walk) 137.7M 115.2M -16.3% These performance tests were run on the Git repository. The --path-walk feature shows meaningful space savings (16% smaller for blobless packs) at the cost of increased computation time due to the two compression passes. This data demonstrates that the feature is engaged and provides real compression benefits when --no-reuse-delta forces fresh deltas. Co-Authored-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
7a7070eebc |
path-walk: always emit directly-requested objects
We are preparing to integrate the path-walk API with some --filter options in 'git pack-objects', but there is a subtle issue that is revealed when those are put together and the test suite is run with GIT_TEST_PACK_PATH_WALK=1. When a filter reduces the set of requested objects, this results in filtering out directly-requested objects, such as in the download of needed blobs in a blobless partial clone. The root cause is that the scan of pending objects in the path-walk API respects the filters set in the path_walk_info instead of overriding them for pending objects. We can tell that a path is part of the directly-referenced objects if its path name starts with '/' (other paths, including root trees never have this starting character). Create a path_is_for_direct_objects() to make this meaning clear, especially as we add more references in the future as we integrate the path-walk API with partial clone filter options. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
4705889c3d |
path-walk: add new 'edge_aggressive' option
In preparation for allowing both the --shallow and --path-walk options in the 'git pack-objects' builtin, create a new 'edge_aggressive' option in the path-walk API. This option will help walk the boundary more thoroughly and help avoid sending extra objects during fetches and pushes. The only use of the 'edge_hint_aggressive' option in the revision API is within mark_edges_uninteresting(), which is usually called before between prepare_revision_walk() and before visiting commits with get_revision(). In prepare_revision_walk(), the UNINTERESTING commits are walked until a boundary is found. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
bff4555767 |
backfill: add --sparse option
One way to significantly reduce the cost of a Git clone and later fetches is to use a blobless partial clone and combine that with a sparse-checkout that reduces the paths that need to be populated in the working directory. Not only does this reduce the cost of clones and fetches, the sparse-checkout reduces the number of objects needed to download from a promisor remote. However, history investigations can be expensive as computing blob diffs will trigger promisor remote requests for one object at a time. This can be avoided by downloading the blobs needed for the given sparse-checkout using 'git backfill' and its new '--sparse' mode, at a time that the user is willing to pay that extra cost. Note that this is distinctly different from the '--filter=sparse:<oid>' option, as this assumes that the partial clone has all reachable trees and we are using client-side logic to avoid downloading blobs outside of the sparse-checkout cone. This avoids the server-side cost of walking trees while also achieving a similar goal. It also downloads in batches based on similar path names, presenting a resumable download if things are interrupted. This augments the path-walk API to have a possibly-NULL 'pl' member that may point to a 'struct pattern_list'. This could be more general than the sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently the only consumer. Be sure to test this in both cone mode and not cone mode. Cone mode has the benefit that the path-walk can skip certain paths once they would expand beyond the sparse-checkout. Non-cone mode can describe the included files using both positive and negative patterns, which changes the possible return values of path_matches_pattern_list(). Test both kinds of matches for increased coverage. To test this, we can create a blobless sparse clone, expand the sparse-checkout slightly, and then run 'git backfill --sparse' to see how much data is downloaded. The general steps are 1. git clone --filter=blob:none --sparse <url> 2. git sparse-checkout set <dir1> ... <dirN> 3. git backfill --sparse For the Git repository with the 'builtin' directory in the sparse-checkout, we get these results for various batch sizes: | Batch Size | Pack Count | Pack Size | Time | |-----------------|------------|-----------|-------| | (Initial clone) | 3 | 110 MB | | | 10K | 12 | 192 MB | 17.2s | | 15K | 9 | 192 MB | 15.5s | | 20K | 8 | 192 MB | 15.5s | | 25K | 7 | 192 MB | 14.7s | This case matters less because a full clone of the Git repository from GitHub is currently at 277 MB. Using a copy of the Linux repository with the 'kernel/' directory in the sparse-checkout, we get these results: | Batch Size | Pack Count | Pack Size | Time | |-----------------|------------|-----------|------| | (Initial clone) | 2 | 1,876 MB | | | 10K | 11 | 2,187 MB | 46s | | 25K | 7 | 2,188 MB | 43s | | 50K | 5 | 2,194 MB | 44s | | 100K | 4 | 2,194 MB | 48s | This case is more meaningful because a full clone of the Linux repository is currently over 6 GB, so this is a valuable way to download a fraction of the repository and no longer need network access for all reachable objects within the sparse-checkout. Choosing a batch size will depend on a lot of factors, including the user's network speed or reliability, the repository's file structure, and how many versions there are of the file within the sparse-checkout scope. There will not be a one-size-fits-all solution. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
6333e7ae0b |
path-walk: mark trees and blobs as UNINTERESTING
When the input rev_info has UNINTERESTING starting points, we want to be sure that the UNINTERESTING flag is passed appropriately through the objects. To match how this is done in places such as 'git pack-objects', we use the mark_edges_uninteresting() method. This method has an option for using the "sparse" walk, which is similar in spirit to the path-walk API's walk. To be sure to keep it independent, add a new 'prune_all_uninteresting' option to the path_walk_info struct. To check how the UNINTERSTING flag is spread through our objects, extend the 'test-tool path-walk' command to output whether or not an object has that flag. This changes our tests significantly, including the removal of some objects that were previously visited due to the incomplete implementation. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
9145660979 |
path-walk: visit tags and cached objects
The rev_info that is specified for a path-walk traversal may specify visiting tag refs (both lightweight and annotated) and also may specify indexed objects (blobs and trees). Update the path-walk API to walk these objects as well. When walking tags, we need to peel the annotated objects until reaching a non-tag object. If we reach a commit, then we can add it to the pending objects to make sure we visit in the commit walk portion. If we reach a tree, then we will assume that it is a root tree. If we reach a blob, then we have no good path name and so add it to a new list of "tagged blobs". When the rev_info includes the "--indexed-objects" flag, then the pending set includes blobs and trees found in the cache entries and cache-tree. The cache entries are usually blobs, though they could be trees in the case of a sparse index. The cache-tree stores previously-hashed tree objects but these are cleared out when staging objects below those paths. We add tests that demonstrate this. The indexed objects come with a non-NULL 'path' value in the pending item. This allows us to prepopulate the 'path_to_lists' strmap with lists for these paths. The tricky thing about this walk is that we will want to combine the indexed objects walk with the commit walk, especially in the future case of walking objects during a command like 'git repack'. Whenever possible, we want the objects from the index to be grouped with similar objects in history. We don't want to miss any paths that appear only in the index and not in the commit history. Thus, we need to be careful to let the path stack be populated initially with only the root tree path (and possibly tags and tagged blobs) and go through the normal depth-first search. Afterwards, if there are other paths that are remaining in the paths_to_lists strmap, we should then iterate through the stack and visit those objects recursively. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
c8dba310d7 |
path-walk: allow consumer to specify object types
We add the ability to filter the object types in the path-walk API so the callback function is called fewer times. This adds the ability to ask for the commits in a list, as well. We re-use the empty string for this set of objects because these are passed directly to the callback function instead of being part of the 'path_stack'. Future changes will add the ability to visit annotated tags. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
||
|
|
9d46bc791b |
path-walk: introduce an object walk by path
In anticipation of a few planned applications, introduce the most basic form
of a path-walk API. It currently assumes that there are no UNINTERESTING
objects, and does not include any complicated filters. It calls a function
pointer on groups of tree and blob objects as grouped by path. This only
includes objects the first time they are discovered, so an object that
appears at multiple paths will not be included in two batches.
These batches are collected in 'struct type_and_oid_list' objects, which
store an object type and an oid_array of objects.
The data structures are documented in 'struct path_walk_context', but in
summary the most important are:
* 'paths_to_lists' is a strmap that connects a path to a
type_and_oid_list for that path. To avoid conflicts in path names,
we make sure that tree paths end in "/" (except the root path with
is an empty string) and blob paths do not end in "/".
* 'path_stack' is a string list that is added to in an append-only
way. This stores the stack of our depth-first search on the heap
instead of using recursion.
* 'path_stack_pushed' is a strmap that stores path names that were
already added to 'path_stack', to avoid repeating paths in the
stack. Mostly, this saves us from quadratic lookups from doing
unsorted checks into the string_list.
The coupling of 'path_stack' and 'path_stack_pushed' is protected by the
push_to_stack() method. Call this instead of inserting into these
structures directly.
The walk_objects_by_path() method initializes these structures and
starts walking commits from the given rev_info struct. The commits are
used to find the list of root trees which populate the start of our
depth-first search.
The core of our depth-first search is in a while loop that continues
while we have not indicated an early exit and our 'path_stack' still has
entries in it. The loop body pops a path off of the stack and "visits"
the path via the walk_path() method.
The walk_path() method gets the list of OIDs from the 'path_to_lists'
strmap and executes the callback method on that list with the given path
and type. If the OIDs correspond to tree objects, then iterate over all
trees in the list and run add_children() to add the child objects to
their own lists, adding new entries to the stack if necessary.
In testing, this depth-first search approach was the one that used the
least memory while iterating over the object lists. There is still a
chance that repositories with too-wide path patterns could cause memory
pressure issues. Limiting the stack size could be done in the future by
limiting how many objects are being considered in-progress, or by
visiting blob paths earlier than trees.
There are many future adaptations that could be made, but they are left for
future updates when consumers are ready to take advantage of those features.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|