git/path-walk.h
Derrick Stolee 8a25c02265 backfill: add --sparse option
One way to significantly reduce the cost of a Git clone and later fetches is
to use a blobless partial clone and combine that with a sparse-checkout that
reduces the paths that need to be populated in the working directory. This
reduces the cost of clones and fetches, and the sparse-checkout also reduces
the number of objects that must be downloaded from a promisor remote.

However, history investigations can be expensive as computing blob diffs will
trigger promisor remote requests for one object at a time. This can be
avoided by downloading the blobs needed for the given sparse-checkout using
'git backfill' and its new '--sparse' mode, at a time that the user is
willing to pay that extra cost.

Note that this is distinctly different from the '--filter=sparse:<oid>'
option, as this assumes that the partial clone has all reachable trees and
we are using client-side logic to avoid downloading blobs outside of the
sparse-checkout cone. This avoids the server-side cost of walking trees
while also achieving a similar goal. It also downloads in batches based on
similar path names, presenting a resumable download if things are
interrupted.

This augments the path-walk API to have a possibly-NULL 'pl' member that may
point to a 'struct pattern_list'. This could be more general than the
sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently
the only consumer.

Be sure to test this in both cone mode and not cone mode. Cone mode has the
benefit that the path-walk can skip certain paths once they would expand
beyond the sparse-checkout.
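
As a rough illustration of why cone mode allows this pruning, the sketch
below uses a hypothetical helper, cone_should_descend(), with a plain array
of directory prefixes standing in for 'struct pattern_list'. It is not Git's
pattern-matching code. It only shows the idea that a directory needs to be
walked when it lies inside a cone directory or is an ancestor of one. All
names and paths in it are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a cone-mode sparse-checkout definition: a list
 * of directories, each written with a trailing slash.
 */
const char *const example_cone[] = { "src/lib/" };

/*
 * Return 1 if a walk restricted to 'cone' still has to descend into 'dir'
 * (also written with a trailing slash, "" for the root), 0 if the whole
 * directory can be skipped.
 */
int cone_should_descend(const char *dir, const char *const *cone, size_t nr)
{
	size_t dlen;

	assert(dir && cone);
	dlen = strlen(dir);

	for (size_t i = 0; i < nr; i++) {
		size_t clen = strlen(cone[i]);

		/* 'dir' is the cone directory itself or lies below it. */
		if (!strncmp(dir, cone[i], clen))
			return 1;
		/* 'dir' is an ancestor of the cone directory. */
		if (!strncmp(dir, cone[i], dlen))
			return 1;
	}
	return 0;
}
```

With the example cone above, "src/" and "src/lib/util/" must be walked,
while "docs/" and "src/libfoo/" can be skipped without opening their trees.
Arbitrary patterns outside cone mode offer no such prefix test, so every
tree has to be explored.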

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2025-02-06 19:33:02 +01:00

/*
 * path-walk.h : Methods and structures for walking the object graph in batches
 * by the paths that can reach those objects.
 */
#include "object.h" /* Required for 'enum object_type'. */
struct rev_info;
struct oid_array;
struct pattern_list;

/**
 * The type of a function pointer for the method that is called on a list of
 * objects reachable at a given path.
 */
typedef int (*path_fn)(const char *path,
		       struct oid_array *oids,
		       enum object_type type,
		       void *data);

struct path_walk_info {
	/**
	 * revs provides the definitions for the commit walk, including
	 * which commits are UNINTERESTING or not.
	 */
	struct rev_info *revs;

	/**
	 * The caller wishes to execute custom logic on objects reachable at a
	 * given path. Every reachable object will be visited exactly once, and
	 * the first path to see an object wins. This may not be a stable choice.
	 */
	path_fn path_fn;
	void *path_fn_data;

	/**
	 * Initialize which object types the path_fn should be called on. This
	 * could also limit the walk to skip blobs if not set.
	 */
	int commits;
	int trees;
	int blobs;
	int tags;

	/**
	 * When 'prune_all_uninteresting' is set and a path has all objects
	 * marked as UNINTERESTING, then the path-walk will not visit those
	 * objects. It will not call path_fn on those objects and will not
	 * walk the children of such trees.
	 */
	int prune_all_uninteresting;

	/**
	 * Specify a sparse-checkout definition to match our paths to. Do not
	 * walk outside of this sparse definition. If the patterns are in
	 * cone mode, then the search may prune directories that are outside
	 * of the cone. If not in cone mode, then all tree paths will be
	 * explored but the path_fn will only be called when the path matches
	 * the sparse-checkout patterns.
	 */
	struct pattern_list *pl;
};

#define PATH_WALK_INFO_INIT {	\
	.blobs = 1,		\
	.trees = 1,		\
	.commits = 1,		\
	.tags = 1,		\
}

/**
 * Given the configuration of 'info', walk the commits based on 'info->revs' and
 * call 'info->path_fn' on each discovered path.
 *
 * Returns nonzero on an error.
 */
int walk_objects_by_path(struct path_walk_info *info);
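
The callback contract above can be illustrated outside of Git with a small,
self-contained sketch. Everything below other than the path_fn signature is an
assumption made for the example: the enum and 'struct oid_array' are
simplified stand-ins for Git's internal definitions, and count_blobs(),
'struct batch_stats', 'struct fake_batch', simulate_walk() and run_demo() are
hypothetical names. In particular, simulate_walk() is not
walk_objects_by_path(). It only mimics the shape of the calls a path_fn
receives, one call per path with the batch of objects found at that path.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for Git's internal types (assumptions of this sketch). */
enum object_type { OBJ_COMMIT = 1, OBJ_TREE = 2, OBJ_BLOB = 3, OBJ_TAG = 4 };
struct oid_array { size_t nr; }; /* the real struct also stores the object IDs */

typedef int (*path_fn)(const char *path, struct oid_array *oids,
		       enum object_type type, void *data);

/* Hypothetical caller state, handed back to the callback through 'data'. */
struct batch_stats {
	size_t paths; /* number of blob batches seen */
	size_t blobs; /* total blob IDs across those batches */
};

/* A path_fn that tallies blob batches and ignores every other type. */
static int count_blobs(const char *path, struct oid_array *oids,
		       enum object_type type, void *data)
{
	struct batch_stats *stats = data;

	assert(path && oids && stats);
	if (type != OBJ_BLOB)
		return 0;
	stats->paths++;
	stats->blobs += oids->nr;
	return 0;
}

/* Hypothetical batch, standing in for what a walk would discover. */
struct fake_batch {
	const char *path;
	enum object_type type;
	size_t nr;
};

/*
 * Stand-in driver: one callback per batch, stopping at the first nonzero
 * return. That stopping rule is an assumption of this sketch.
 */
static int simulate_walk(const struct fake_batch *batches, size_t nr,
			 path_fn fn, void *data)
{
	for (size_t i = 0; i < nr; i++) {
		struct oid_array oids = { .nr = batches[i].nr };
		int ret = fn(batches[i].path, &oids, batches[i].type, data);

		if (ret)
			return ret;
	}
	return 0;
}

/* Drive the callback over three made-up batches and return the tally. */
struct batch_stats run_demo(void)
{
	static const struct fake_batch batches[] = {
		{ "src/", OBJ_TREE, 1 },
		{ "src/main.c", OBJ_BLOB, 3 },
		{ "README", OBJ_BLOB, 2 },
	};
	struct batch_stats stats = { 0, 0 };

	simulate_walk(batches, 3, count_blobs, &stats);
	return stats;
}
```

A real caller would instead start from PATH_WALK_INFO_INIT, clear the
'commits', 'trees' and 'tags' members if only blobs are wanted, set 'revs',
'path_fn' and 'path_fn_data', and pass the structure to
walk_objects_by_path().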