Commit Graph

3060 Commits

Author SHA1 Message Date
Johannes Schindelin
ca318aa4a7 Add experimental 'git survey' builtin (#5174)
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft/git#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.
2024-12-17 12:04:02 +01:00
Johannes Schindelin
994196d4b4 Introduce 'git backfill' to get missing blobs in a partial clone (#5172)
This change introduces the `git backfill` command which uses the path
walk API to download missing blobs in a blobless partial clone.

By downloading blobs that correspond to the same file path at the same
time, we hope to maximize the potential benefits of delta compression
against multiple versions.

These downloads occur in a configurable batch size, presenting a
mechanism to perform "resumable" clones: `git clone --filter=blob:none`
gets the commits and trees, then `git backfill` will download all
missing blobs. If `git backfill` is interrupted partway through, it can
be restarted and will redownload only the missing objects.

When combining blobless partial clones with sparse-checkout, `git
backfill` will assume its `--sparse` option and download only the blobs
within the sparse-checkout. Users may want to do this as the repo size
will still be smaller than the full repo size, but commands like `git
blame` or `git log -L` will not suffer from many one-by-one blob
downloads.

Future directions should consider adding a pathspec or file prefix to
further focus which paths are being downloaded in a batch.
2024-12-17 12:04:02 +01:00
Derrick Stolee
fcf6f6f6d5 Add path walk API and its use in 'git pack-objects' (#5171)
This is a follow up to #5157 as well as motivated by the RFC in
gitgitgadget/git#1786.

We have ways of walking all objects, but it is focused on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
2024-12-17 12:04:01 +01:00
Johannes Schindelin
1775126925 pack-objects: create new name-hash algorithm (#5157)
This is an updated version of gitgitgadget/git#1785, intended for early
consumption into Git for Windows.

The idea here is to add a new `--full-name-hash` option to `git
pack-objects` and `git repack`. This adjusts the name-hash value used
for finding delta bases in such a way that uses the full path name with
a lower likelihood of collisions than the default name-hash algorithm.
In many repositories with name-hash collisions and many versions of
those paths, this can significantly reduce the size of a full repack. It
can also help in certain cases of `git push`, but only if the pack is
already artificially inflated by name-hash collisions; cases that find
"sibling" deltas as better choices become worse with `--full-name-hash`.

Thus, this option is currently recommended for full repacks of large
repos, and on client machines without reachability bitmaps.

Some care is taken to ignore this option when using bitmaps, either
writing bitmaps or using a bitmap walk during reads. The bitmap file
format contains name-hash values, but no way to indicate which function
is used, so compatibility is a concern for bitmaps. Future work could
explore this idea.

After this PR is merged, then the more-involved `--path-walk` option may
be considered.
2024-12-17 12:04:01 +01:00
Johannes Schindelin
83d4166cce Lazy load libcurl, allowing for an SSL/TLS backend-specific libcurl (#4410)
As per
https://github.com/git-for-windows/git/issues/4350#issuecomment-1485041503,
the major block for upgrading Git for Windows' OpenSSL from v1.1 to v3
is the tricky part where such an upgrade would break `git fetch`/`git
clone` and `git push` because the libcurl depends on the OpenSSL DLL,
and the major version bump will _change_ the file name of said DLL.

To overcome that, the plan is to build libcurl flavors for each
supported SSL/TLS backend, aligning with the way MSYS2 builds libcurl,
then switch Git for Windows' SDK to the Secure Channel-flavored libcurl,
and teach Git to look for the specific flavor of libcurl corresponding
to the `http.sslBackend` setting (if that was configured).

Here is the PR to teach Git that trick.
2024-12-17 12:03:48 +01:00
Johannes Schindelin
d35ad5344f Merge pull request #2974 from derrickstolee/maintenance-and-headless
Include Windows-specific maintenance and headless-git
2024-12-17 12:03:36 +01:00
Jeff Hostetler
38cc6ef4d7 survey: stub in new experimental 'git-survey' command
Start work on a new 'git survey' command to scan the repository
for monorepo performance and scaling problems.  The goal is to
measure the various known "dimensions of scale" and serve as a
foundation for adding additional measurements as we learn more
about Git monorepo scaling problems.

The initial goal is to complement the scanning and analysis performed
by the GO-based 'git-sizer' (https://github.com/github/git-sizer) tool.
It is hoped that by creating a builtin command, we may be able to take
advantage of internal Git data structures and code that is not
accessible from GO to gain further insight into potential scaling
problems.

Co-authored-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-12-17 00:44:39 +01:00
Derrick Stolee
92eaef30a7 backfill: add builtin boilerplate
In anticipation of implementing 'git backfill', populate the necessary files
with the boilerplate of a new builtin.

RFC TODO: When preparing this for a full implementation, make sure it is
based on the newest standards introduced by [1].

[1] https://lore.kernel.org/git/xmqqjzfq2f0f.fsf@gitster.g/T/#m606036ea2e75a6d6819d6b5c90e729643b0ff7f7
    [PATCH 1/3] builtin: add a repository parameter for builtin functions

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-12-17 00:43:41 +01:00
Derrick Stolee
51143f56d1 t6601: add helper for testing path-walk API
Add some tests based on the current behavior, doing interesting checks
for different sets of branches, ranges, and the --boundary option. This
sets a baseline for the behavior and we can extend it as new options are
introduced.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-12-17 00:43:38 +01:00
Derrick Stolee
dd1838d192 path-walk: introduce an object walk by path
In anticipation of a few planned applications, introduce the most basic form
of a path-walk API. It currently assumes that there are no UNINTERESTING
objects, and does not include any complicated filters. It calls a function
pointer on groups of tree and blob objects as grouped by path. This only
includes objects the first time they are discovered, so an object that
appears at multiple paths will not be included in two batches.

There are many future adaptations that could be made, but they are left for
future updates when consumers are ready to take advantage of those features.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-12-16 21:04:46 +01:00
Derrick Stolee
e9e1bd09ef test-tool: add helper for name-hash values
Add a new test-tool helper, name-hash, to output the value of the
name-hash algorithms for the input list of strings, one per line.

Since the name-hash values can be stored in the .bitmap files, it is
important that these hash functions do not change across Git versions.
Add a simple test to t5310-pack-bitmaps.sh to provide some testing of
the current values. Due to how these functions are implemented, it would
be difficult to change them without disturbing these values.

Create a performance test that uses test_size to demonstrate how
collisions occur for these hash algorithms. This test helps inform
someone as to the behavior of the name-hash algorithms for their repo
based on the paths at HEAD.

My copy of the Git repository shows modest statistics around the
collisions of the default name-hash algorithm:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                        4.5K
5314.2: number of distinct name-hashes                       4.1K
5314.3: number of distinct full-name-hashes                  4.5K
5314.4: maximum multiplicity of name-hashes                    13
5314.5: maximum multiplicity of fullname-hashes                 1

Here, the maximum collision multiplicity is 13, but around 10% of paths
have a collision with another path.

In a more interesting example, the microsoft/fluentui [1] repo had these
statistics at time of committing:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                       19.6K
5314.2: number of distinct name-hashes                       8.2K
5314.3: number of distinct full-name-hashes                 19.6K
5314.4: maximum multiplicity of name-hashes                   279
5314.5: maximum multiplicity of fullname-hashes                 1

[1] https://github.com/microsoft/fluentui

That demonstrates that of the nearly twenty thousand path names, they
are assigned around eight thousand distinct values. 279 paths are
assigned to a single value, leading the packing algorithm to sort
objects from those paths together, by size.

In this repository, no collisions occur for the full-name-hash
algorithm.

In a more extreme example, an internal monorepo had a much worse
collision rate:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                      221.6K
5314.2: number of distinct name-hashes                      72.0K
5314.3: number of distinct full-name-hashes                221.6K
5314.4: maximum multiplicity of name-hashes                 14.4K
5314.5: maximum multiplicity of fullname-hashes                 2

Even in this repository with many more paths at HEAD, the collision rate
was low and the maximum number of paths being grouped into a single
bucket by the full-path-name algorithm was two.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
2024-12-16 21:04:46 +01:00
Johannes Schindelin
b8bf94e542 http: support lazy-loading libcurl also on Windows
This implements the Windows-specific support code, because everything is
slightly different on Windows, even loading shared libraries.

Note: I specifically do _not_ use the code from
`compat/win32/lazyload.h` here because that code is optimized for
loading individual functions from various system DLLs, while we
specifically want to load _many_ functions from _one_ DLL here, and
distinctly not a system DLL (we expect libcurl to be located outside
`C:\Windows\system32`, something `INIT_PROC_ADDR` refuses to work with).
Also, the `curl_easy_getinfo()`/`curl_easy_setopt()` functions are
declared as vararg functions, which `lazyload.h` cannot handle. Finally,
we are about to optionally override the exact file name that is to be
loaded, which is a goal contrary to `lazyload.h`'s design.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-12-16 21:04:44 +01:00
Johannes Schindelin
7e9e7dbfc8 http: optionally load libcurl lazily
This compile-time option allows to ask Git to load libcurl dynamically
at runtime.

Together with a follow-up patch that optionally overrides the file name
depending on the `http.sslBackend` setting, this kicks open the door for
installing multiple libcurl flavors side by side, and load the one
corresponding to the (runtime-)configured SSL/TLS backend.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-12-16 21:04:44 +01:00
Jeff Hostetler
5ebaabe6a0 Makefile: clean up .ilk files when MSVC=1
Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
2024-12-16 20:58:17 +01:00
Johannes Schindelin
3f61f35355 mimalloc: offer a build-time option to enable it
By defining `USE_MIMALLOC`, Git can now be compiled with that
nicely-fast and small allocator.

Note that we have to disable a couple `DEVELOPER` options to build
mimalloc's source code, as it makes heavy use of declarations after
statements, among other things that disagree with Git's conventions.

We even have to silence some GCC warnings in non-DEVELOPER mode. For
example, the `-Wno-array-bounds` flag is needed because in `-O2` builds,
trying to call `NtCurrentTeb()` (which `_mi_thread_id()` does on
Windows) causes the bogus warning about a system header, likely related
to https://sourceforge.net/p/mingw-w64/mailman/message/37674519/ and to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578:

C:/git-sdk-64-minimal/mingw64/include/psdk_inc/intrin-impl.h:838:1:
        error: array subscript 0 is outside array bounds of 'long long unsigned int[0]' [-Werror=array-bounds]
  838 | __buildreadseg(__readgsqword, unsigned __int64, "gs", "q")
      | ^~~~~~~~~~~~~~

Also: The `mimalloc` library uses C11-style atomics, therefore we must
require that standard when compiling with GCC if we want to use
`mimalloc` (instead of requiring "only" C99). This is what we do in the
CMake definition already, therefore this commit does not need to touch
`contrib/buildsystems/`.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-12-16 20:47:09 +01:00
Johannes Schindelin
bead1b4470 Import the source code of mimalloc v2.1.2
This commit imports mimalloc's source code as per v2.1.2, fetched from
the tag at https://github.com/microsoft/mimalloc.

The .c files are from the src/ subdirectory, and the .h files from the
include/ and include/mimalloc/ subdirectories. We will subsequently
modify the source code to accommodate building within Git's context.

Since we plan on using the `mi_*()` family of functions, we skip the
C++-specific source code, some POSIX compliant functions to interact
with mimalloc, and the code that wants to support auto-magic overriding
of the `malloc()` function (mimalloc-new-delete.h, alloc-posix.c,
mimalloc-override.h, alloc-override.c, alloc-override-osx.c,
alloc-override-win.c and static.c).

To appease the `check-whitespace` job of Git's Continuous Integration,
this commit was washed one time via `git rebase --whitespace=fix`.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2024-12-16 20:47:09 +01:00
Junio C Hamano
29e5596eb8 Merge branch 'ps/build'
Build procedure update plus introduction of Meson based builds.

* ps/build: (24 commits)
  Introduce support for the Meson build system
  Documentation: add comparison of build systems
  t: allow overriding build dir
  t: better support for out-of-tree builds
  Documentation: extract script to generate a list of mergetools
  Documentation: teach "cmd-list.perl" about out-of-tree builds
  Documentation: allow sourcing generated includes from separate dir
  Makefile: simplify building of templates
  Makefile: write absolute program path into bin-wrappers
  Makefile: allow "bin-wrappers/" directory to exist
  Makefile: refactor generators to be PWD-independent
  Makefile: extract script to generate gitweb.js
  Makefile: extract script to generate gitweb.cgi
  Makefile: extract script to massage Python scripts
  Makefile: extract script to massage Shell scripts
  Makefile: use "generate-perl.sh" to massage Perl library
  Makefile: extract script to massage Perl scripts
  Makefile: consistently use PERL_PATH
  Makefile: generate doc versions via GIT-VERSION-GEN
  Makefile: generate "git.rc" via GIT-VERSION-GEN
  ...
2024-12-15 17:54:33 -08:00
Junio C Hamano
cd0a222f08 Merge branch 'es/oss-fuzz'
Backport oss-fuzz tests for us to our codebase.

* es/oss-fuzz:
  fuzz: port fuzz-url-decode-mem from OSS-Fuzz
  fuzz: port fuzz-parse-attr-line from OSS-Fuzz
  fuzz: port fuzz-credential-from-url-gently from OSS-Fuzz
2024-12-13 07:33:42 -08:00
Junio C Hamano
de9278127e Merge branch 'ps/reftable-detach'
Isolates the reftable subsystem from the rest of Git's codebase by
using fewer pieces of Git's infrastructure.

* ps/reftable-detach:
  reftable/system: provide thin wrapper for lockfile subsystem
  reftable/stack: drop only use of `get_locked_file_path()`
  reftable/system: provide thin wrapper for tempfile subsystem
  reftable/stack: stop using `fsync_component()` directly
  reftable/system: stop depending on "hash.h"
  reftable: explicitly handle hash format IDs
  reftable/system: move "dir.h" to its only user
2024-12-10 10:04:56 +09:00
Patrick Steinhardt
7e0730c8ba t: better support for out-of-tree builds
Our in-tree builds used by the Makefile use various different build
directories scattered around different locations. The paths to those
build directories have to be propagated to our tests such that they can
find the contained files. This is done via a mixture of hardcoded paths
in our test library and injected variables in our bin-wrappers or
"GIT-BUILD-OPTIONS".

The latter two mechanisms are preferable over using hardcoded paths. For
one, we have all paths which are subject to change stored in a small set
of central files instead of having the knowledge of build paths in many
files. And second, it allows build systems which build files elsewhere
to adapt those paths based on their own needs. This is especially nice
in the context of build systems that use out-of-tree builds like CMake
or Meson.

Remove hardcoded knowledge of build paths from our test library and move
it into our bin-wrappers and "GIT-BUILD-OPTIONS".

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-07 07:52:13 +09:00
Patrick Steinhardt
d2407bb8dc Makefile: write absolute program path into bin-wrappers
Write the absolute program path into our bin-wrappers. This allows us to
simplify the Meson build instructions we are about to introduce a bit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-07 07:52:12 +09:00
Patrick Steinhardt
95bcd6f0b7 Makefile: allow "bin-wrappers/" directory to exist
The "bin-wrappers/" directory gets created by our build system and is
populated with one script for each of our binaries. There isn't anything
inherently wrong with the current layout, but it is somewhat hard to
adapt for out-of-tree build systems.

Adapt the layout such that our "bin-wrappers/" directory always exists
and contains our "wrap-for-bin.sh" script to make things a little bit
easier for subsequent steps.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-07 07:52:11 +09:00
Patrick Steinhardt
3f145a4fe3 Makefile: refactor generators to be PWD-independent
We have multiple scripts that generate headers from other data. All of
these scripts have the assumption built-in that they are executed in the
current source directory, which makes them a bit unwieldy to use during
out-of-tree builds.

Refactor them to instead take the source directory as well as the output
file as arguments.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-07 07:52:11 +09:00
Patrick Steinhardt
b7835b941b Makefile: extract script to massage Python scripts
Extract a script that massages Python scripts. This provides a couple of
benefits:

  - The build logic is deduplicated across Make, CMake and Meson.

  - CMake learns to rewrite scripts as-needed at build time instead of
    only writing them at configure time.

Furthermore, we will use this script when introducing Meson to
deduplicate the logic across build systems.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-07 07:52:10 +09:00
Patrick Steinhardt
eb98cb835c Makefile: extract script to massage Shell scripts
Same as in the preceding commits, extract a script that allows us to
unify how we massage shell scripts.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-07 07:52:10 +09:00
Patrick Steinhardt
ccfba9e0c4 Makefile: use "generate-perl.sh" to massage Perl library
Extend "generate-perl.sh" such that it knows to also massage the Perl
library files. There are two major differences:

  - We do not read in the Perl header. This is handled by matching on
    whether or not we have a Perl shebang.

  - We substitute some more variables, which we read in via our
    GIT-BUILD-OPTIONS.

Adapt both our Makefile and the CMake build instructions to use this.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-07 07:52:10 +09:00
Patrick Steinhardt
e4b488049a Makefile: extract script to massage Perl scripts
Extract the script to inject various build-time parameters into our Perl
scripts into a standalone script. This is done such that we can reuse it
in other build systems.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-07 07:52:09 +09:00
Patrick Steinhardt
c2a3b847ed Makefile: consistently use PERL_PATH
When injecting the Perl path into our scripts we sometimes use '@PERL@'
while we othertimes use '@PERL_PATH@'. Refactor the code use the latter
consistently, which makes it easier to reuse the same logic for multiple
scripts.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-07 07:52:09 +09:00
Patrick Steinhardt
9bb10d27e7 Makefile: generate "git.rc" via GIT-VERSION-GEN
The "git.rc" is used on Windows to embed information like the project
name and version into the resulting executables. As such we need to
inject the version information, which we do by using preprocessor
defines. The logic to do so is non-trivial and needs to be kept in sync
with the different build systems.

Refactor the logic so that we generate "git.rc" via `GIT-VERSION-GEN`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-07 07:52:09 +09:00
Patrick Steinhardt
0c8d339514 Makefile: propagate Git version via generated header
We set up a couple of preprocessor macros when compiling Git that
propagate the version that Git was built from to `git version` et al.
The way this is set up makes it harder than necessary to reuse the
infrastructure across the different build systems.

Refactor this such that we generate a "version-def.h" header via
`GIT-VERSION-GEN` instead.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-07 07:52:08 +09:00
Patrick Steinhardt
4838deab65 Makefile: refactor GIT-VERSION-GEN to be reusable
Our "GIT-VERSION-GEN" script always writes the "GIT-VERSION-FILE" into
the current directory, where the expectation is that it should exist in
the source directory. But other build systems that support out-of-tree
builds may not want to do that to keep the source directory pristine,
even though CMake currently doesn't care.

Refactor the script such that it won't write the "GIT-VERSION-FILE"
directly anymore, but instead knows to replace @PLACEHOLDERS@ in an
arbitrary input file. This allows us to simplify the logic in CMake to
determine the project version, but can also be reused later on in order
to generate other files that need to contain version information like
our "git.rc" file.

While at it, change the format of the version file by removing the
spaces around the equals sign. Like this we can continue to include the
file in our Makefiles, but can also start to source it in shell scripts
in subsequent steps.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-07 07:52:08 +09:00
Patrick Steinhardt
dbe46c0feb Makefile: consistently use @PLACEHOLDER@ to substitute
We have a bunch of placeholders in our scripts that we replace at build
time, for example by using sed(1). These placeholders come in three
different formats: @PLACEHOLDER@, @@PLACEHOLDER@@ and ++PLACEHOLDER++.

Next to being inconsistent it also creates a bit of a problem with
CMake, which only supports the first syntax in its `configure_file()`
function. To work around that we instead manually replace placeholders
via string operations, which is a hassle and removes safeguards that
CMake has to verify that we didn't forget to replace any placeholders.
Besides that, other build systems like Meson also support the CMake
syntax.

Unify our codebase to consistently use the syntax supported by such
build systems.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-07 07:52:08 +09:00
Patrick Steinhardt
4638e8806e Makefile: use common template for GIT-BUILD-OPTIONS
The "GIT-BUILD-OPTIONS" file is generated by our build systems to
propagate built-in features and paths to our tests. The generation is
done ad-hoc, where both our Makefile and the CMake build instructions
simply echo a bunch of strings into the file. This makes it very hard to
figure out what variables are expected to exist and what format they
have, and the written variables can easily get out of sync between build
systems.

Introduce a new "GIT-BUILD-OPTIONS.in" template to address this issue.
This has multiple advantages:

  - It demonstrates which built options exist in the first place.

  - It can serve as a spot to document the build options.

  - Some build systems complain when not all variables could be
    substituted, alerting us of mismatches. Others don't, but if we
    forgot to substitute such variables we now have a bogus string that
    will likely cause our tests to fail, if they have any meaning in the
    first place.

Backfill values that we didn't yet set in our CMake build instructions.
While at it, remove the `SUPPORTS_SIMPLE_IPC` variable that we only set
up in CMake as it isn't used anywhere.

This change requires us to adapt the setup of TEST_OUTPUT_DIRECTORY in
"test-lib.sh" such that it does not get overwritten after sourcing when
it has been set up via the environment. This is the only instance I
could find where we rely on ordering on variables.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-07 07:52:08 +09:00
Patrick Steinhardt
01e49941d6 reftable/system: provide thin wrapper for tempfile subsystem
We use the tempfile subsystem to write temporary tables, but given that
we're in the process of converting the reftable library to become
standalone we cannot use this subsystem directly anymore. While we could
in theory convert the code to use mkstemp(3p) instead, we'd lose access
to our infrastructure that automatically prunes tempfiles via atexit(3p)
or signal handlers.

Provide a thin wrapper for the tempfile subsystem instead. Like this,
the compatibility shim is fully self-contained in "reftable/system.c".
Downstream users of the reftable library would have to implement their
own tempfile shims by replacing "system.c" with a custom version.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-19 12:23:10 +09:00
Patrick Steinhardt
5dac35bbde Makefile: let clar header targets depend on their scripts
The targets that generate clar headers depend on their source files, but
not on the script that is actually generating the output. Fix the issue
by adding the missing dependencies.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-18 09:59:26 +09:00
Patrick Steinhardt
9a91ab9400 t/unit-tests: convert "clar-generate.awk" into a shell script
Convert "clar-generate.awk" into a shell script that invokes awk(1).
This allows us to avoid the shell redirect in the build system, which
may otherwise be a problem with build systems on platforms that use a
different shell.

While at it, wrap the overly long lines in the CMake build instructions.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-18 09:59:25 +09:00
Junio C Hamano
1ee7dbde67 Merge branch 'ps/upgrade-clar'
Buildfix and upgrade of Clar to a newer version.

* ps/upgrade-clar:
  cmake: set up proper dependencies for generated clar headers
  cmake: fix compilation of clar-based unit tests
  Makefile: extract script to generate clar declarations
  Makefile: adjust sed command for generating "clar-decls.h"
  t/unit-tests: update clar to 206accb
2024-11-08 12:56:28 +09:00
Taylor Blau
268fd2fe58 Merge branch 'ps/platform-compat-fixes'
Various platform compatibility fixes split out of the larger effort
to use Meson as the primary build tool.

* ps/platform-compat-fixes:
  t6006: fix prereq handling with `test_format ()`
  http: fix build error on FreeBSD
  builtin/credential-cache: fix missing parameter for stub function
  t7300: work around platform-specific behaviour with long paths on MinGW
  t5500, t5601: skip tests which exercise paths with '[::1]' on Cygwin
  t3404: work around platform-specific behaviour on macOS 10.15
  t1401: make invocation of tar(1) work with Win32-provided one
  t/lib-gpg: fix setup of GNUPGHOME in MinGW
  t/lib-gitweb: test against the build version of gitweb
  t/test-lib: wire up NO_ICONV prerequisite
  t/test-lib: fix quoting of TEST_RESULTS_SAN_FILE
2024-11-01 12:53:17 -04:00
Taylor Blau
dca32b8288 Merge branch 'pb/clar-build-fix'
Build fix.

* pb/clar-build-fix:
  Makefile: fix dependency for $(UNIT_TEST_DIR)/clar/clar.o
2024-10-25 14:02:25 -04:00
Patrick Steinhardt
67f75dfe1b Makefile: extract script to generate clar declarations
Extract the script to generate function declarations for the clar unit
testing framework into a standalone script. This is done such that we
can reuse it in other build systems.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-21 16:53:07 -04:00
Alejandro R. Sedeño
a779c8e8d5 Makefile: adjust sed command for generating "clar-decls.h"
This moves the end-of-line marker out of the captured group, matching
the start-of-line marker and for some reason fixing generation of
"clar-decls.h" on some older, more esoteric platforms.

Signed-off-by: Alejandro R. Sedeño <asedeno@mit.edu>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-21 16:53:07 -04:00
Eric Sesterhenn
751d063f27 fuzz: port fuzz-url-decode-mem from OSS-Fuzz
Git's fuzz tests are run continuously as part of OSS-Fuzz [1]. Several
additional fuzz tests have been contributed directly to OSS-Fuzz;
however, these tests are vulnerable to bitrot because they are not built
during Git's CI runs, and thus breaking changes are much less likely to
be noticed by Git contributors.

Port one of these tests back to the Git project:
fuzz-url-decode-mem

This test was originally written by Eric Sesterhenn as part of a
security audit of Git [2]. It was then contributed to the OSS-Fuzz repo
in commit c58ac4492 (Git fuzzing: uncomment the existing and add new
targets. (#11486), 2024-02-21) by Jaroslav Lobačevski. I (Josh Steadmon)
have verified with both Eric and Jaroslav that they're OK with moving
this test to the Git project.

[1] https://github.com/google/oss-fuzz
[2] https://ostif.org/wp-content/uploads/2023/01/X41-OSTIF-Gitlab-Git-Security-Audit-20230117-public.pdf

Co-authored-by: Jaroslav Lobačevski <jarlob@gmail.com>
Co-authored-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-16 18:14:11 -04:00
Eric Sesterhenn
72686d4e5e fuzz: port fuzz-parse-attr-line from OSS-Fuzz
Git's fuzz tests are run continuously as part of OSS-Fuzz [1]. Several
additional fuzz tests have been contributed directly to OSS-Fuzz;
however, these tests are vulnerable to bitrot because they are not built
during Git's CI runs, and thus breaking changes are much less likely to
be noticed by Git contributors.

Port one of these tests back to the Git project:
fuzz-parse-attr-line

This test was originally written by Eric Sesterhenn as part of a
security audit of Git [2]. It was then contributed to the OSS-Fuzz repo
in commit c58ac4492 (Git fuzzing: uncomment the existing and add new
targets. (#11486), 2024-02-21) by Jaroslav Lobačevski. I (Josh Steadmon)
have verified with both Eric and Jaroslav that they're OK with moving
this test to the Git project.

[1] https://github.com/google/oss-fuzz
[2] https://ostif.org/wp-content/uploads/2023/01/X41-OSTIF-Gitlab-Git-Security-Audit-20230117-public.pdf

Co-authored-by: Jaroslav Lobačevski <jarlob@gmail.com>
Co-authored-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-16 18:14:11 -04:00
Eric Sesterhenn
966253db75 fuzz: port fuzz-credential-from-url-gently from OSS-Fuzz
Git's fuzz tests are run continuously as part of OSS-Fuzz [1]. Several
additional fuzz tests have been contributed directly to OSS-Fuzz;
however, these tests are vulnerable to bitrot because they are not built
during Git's CI runs, and thus breaking changes are much less likely to
be noticed by Git contributors.

Port one of these tests back to the Git project:
fuzz-credential-from-url-gently

This test was originally written by Eric Sesterhenn as part of a
security audit of Git [2]. It was then contributed to the OSS-Fuzz repo
in commit c58ac4492 (Git fuzzing: uncomment the existing and add new
targets. (#11486), 2024-02-21) by Jaroslav Lobačevski. I (Josh Steadmon)
have verified with both Eric and Jaroslav that they're OK with moving
this test to the Git project.

[1] https://github.com/google/oss-fuzz
[2] https://ostif.org/wp-content/uploads/2023/01/X41-OSTIF-Gitlab-Git-Security-Audit-20230117-public.pdf

Co-authored-by: Jaroslav Lobačevski <jarlob@gmail.com>
Co-authored-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-16 18:14:11 -04:00
Patrick Steinhardt
df383b5842 t/test-lib: wire up NO_ICONV prerequisite
The iconv library is used by Git to reencode files, commit messages and
other things. As such it is a rather integral part, but given that many
platforms nowadays use UTF-8 everywhere you can live without support for
reencoding in many situations. It is thus optional to build Git with
iconv, and some of our platforms wired up in "config.mak.uname" disable
it. But while we support building without it, running our test suite
with "NO_ICONV=Yes" causes many test failures.

Wire up a new test prerequisite ICONV that gets populated via our
GIT-BUILD-OPTIONS. Annotate failing tests accordingly.

Note that this commit does not do a deep dive into every single test to
assess whether the failure is expected or not. Most of the tests do
smell like the expected kind of failure though.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-16 17:00:49 -04:00
Philippe Blain
ea3422662d Makefile: fix dependency for $(UNIT_TEST_DIR)/clar/clar.o
The clar source file '$(UNIT_TEST_DIR)/clar/clar.c' includes the
generated 'clar.suite', but this dependency is not taken into account by
our Makefile, so that it is possible for a parallel build to fail if
Make tries to build 'clar.o' before 'clar.suite' is generated.

Correctly specify the dependency.

Signed-off-by: Philippe Blain <levraiphilippeblain@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-10-11 11:08:08 -07:00
Junio C Hamano
5575c713c2 Merge branch 'ps/reftable-alloc-failures'
The reftable library is now prepared to expect that the memory
allocation function given to it may fail to allocate and to deal
with such an error.

* ps/reftable-alloc-failures: (26 commits)
  reftable/basics: fix segfault when growing `names` array fails
  reftable/basics: ban standard allocator functions
  reftable: introduce `REFTABLE_FREE_AND_NULL()`
  reftable: fix calls to free(3P)
  reftable: handle trivial allocation failures
  reftable/tree: handle allocation failures
  reftable/pq: handle allocation failures when adding entries
  reftable/block: handle allocation failures
  reftable/blocksource: handle allocation failures
  reftable/iter: handle allocation failures when creating indexed table iter
  reftable/stack: handle allocation failures in auto compaction
  reftable/stack: handle allocation failures in `stack_compact_range()`
  reftable/stack: handle allocation failures in `reftable_new_stack()`
  reftable/stack: handle allocation failures on reload
  reftable/reader: handle allocation failures in `reader_init_iter()`
  reftable/reader: handle allocation failures for unindexed reader
  reftable/merged: handle allocation failures in `merged_table_init_iter()`
  reftable/writer: handle allocation failures in `reftable_new_writer()`
  reftable/writer: handle allocation failures in `writer_index_hash()`
  reftable/record: handle allocation failures when decoding records
  ...
2024-10-10 14:22:25 -07:00
Patrick Steinhardt
a5a15a4514 reftable/basics: merge "publicbasics" into "basics"
The split between "basics" and "publicbasics" is somewhat arbitrary and
not in line with how we typically structure code in the reftable
library. While we do indeed split up headers into a public and internal
part, we don't do that for the compilation unit itself. Furthermore, the
declarations for "publicbasics.c" are in "reftable-malloc.h", which
isn't in line with our naming schema, either.

Fix these inconsistencies by:

  - Merging "publicbasics.c" into "basics.c".

  - Renaming "reftable-malloc.h" to "reftable-basics.h" as the public
    header.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-10-02 07:53:51 -07:00
Junio C Hamano
ead0a050e2 Merge branch 'tb/weak-sha1-for-tail-sum'
The checksum at the tail of files are now computed without
collision detection protection.  This is safe as the consumer of
the information to protect itself from replay attacks checks for
hash collisions independently.

* tb/weak-sha1-for-tail-sum:
  csum-file.c: use unsafe SHA-1 implementation when available
  Makefile: allow specifying a SHA-1 for non-cryptographic uses
  hash.h: scaffolding for _unsafe hashing variants
  sha1: do not redefine `platform_SHA_CTX` and friends
  pack-objects: use finalize_object_file() to rename pack/idx/etc
  finalize_object_file(): implement collision check
  finalize_object_file(): refactor unlink_or_warn() placement
  finalize_object_file(): check for name collision before renaming
2024-10-02 07:46:27 -07:00
Taylor Blau
06c92dafb8 Makefile: allow specifying a SHA-1 for non-cryptographic uses
Introduce _UNSAFE variants of the OPENSSL_SHA1, BLK_SHA1, and
APPLE_COMMON_CRYPTO_SHA1 compile-time knobs which indicate which SHA-1
implementation is to be used for non-cryptographic uses.

There are a couple of small implementation notes worth mentioning:

  - There is no way to select the collision detecting SHA-1 as the
    "fast" fallback, since the fast fallback is only for
    non-cryptographic uses, and is meant to be faster than our
    collision-detecting implementation.

  - There are no similar knobs for SHA-256, since no collision attacks
    are presently known and thus no collision-detecting implementations
    actually exist.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-27 11:27:47 -07:00