The 'git survey' builtin provides several detail tables, such as "top
files by on-disk size". The size of these tables defaults to 10,
currently.
Allow the user to specify this number via a new --top=<N> option or the
new survey.top config key.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Since we are already walking our reachable objects using the path-walk API,
let's now collect lists of the paths that contribute most to different
metrics. Specifically, we care about
* Number of versions.
* Total size on disk.
* Total inflated size (no delta or zlib compression).
This information can be critical to discovering which parts of the
repository are causing the most growth, especially on-disk size. Different
packing strategies might help compress data more efficiently, but the toal
inflated size is a representation of the raw size of all snapshots of those
paths. Even when stored efficiently on disk, that size represents how much
information must be processed to complete a command such as 'git blame'.
The exact disk size seems to be not quite robust enough for testing, as
could be seen by the `linux-musl-meson` job consistently failing, possibly
because of zlib-ng deflates differently: t8100.4(git survey
(default)) was failing with a symptom like this:
TOTAL OBJECT SIZES BY TYPE
===============================================
Object Type | Count | Disk Size | Inflated Size
------------+-------+-----------+--------------
- Commits | 10 | 1523 | 2153
+ Commits | 10 | 1528 | 2153
Trees | 10 | 495 | 1706
Blobs | 10 | 191 | 101
- Tags | 4 | 510 | 528
+ Tags | 4 | 547 | 528
This means: the disk size is unlikely something we can verify robustly.
Since zlib-ng seems to increase the disk size of the tags from 528 to
547, we cannot even assume that the disk size is always smaller than the
inflated size. We will most likely want to either skip verifying the
disk size altogether, or go for some kind of fuzzy matching, say, by
replacing `s/ 1[45][0-9][0-9] / ~1.5k /` and `s/ [45][0-9][0-9] / ~½k /`
or something like that.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
In future changes, we will make use of these methods. The intention is to
keep track of the top contributors according to some metric. We don't want
to store all of the entries and do a sort at the end, so track a
constant-size table and remove rows that get pushed out depending on the
chosen sorting algorithm.
Co-authored-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by; Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Now that we have explored objects by count, we can expand that a bit more to
summarize the data for the on-disk and inflated size of those objects. This
information is helpful for diagnosing both why disk space (and perhaps
clone or fetch times) is growing but also why certain operations are slow
because the inflated size of the abstract objects that must be processed is
so large.
Note: zlib-ng is slightly more efficient even at those small sizes. Even
between zlib versions, there are slight differences in compression. To
accommodate for that in the tests, not the exact numbers but some rough
approximations are validated (the test should validate `git survey`,
after all, not zlib).
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
At the moment, nothing is obvious about the reason for the use of the
path-walk API, but this will become more prevelant in future iterations. For
now, use the path-walk API to sum up the counts of each kind of object.
For example, this is the reachable object summary output for my local repo:
REACHABLE OBJECT SUMMARY
========================
Object Type | Count
------------+-------
Tags | 1343
Commits | 179344
Trees | 314350
Blobs | 184030
Signed-off-by: Derrick Stolee <stolee@gmail.com>
When 'git survey' provides information to the user, this will be presented
in one of two formats: plaintext and JSON. The JSON implementation will be
delayed until the functionality is complete for the plaintext format.
The most important parts of the plaintext format are headers specifying the
different sections of the report and tables providing concreted data.
Create a custom table data structure that allows specifying a list of
strings for the row values. When printing the table, check each column for
the maximum width so we can create a table of the correct size from the
start.
The table structure is designed to be flexible to the different kinds of
output that will be implemented in future changes.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
By default we will scan all references in "refs/heads/", "refs/tags/"
and "refs/remotes/".
Add command line opts let the use ask for all refs or a subset of them
and to include a detached HEAD.
Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Start work on a new 'git survey' command to scan the repository
for monorepo performance and scaling problems. The goal is to
measure the various known "dimensions of scale" and serve as a
foundation for adding additional measurements as we learn more
about Git monorepo scaling problems.
The initial goal is to complement the scanning and analysis performed
by the GO-based 'git-sizer' (https://github.com/github/git-sizer) tool.
It is hoped that by creating a builtin command, we may be able to take
advantage of internal Git data structures and code that is not
accessible from GO to gain further insight into potential scaling
problems.
Co-authored-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
The sparse tree walk algorithm was created in d5d2e93577e (revision:
implement sparse algorithm, 2019-01-16) and involves using the
mark_trees_uninteresting_sparse() method. This method takes a repository
and an oidset of tree IDs, some of which have the UNINTERESTING flag and
some of which do not.
Create a method that has an equivalent set of preconditions but uses a
"dense" walk (recursively visits all reachable trees, as long as they
have not previously been marked UNINTERESTING). This is an important
difference from mark_tree_uninteresting(), which short-circuits if the
given tree has the UNINTERESTING flag.
A use of this method will be added in a later change, with a condition
set whether the sparse or dense approach should be used.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
As reported in https://lore.kernel.org/git/ZuPKvYP9ZZ2mhb4m@pks.im/,
libcurl v8.10.0 had a regression that was picked up by Git's t5559.30
"large fetch-pack requests can be sent using chunked encoding".
This bug was fixed in libcurl v8.10.1.
Sadly, the macos-13 runner image was updated in the brief window between
these two libcurl versions, breaking each and every CI build, as
reported at https://github.com/git-for-windows/git/issues/5159.
This would usually not matter, we would just ignore the failing CI
builds until the macos-13 runner image is rebuilt in a couple of days,
and then the CI builds would succeed again.
However.
As has become the custom, a surprise Git version was released, and now
that Git for Windows wants to follow suit, since Git for Windows has
this custom of trying to never release a version with a failing CI
build, we _must_ work around it.
This patch implements this work-around, basically for the sake of Git
for Windows v2.46.2's CI build.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
These fixes have been sent to the Git mailing list but have not been
picked up by the Git project yet.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
In some implementations, `regexec_buf()` assumes that it is fed lines;
Without `REG_NOTEOL` it thinks the end of the buffer is the end of a
line. Which makes sense, but trips up this case because we are not
feeding lines, but rather a whole buffer. So the final newline is not
the start of an empty line, but the true end of the buffer.
This causes an interesting bug:
$ echo content >file.txt
$ git grep --no-index -n '^$' file.txt
file.txt:2:
This bug is fixed by making the end of the buffer consistently the end
of the final line.
The patch was applied from
https://lore.kernel.org/git/20250113062601.GD767856@coredump.intra.peff.net/
Reported-by: Olly Betts <olly@survex.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This addresses:
- CVE-2024-52005:
Insufficient neutralization of ANSI escape sequences in sideband
payload can be used to mislead Git users into believing that
certain remote-generated messages actually originate from Git.
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
When a Unix socket is initialized, the current directory's path is
stored so that the cleanup code can `chdir()` back to where it was
before exit.
If the path that needs to be stored exceeds the default size of the
`sun_path` attribute of `struct sockaddr_un` (which is defined as a
108-sized byte array on Linux), a larger buffer needs to be allocated so
that it can hold the path, and it is the responsibility of the
`unix_sockaddr_cleanup()` function to release that allocated memory.
In Git's CI, this stack allocation is not necessary because the code is
checked out to `/home/runner/work/git/git`. Concatenate the path
`t/trash directory.t0301-credential-cache/.cache/git/credential/socket`
and a terminating NUL, and you end up with 96 bytes, 12 shy of the
default `sun_path` size.
However, I use worktrees with slightly longer paths:
`/home/me/projects/git/yes/i/nest/worktrees/to/organize/them/` is more
in line with what I have. When I recently tried to locally reproduce a
failure of the `linux-leaks` CI job, this t0301 test failed (where it
had not failed in CI).
The reason: When `credential-cache` tries to reach its daemon initially
by calling `unix_sockaddr_init()`, it is expected that the daemon cannot
be reached (the idea is to spin up the daemon in that case and try
again). However, when this first call to `unix_sockaddr_init()` fails,
the code returns early from the `unix_stream_connect()` function
_without_ giving the cleanup code a chance to run, skipping the
deallocation of above-mentioned path.
The fix is easy: do not return early but instead go directly to the
cleanup code.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The preceding two commits introduced special handling of the sideband
channel to neutralize ANSI escape sequences before sending the payload
to the terminal, and `sideband.allowControlCharacters` to override that
behavior.
However, some `pre-receive` hooks that are actively used in practice
want to color their messages and therefore rely on the fact that Git
passes them through to the terminal.
In contrast to other ANSI escape sequences, it is highly unlikely that
coloring sequences can be essential tools in attack vectors that mislead
Git users e.g. by hiding crucial information.
Therefore we can have both: Continue to allow ANSI coloring sequences to
be passed to the terminal, and neutralize all other ANSI escape
sequences.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The preceding commit fixed the vulnerability whereas sideband messages
(that are under the control of the remote server) could contain ANSI
escape sequences that would be sent to the terminal verbatim.
However, this fix may not be desirable under all circumstances, e.g.
when remote servers deliberately add coloring to their messages to
increase their urgency.
To help with those use cases, give users a way to opt-out of the
protections: `sideband.allowControlCharacters`.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The output of `git clone` is a vital component for understanding what
has happened when things go wrong. However, these logs are partially
under the control of the remote server (via the "sideband", which
typically contains what the remote `git pack-objects` process sends to
`stderr`), and is currently not sanitized by Git.
This makes Git susceptible to ANSI escape sequence injection (see
CWE-150, https://cwe.mitre.org/data/definitions/150.html), which allows
attackers to corrupt terminal state, to hide information, and even to
insert characters into the input buffer (i.e. as if the user had typed
those characters).
To plug this vulnerability, disallow any control character in the
sideband, replacing them instead with the common `^<letter/symbol>`
(e.g. `^[` for `\x1b`, `^A` for `\x01`).
There is likely a need for more fine-grained controls instead of using a
"heavy hammer" like this, which will be introduced subsequently.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
* 'jx/zh_CN' of github.com:jiangxin/git:
l10n: zh_CN: standardize glossary terms
l10n: zh_CN: updated translation for 2.53
l10n: zh_CN: fix inconsistent use of standard vs. wide colons
Add preferred Chinese terminology notes and align existing translations
to the updated glossary. AI-assisted review was used to check and
improve legacy translations.
Signed-off-by: Jiang Xin <worldhello.net@gmail.com>
Replace mixed usage of standard (ASCII) colons ':' with full-width
(wide) colons ':' in Chinese translations to ensure typographic
consistency, as reported by CAESIUS-TIM [1].
Full-width punctuation is preferred in Chinese localization for better
readability and adherence to typesetting conventions.
[1]: https://github.com/git-l10n/git-po/issues/884
Signed-off-by: Jiang Xin <worldhello.net@gmail.com>
Originally introduced as `core.useBuiltinFSMonitor` in Git for Windows
and developed, improved and stabilized there, the built-in FSMonitor
only made it into upstream Git (after unnecessarily long hemming and
hawing and throwing overly perfectionist style review sticks into the
spokes) as `core.fsmonitor = true`.
In Git for Windows, with this topic branch, we re-introduce the
now-obsolete config setting, with warnings suggesting to existing users
how to switch to the new config setting, with the intention to
ultimately drop the patch at some stage.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This topic branch re-adds the deprecated --stdin/-z options to `git
reset`. Those patches were overridden by a different set of options in
the upstream Git project before we could propose `--stdin`.
We offered this in MinGit to applications that wanted a safer way to
pass lots of pathspecs to Git, and these applications will need to be
adjusted.
Instead of `--stdin`, `--pathspec-from-file=-` should be used, and
instead of `-z`, `--pathspec-file-nul`.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
A fix for calling `vim` in Windows Terminal caused a regression and was
reverted. We partially un-revert this, to get the fix again.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This is the recommended way on GitHub to describe policies revolving around
security issues and about supported versions.
Helped-by: Sven Strickroth <email@cs-ware.de>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Reintroduce the 'core.useBuiltinFSMonitor' config setting (originally added
in 0a756b2a25 (fsmonitor: config settings are repository-specific,
2021-03-05)) after its removal from the upstream version of FSMonitor.
Upstream, the 'core.useBuiltinFSMonitor' setting was rendered obsolete by
"overloading" the 'core.fsmonitor' setting to take a boolean value. However,
several applications (e.g., 'scalar') utilize the original config setting,
so it should be preserved for a deprecation period before complete removal:
* if 'core.fsmonitor' is a boolean, the user is correctly using the new
config syntax; do not use 'core.useBuiltinFSMonitor'.
* if 'core.fsmonitor' is unspecified, use 'core.useBuiltinFSMonitor'.
* if 'core.fsmonitor' is a path, override and use the builtin FSMonitor if
'core.useBuiltinFSMonitor' is 'true'; otherwise, use the FSMonitor hook
indicated by the path.
Additionally, for this deprecation period, advise users to switch to using
'core.fsmonitor' to specify their use of the builtin FSMonitor.
Signed-off-by: Victoria Dye <vdye@github.com>
Git for Windows accepts pull requests; Core Git does not. Therefore we
need to adjust the template (because it only matches core Git's
project management style, not ours).
Also: direct Git for Windows enhancements to their contributions page,
space out the text for easy reading, and clarify that the mailing list
is plain text, not HTML.
Signed-off-by: Philip Oakley <philipoakley@iee.org>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Rather than using private IFTTT Applets that send mails to this
maintainer whenever a new version of a Git for Windows component was
released, let's use the power of GitHub workflows to make this process
publicly visible.
This workflow monitors the Atom/RSS feeds, and opens a ticket whenever a
new version was released.
Note: Bash sometimes releases multiple patched versions within a few
minutes of each other (i.e. 5.1p1 through 5.1p4, 5.0p15 and 5.0p16). The
MSYS2 runtime also has a similar system. We can address those patches as
a group, so we shouldn't get multiple issues about them.
Note further: We're not acting on newlib releases, OpenSSL alphas, Perl
release candidates or non-stable Perl releases. There's no need to open
issues about them.
Co-authored-by: Matthias Aßhauer <mha1993@live.de>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This patch introduces support to set special NTFS attributes that are
interpreted by the Windows Subsystem for Linux as file mode bits, UID
and GID.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
With improvements by Clive Chan, Adric Norris, Ben Bodenmiller and
Philip Oakley.
Helped-by: Clive Chan <cc@clive.io>
Helped-by: Adric Norris <landstander668@gmail.com>
Helped-by: Ben Bodenmiller <bbodenmiller@hotmail.com>
Helped-by: Philip Oakley <philipoakley@iee.org>
Signed-off-by: Brendan Forster <brendan@github.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>