This upgrades to [check-spelling v0.0.24].
A number of GitHub APIs are being turned off shortly, so we need to
upgrade or various uncertain outcomes will occur.
There are some minor bugs that I'm aware of and which I've fixed since
this release (including a couple I discovered while preparing this PR).
There's a new accessibility forbidden pattern:
#### Should be `cannot` (or `can't`)
See https://www.grammarly.com/blog/cannot-or-can-not/
> Don't use `can not` when you mean `cannot`. The only time you're
likely to see `can not` written as separate words is when the word `can`
happens to precede some other phrase that happens to start with `not`.
> `Can't` is a contraction of `cannot`, and it's best suited for
informal writing.
> In formal writing and where contractions are frowned upon, use
`cannot`.
> It is possible to write `can not`, but you generally find it only as
part of some other construction, such as `not only . . . but also.`
- if you encounter such a case, add a pattern for that case to
patterns.txt.
```
\b[Cc]an not\b
```
[check-spelling v0.0.24]: https://github.com/check-spelling/check-spelling/releases/tag/v0.0.24
Signed-off-by: Josh Soref <2119212+jsoref@users.noreply.github.com>
My long-term plan is to replace the `CodepointWidth` enum with a simple integer
return value that indicates the amount of columns a codepoint is wide.
This is necessary so that we can return 0 for ZWJs (zero width joiners).
This initial commit represents a cleanup effort around `CodepointWidthDetector`.
Since less code runs faster, this change has the nice side-effect of running
roughly 5-10% faster across the board. It also drops the binary size by ~1.2kB.
## Validation Steps Performed
* `CodepointWidthDetectorTests` passes ✅
* U+26bf (``"`u{26bf}"`` inside pwsh) is a wide glyph
in OpenConsole and narrow one in Windows Terminal ✅
This commit also adds an override UCD and migrates all of the overrides
from GetQuickCharWidth into it.
GetQuickCharWidth
-----------------
The removal of overrides from GQCW reduces the number of comparisons
required for looking up a single character's width from 41 (32
individual ranged comparisons from GQCW + 8+1 from the binary search in
CPWD) to 11 (2 from GQCW, 8+1 from CPWD).
GQCW also incorrectly marked 67 reserved codepoints as `Wide` when they
should have been `Narrow`.
The codepoints whose definitions have changed from `Wide` to `Narrow` are:
```
2E9A 2EF4 2EF5 2EF6 2EF7 2EF8 2EF9 2EFA 2EFB 2EFC 2EFD 2EFE 2EFF 2FD6
2FD7 2FD8 2FD9 2FDA 2FDB 2FDC 2FDD 2FDE 2FDF 2FE0 2FE1 2FE2 2FE3 2FE4
2FE5 2FE6 2FE7 2FE8 2FE9 2FEA 2FEB 2FEC 2FED 2FEE 2FEF 2FFC 2FFD 2FFE
2FFF 31E4 31E5 31E6 31E7 31E8 31E9 31EA 31EB 31EC 31ED 31EE 31EF 321F
A48D A48E A48F FE1A FE1B FE1C FE1D FE1E FE1F FE53 FE67
```
All of them are reserved, but those reserved regions are marked as narrow
in the UCD.
This change also offers us the chance to document exactly why we're
overriding a specific character range. Comments from the override
document will be copied to the generated CPWD table.
New in Unicode 13.0
------------------
Some widths have changed due to previously-reserved characters becoming
_used_ such as U+32FF SQUARE ERA NAME REIWA, the Tangut components
756-768, the entire Khitan Small Script character set, and the Tangut
Ideographs.
A number of the changes in this diff are due to better/worse comment
tracking and the removal of the Emoji/EPres comments. The script once
mistakenly applied comments to packed regions (and it has been updated
to not do so.)
Validation
----------
I build a test application that compared codepoints 0-FFFF for GQCW
against their new registered widths.
This commit introduces Generate-CodepointWidthsFromUCD, a powershell
(7+) script that will parse a UCD XML database in the UAX 42 format from
https://www.unicode.org/Public/UCD/latest/ucdxml/ and generate
CodepointWidthDetector's giant width array.
By default, it will emit one UnicodeRange for every range of non-narrow
glyphs with a different Width + Emoji + Emoji Presentation class;
however, it can be run in "packing" and "full" mode.
* Packing mode: ignore the width/emoji/pres class and combine adjacent
runs that CPWD will treat the same.
* This is for optimizing the number of individual ranges emitted
into code.
* Full mode: include narrow codepoints (helpful for visualization)
It also supports overrides, provided in an XML document of the same format
as the UCD itself. Entries in the overrides files are applied after the
entire UCD is read and will replace any impacted ranges.
The output (when packing) looks like this:
```c++
// Generated by Generate-CodepointWidthsFromUCD -Pack:True -Full:False
// on 05/17/2020 02:47:55 (UTC) from Unicode 13.0.0.
// 66182 (0x10286) codepoints covered.
static constexpr std::array<UnicodeRange, 23> s_wideAndAmbiguousTable{
UnicodeRange{ 0xa1, 0xa1, CodepointWidth::Ambiguous },
UnicodeRange{ 0xa4, 0xa4, CodepointWidth::Ambiguous },
UnicodeRange{ 0xa7, 0xa8, CodepointWidth::Ambiguous },
.
.
.
UnicodeRange{ 0x1f210, 0x1f23b, CodepointWidth::Wide },
UnicodeRange{ 0x1f37e, 0x1f393, CodepointWidth::Wide },
UnicodeRange{ 0x100000, 0x10fffd, CodepointWidth::Ambiguous },
};
```
The output (when overriding) looks like this:
```c++
// Generated by Generate-CodepointWidthsFromUCD.ps1 -Pack:True -Full:False -NoOverrides:False
// on 5/22/2020 11:17:39 PM (UTC) from Unicode 13.0.0.
// 321205 (0x4E6B5) codepoints covered.
// 240 (0xF0) codepoints overridden.
static constexpr std::array<UnicodeRange, 23> s_wideAndAmbiguousTable{
UnicodeRange{ 0xa1, 0xa1, CodepointWidth::Ambiguous },
...
UnicodeRange{ 0xfe20, 0xfe2f, CodepointWidth::Narrow }, // narrow combining ligatures (split into left/right halves, which take 2 columns together)
...
UnicodeRange{ 0x100000, 0x10fffd, CodepointWidth::Ambiguous },
};
```