// This file is part of ICU4X. For terms of use, please see the file
// called LICENSE at the top level of the ICU4X source tree
// (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

//! This module exists to contain implementation docs and notes for people who want to contribute.
//!
//! # Contributor Notes
//!
//! ## Development environment (on Linux) for fuzzing and generating data
//!
//! These notes assume that ICU4X itself has been cloned to `$PROJECTS/icu4x`.
//!
//! Clone ICU4C from <https://github.com/hsivonen/icu> to `$PROJECTS/icu` and switch
//! to the branch `icu4x-collator`.
//!
//! Create a directory `$PROJECTS/localicu`
//!
//! Create a directory `$PROJECTS/icu-build` and `cd` into it.
//!
//! Run `../icu/icu4c/source/runConfigureICU --enable-debug Linux --prefix $PROJECTS/localicu --enable-static`
//!
//! Run `make`
//!
//! ### Generating data
//!
//! ### Testing
//!
//! `cargo test --features serde`
//!
//! Note: Some tests depend on collation test data files. These files are copied
//! from the ICU and CLDR codebases and are stored in `tests/data/`. They must be
//! kept in sync with the collation data itself: when updating ICU4X to pick up
//! new Unicode data (including collation data) from ICU, the copies of the
//! collation test data files maintained in `icu_collator` need to be replaced
//! with their newer corresponding versions. See `tests/data/README.md` for
//! details.
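//!
//! For orientation, the kind of comparison these tests exercise looks roughly
//! like the following sketch (illustrative only; the exact constructor
//! signature and option names depend on the `icu` version in use):
//!
//! ```ignore
//! use core::cmp::Ordering;
//! use icu::collator::{Collator, CollatorOptions, Strength};
//! use icu::locid::locale;
//!
//! let mut options = CollatorOptions::new();
//! options.strength = Some(Strength::Primary);
//! let collator = Collator::try_new(&locale!("es").into(), options).unwrap();
//! // At primary strength, case and diacritic differences are ignored.
//! assert_eq!(collator.compare("árbol", "ARBOL"), Ordering::Equal);
//! ```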
//!
//! ### Fuzzing
//!
//! `cargo install cargo-fuzz`
//!
//! Clone `rust_icu` from <https://github.com/google/rust_icu> to `$PROJECTS/rust_icu`.
//!
//! In `$PROJECTS/icu-build` run `make install`.
//!
//! `cd $PROJECTS/icu4x/components/collator`
//!
//! Run the fuzzer until a panic:
//!
//! `PKG_CONFIG_PATH="$PROJECTS/localicu/lib/pkgconfig" PATH="$PROJECTS/localicu/bin:$PATH" LD_LIBRARY_PATH="$PROJECTS/localicu/lib" RUSTC_BOOTSTRAP=1 cargo +stable fuzz run compare_utf16`
//!
//! Once there is a panic, recompile with debug symbols by adding `--dev`:
//!
//! `PKG_CONFIG_PATH="$PROJECTS/localicu/lib/pkgconfig" PATH="$PROJECTS/localicu/bin:$PATH" LD_LIBRARY_PATH="$PROJECTS/localicu/lib" RUSTC_BOOTSTRAP=1 cargo +stable fuzz run --dev compare_utf16 fuzz/artifacts/compare_utf16/crash-$ARTIFACTHASH`
//!
//! Record with
//!
//! `LD_LIBRARY_PATH="$PROJECTS/localicu/lib" rr fuzz/target/x86_64-unknown-linux-gnu/debug/compare_utf16 -artifact_prefix=$PROJECTS/icu4x/components/collator/fuzz/artifacts/compare_utf16/ fuzz/artifacts/compare_utf16/crash-$ARTIFACTHASH`
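//!
//! The overall shape of the `compare_utf16` fuzz target in
//! `fuzz/fuzz_targets/` is roughly the following sketch (illustrative only;
//! the real target also drives ICU4C through `rust_icu` and asserts that the
//! two implementations agree):
//!
//! ```ignore
//! #![no_main]
//! use libfuzzer_sys::fuzz_target;
//!
//! fuzz_target!(|data: &[u8]| {
//!     // Reinterpret the fuzzer input as UTF-16 code units and split it into
//!     // two halves to compare against each other.
//!     let units: Vec<u16> = data
//!         .chunks_exact(2)
//!         .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
//!         .collect();
//!     let (left, right) = units.split_at(units.len() / 2);
//!     let collator = icu::collator::Collator::try_new(
//!         &icu::locid::locale!("und").into(),
//!         icu::collator::CollatorOptions::new(),
//!     )
//!     .unwrap();
//!     // Must not panic for any input, including unpaired surrogates.
//!     let _ = collator.compare_utf16(left, right);
//! });
//! ```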
//!
//! # Design notes
//!
//! * The collation element design comes from ICU4C. Some parts of the ICU4C design,
//!   notably `Tag::BuilderDataTag`, `Tag::LeadSurrogateTag`, `Tag::LatinExpansionTag`,
//!   `Tag::U0000Tag`, and `Tag::HangulTag`, are unused.
//!   - `Tag::LatinExpansionTag` might be reallocated to search expansions for archaic jamo
//!     in the future.
//!   - `Tag::HangulTag` might be reallocated to compressed hanja expansions in the future.
//!     See [issue 1315](https://github.com/unicode-org/icu4x/issues/1315).
//! * The key design difference between ICU4C and ICU4X is that ICU4C puts the canonical
//!   closure in the data (making the data larger) to enable direct lookup by precomposed
//!   characters, while ICU4X omits the canonical closure and instead normalizes to NFD
//!   on the fly.
//! * Compared to ICU4C, normalization cannot be turned off. There also isn't a separate
//!   "Fast Latin" mode.
//! * The normalization is fused into the collation element lookup algorithm to optimize the
//!   case where an input character decomposes into two BMP characters: a base letter and a
//!   diacritic.
//!   - To optimize away a trie lookup when the combining diacritic doesn't contract,
//!     there is a linear lookup table for the combining diacritics block (see the
//!     first sketch after this list). Three languages tailor diacritics: Ewe,
//!     Lithuanian, and Vietnamese. Vietnamese and Ewe load an alternative table. The
//!     Lithuanian special cases are hard-coded and activatable by a metadata bit.
//! * Unfortunately, contractions that contract starters don't fit this model nicely. Therefore,
//!   there's duplicated normalization code for normalizing the lookahead for contractions.
//!   This code can, in principle, do duplicative work, but it shouldn't be excessive with
//!   real-world inputs.
//! * As a result, in terms of code provenance, the algorithms come from ICU4C, except
//!   that the normalization part of the code is novel to ICU4X, and the contraction
//!   code is custom to ICU4X despite being informed by ICU4C.
//! * The way input characters are iterated over and resulting collation elements are
//!   buffered is novel to ICU4X.
//! * ICU4C can iterate backwards but ICU4X cannot. ICU4X keeps a buffer of the two most
//!   recent characters for handling prefixes. As of CLDR 40, there were only two kinds
//!   of prefixes: a single starter and a starter followed by a kana voicing mark.
//! * ICU4C sorts unpaired surrogates in their lexical order. ICU4X operates on Unicode
//!   [scalar values](https://unicode.org/glossary/#unicode_scalar_value) (any Unicode
//!   code point except high-surrogate and low-surrogate code points), so unpaired
//!   surrogates sort as REPLACEMENT CHARACTERs. Therefore, all unpaired
//!   surrogates are equal to each other.
//! * Skipping over a bit-identical prefix and then going back over "backward-unsafe"
//!   characters is currently unimplemented but isn't architecturally precluded.
//! * Hangul is handled specially:
//!   - Precomposed syllables are checked for as the first step of processing an
//!     incoming character (see the decomposition sketch after this list).
//!   - Individual jamo are looked up from a linear table instead of a trie. Unlike
//!     the corresponding table in ICU4C, which covers only the modern jamo used for
//!     decomposing precomposed syllables, this table covers the whole Unicode block.
//!     The point is that the search collations carry a lot of data, duplicated
//!     across the multiple search collations, for making archaic jamo searchable by
//!     modern jamo. Unfortunately, the shareable part isn't currently actually
//!     shareable, because the tailored CE32s refer to the expansions table in each
//!     collation. To make them truly shareable, the archaic jamo expansions need to
//!     become self-contained the way the Latin mini expansions in ICU4C are.
//!
//!     One possible alternative to loading a different table for "search" would be
//!     to map archaic jamo to their modern approximations as a special preprocessing
//!     step for incoming characters, which would allow the resulting modern jamo to
//!     be looked up from the normal root jamo table.
//!
//!     "searchjl" is even more problematic than "search", since "searchjl" uses
//!     prefixes matches with jamo, and currently Hangul is assumed not to participate
//!     in prefix or contraction matching.
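//!
//! As a sketch of the diacritics fast path mentioned above (names are
//! hypothetical; the real linear table and CE32 encoding live in the provider
//! data):
//!
//! ```ignore
//! /// Bounds of the Combining Diacritical Marks block (U+0300..=U+036F).
//! const COMBINING_DIACRITICS_BASE: u32 = 0x0300;
//! const COMBINING_DIACRITICS_LIMIT: u32 = 0x0370;
//!
//! /// Looks up the collation element for a combining diacritic from a linear
//! /// table, avoiding a trie lookup; returns `None` outside the block.
//! fn diacritic_ce32(c: char, table: &[u32]) -> Option<u32> {
//!     let scalar = u32::from(c);
//!     if (COMBINING_DIACRITICS_BASE..COMBINING_DIACRITICS_LIMIT).contains(&scalar) {
//!         table.get((scalar - COMBINING_DIACRITICS_BASE) as usize).copied()
//!     } else {
//!         None
//!     }
//! }
//! ```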
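//!
//! The Hangul precomposed-syllable check uses the standard arithmetic
//! decomposition from the Unicode Standard; a self-contained sketch of the
//! decomposition step:
//!
//! ```ignore
//! const S_BASE: u32 = 0xAC00; // First precomposed syllable
//! const L_BASE: u32 = 0x1100; // First leading consonant jamo
//! const V_BASE: u32 = 0x1161; // First vowel jamo
//! const T_BASE: u32 = 0x11A7; // One before the first trailing consonant jamo
//! const V_COUNT: u32 = 21;
//! const T_COUNT: u32 = 28;
//! const S_COUNT: u32 = 11172;
//!
//! /// Decomposes a precomposed Hangul syllable into (L, V, optional T) jamo,
//! /// which can then be looked up from the linear jamo table.
//! fn decompose_hangul(c: char) -> Option<(char, char, Option<char>)> {
//!     let s = u32::from(c).wrapping_sub(S_BASE);
//!     if s >= S_COUNT {
//!         return None;
//!     }
//!     let l = char::from_u32(L_BASE + s / (V_COUNT * T_COUNT))?;
//!     let v = char::from_u32(V_BASE + (s % (V_COUNT * T_COUNT)) / T_COUNT)?;
//!     let t_index = s % T_COUNT;
//!     let t = if t_index == 0 {
//!         None
//!     } else {
//!         char::from_u32(T_BASE + t_index)
//!     };
//!     Some((l, v, t))
//! }
//! ```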
//!
//! # Notes about index generation
//!
//! ICU4X currently does not have code or data for generating [collation
//! indexes](https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Indexes).
//!
//! On the data side, ICU4X doesn't have data for `<exemplarCharacters type="index">`
//! (or when that's missing, plain `<exemplarCharacters>`).
//!
//! Of the collations, `zh-u-co-pinyin`, `zh-u-co-stroke`, `zh-u-co-zhuyin`, and
//! `*-u-co-unihan` are special: they bake a contraction of U+FDD0 and an index
//! character into the collation order. ICU4X collation data already includes this.
//! For `*-u-co-unihan`, this index character data is repeated in all three tailorings
//! instead of being in the root. If it were in the root, code for extracting the
//! index characters from the collation data would need to avoid confusing the
//! `unihan` index contractions in the root with the `zh-u-co-pinyin`,
//! `zh-u-co-stroke`, and `zh-u-co-zhuyin` ones in the tailorings. This seems feasible,
//! but isn't how CLDR and ICU4C do it. (If the index characters for
//! `*-u-co-unihan` were in the root, `ko-u-co-unihan` would become a mere
//! script reordering.)
//!
//! It's unclear how useful it would be size-wise to have code to extract the
//! index characters from the collations: For `zh-u-co-pinyin`, `zh-u-co-stroke`,
//! `zh-u-co-zhuyin`, the index characters are contiguous ranges that could be
//! efficiently stored as start and end. Moreover, the in-data index character
//! for `stroke` isn't the label to be rendered to the user, so special-casing
//! is needed anyway.
//!
//! This means that there's a tradeoff between having duplicate data (relative to
//! the collation tailorings) for the `unihan` index character list vs. having
//! code for extracting the list from the tailorings. It's not at all clear that
//! having the code is better for size than having the list of 238 ideographs
//! itself as data (476 bytes as UTF-16).
//!
//! Note: Investigate [#2723](https://github.com/unicode-org/icu4x/issues/2723)