A deep dive into unicode and string matching - II
In the previous entry of this series I took a lightning tour of what Unicode is and provided some detail on the various encodings that are part of the standard (UTF-8/16/32). This serves as the baseline knowledge for further exploration of how Unicode strings work, and of some of the interesting problems that arise in this space.
Codespace organization
Before we proceed to the more practical aspects of Unicode string matching, I would like to take a brief tangent, for completeness' sake, and touch upon how code points are organized.
As we previously discussed, the Unicode standard allows for more than a million code points1 (1,114,112 to be precise). The next question is, of course: how is this code point space (codespace in Unicode parlance) organized internally? And why does that organization matter?
The codespace is not just a linear collection of code points. Characters are grouped by their attributes, such as script or writing system, and the highest-level group is the "plane", which corresponds to 64k (65,536) code points.
Plane 0, or the Basic Multilingual Plane (BMP), encodes most characters in current use (as well as some historical characters) in the first 64k code points. A nice side-effect of this is that it is possible to effectively support all current languages with a 16-bit fixed character size - although forgetting that UTF-162 is a variable-length encoding can land you in trouble!
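A quick way to see that trap in action: a code point outside the BMP needs two UTF-16 code units (a surrogate pair). Here is a minimal sketch in Python, which is also the language I'll use for the illustrative snippets in this post:

# U+1F600 (😀) lives outside the BMP, so it takes a surrogate pair in UTF-16.
emoji = "\U0001F600"
print(len(emoji))                           # 1 code point
print(len(emoji.encode("utf-16-le")) // 2)  # 2 UTF-16 code units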
Beyond the BMP there are several other planes. The Supplementary Multilingual Plane (SMP, or Plane 1) encodes seldom-used characters that didn't fit into the BMP, such as historical scripts and pictographic symbols. Beyond Plane 1 we find the Supplementary Ideographic Plane (Plane 2) and the Tertiary Ideographic Plane (Plane 3), which encode the less frequently used Chinese, Japanese and Korean (CJK) characters that don't fit in the BMP. Finally, there is the Supplementary Special-purpose Plane (SSP, Plane 14), used as a spillover for format control characters that don't fit in the BMP, and two Private Use planes (Planes 15 and 16) that are allocated for private use and expand on the private-use characters located in the BMP.
Internally, each plane is arranged into several blocks. For instance, in the BMP the area from 0x0000 to 0x00FF (the first 256 code points) matches the ISO Latin-1 encoding (and, within that, ASCII) for backward compatibility.
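To make the plane arrangement a bit more concrete, here is a minimal sketch that derives the plane of a code point simply by dividing it by 0x10000 (the 65,536 code points per plane):

# Each plane spans 0x10000 (65,536) code points, so integer division
# by 0x10000 tells us which plane a code point lives in.
def plane_of(char: str) -> int:
    return ord(char) // 0x10000

print(plane_of("A"))    # 0 -> Basic Multilingual Plane (U+0041)
print(plane_of("ç"))    # 0 -> still the BMP (U+00E7)
print(plane_of("😀"))   # 1 -> Supplementary Multilingual Plane (U+1F600)
print(plane_of("𠀀"))   # 2 -> Supplementary Ideographic Plane (U+20000)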
Diacritics and other "strange" markings
Okay, back to the regularly scheduled content: another aspect to consider is how Unicode deals with diacritics (and other marks and symbols). For the Latin alphabet this would probably be trivial (as we've seen, Unicode is compatible with the ISO 8859-1 / Latin-1 encoding), but that approach is far from an extensible mechanism, so Unicode introduces the concept of combining characters, which are essentially marks that are placed relative to a base character. The convention is that the combining characters are applied after the base character.
An interesting fact is that more than one combining character may be applied to a single base character. This opens the door to some very creative uses, like building an audio spectrum analyzer bar by "stacking" combining characters. The code needs some tweaking and I will update it later.
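In the meantime, here is a minimal sketch of the underlying trick (not the analyzer itself): stacking a varying number of combining marks on a base character. Whether the marks actually pile up visually depends on the font and renderer:

# Multiple combining marks can be applied to a single base character.
# Here we stack COMBINING OVERLINE (U+0305) marks on a FULL BLOCK character;
# many renderers draw the stacked marks progressively higher, which is the
# trick behind the "spectrum bar" hack.
BASE = "\u2588"   # FULL BLOCK
MARK = "\u0305"   # COMBINING OVERLINE

def bar(height: int) -> str:
    return BASE + MARK * height

print(" ".join(bar(h) for h in (1, 3, 5, 3, 1)))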
There are exceptions to the combining-character principle, mostly for backward compatibility reasons, which means that different but equivalent sequences exist.
For instance, the character ç can be represented by the single code point U+00E7, or by U+0063 (the letter c) followed by U+0327 (the combining cedilla).
Now this poses an interesting question: are vanilla string classes in popular programming languages aware of this when comparing strings?
Let's start with a basic example to see how this actually works (the content below is rendered in a React component):
Encoding using a single code point: ç
Encoding using combining characters: ç
Comparison using === : False
If you want to test it yourself, you can paste the following code into your browser's debug console:
const singleCodePoint = String.fromCodePoint(0xE7);
const combiningCharacters = String.fromCodePoint(0x63, 0x327);
console.log("Single code input: " + singleCodePoint);
console.log("Combining characters: " + combiningCharacters);
console.log(singleCodePoint === combiningCharacters);
One could say this is a JavaScript quirk, however that is not the case. If you have Python installed on your system (please use Python 3) you can test the following code:
singleCodePoint = chr(0xE7)
combiningCharacters = chr(0x63) + chr(0x327)
print("Single code input: " + singleCodePoint)
print("Combining characters: " + combiningCharacters)
print(singleCodePoint == combiningCharacters)
Clearly vanilla string comparison fails for strings that are visually and semantically equivalent3, which is not good and may break applications in weird and wonderful ways (e.g. think about the effects of this on data structures like sets or maps/dictionaries).
And if you think this is exclusive to those pesky interpreted languages, well, even the trusty old compareToIgnoreCase in Java fails this test:
void main() {
    // Build both strings directly from code point arrays (note the count
    // argument: the combining sequence has two code points).
    String singleCodePoint = new String(new int[]{0xE7}, 0, 1);
    String combiningCharacters = new String(new int[]{0x63, 0x327}, 0, 2);
    System.out.println("Single code input: " + singleCodePoint);
    System.out.println("Combining characters: " + combiningCharacters);
    System.out.println(singleCodePoint.compareToIgnoreCase(combiningCharacters) == 0);
}
To run this you can simply paste the code above into a .java file, in this case Main.java, and compile it:
$ javac --source 21 --enable-preview Main.java
$ java --enable-preview Main
Unsurprisingly, at this point the last line outputs false, meaning that the two strings are not considered equal. So what can be done about this?
Normalization
Clearly comparing Unicode strings is not as straightforward as one may think, especially when dealing with strings that can be considered to be equivalent (as in the examples above). Fortunately the Unicode standard defines algorithms to create normalized forms that eliminate unwanted distinctions.
In order to understand how Unicode normalization works it's important to understand the concepts of canonical equivalence and compatibility equivalence.
Canonical equivalence: Two strings are said to be canonical equivalents if their full canonical decompositions are identical. For example:
- Combining sequences: 0x00E7 is equivalent to 0x0063, 0x0327
- Ordering of combining marks: q+◌̇+◌̣ is equivalent to q+◌̣+◌̇
- Singleton equivalence: U+212B (Angstrom Sign) is equivalent to U+00C5 (Latin Capital Letter A with Ring Above). In the normalization process singletons will be replaced.
- Hangul & conjoining jamo
Note that language-specific rules for matching and ordering may treat letters differently from the canonical equivalence (more on that in a later post).
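Here is a minimal Python sketch of canonical equivalence in action, using the standard library's unicodedata.normalize to compare the full canonical decompositions (NFD) of the examples above:

import unicodedata

# Two strings are canonically equivalent if their full canonical
# decompositions (NFD) are identical.
def canonically_equivalent(a: str, b: str) -> bool:
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Combining sequence: U+00E7 vs U+0063 + U+0327
print(canonically_equivalent("\u00E7", "\u0063\u0327"))          # True
# Ordering of combining marks: dot above/dot below in either order
print(canonically_equivalent("q\u0307\u0323", "q\u0323\u0307"))  # True
# Singleton: U+212B (Angstrom Sign) vs U+00C5
print(canonically_equivalent("\u212B", "\u00C5"))                # True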
Compatibility equivalence: Two character sequences are said to be compatibility equivalents if their full compatibility decompositions are identical. This is a weaker type of equivalence, so greater care should be taken to ensure the equivalence is appropriate. A compatibility decomposition is an algorithm that maps an input character using both the canonical mappings and the compatibility mappings found in the Unicode Character Database. For example4 (a short sketch follows this list):
- Font variants
- Linebreaking differences
- Positional forms
- Circled variants
- Width variants
- Rotated variants
- Superscripts/Subscripts
- Squared characters
- Fractions
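Again as a rough sketch, compatibility decomposition (NFKD) folds several of these presentation variants away, for example the fi ligature, superscripts and fullwidth (width-variant) forms:

import unicodedata

examples = [
    "\uFB01",  # LATIN SMALL LIGATURE FI          -> "fi"
    "\u00B2",  # SUPERSCRIPT TWO                  -> "2"
    "\uFF21",  # FULLWIDTH LATIN CAPITAL LETTER A -> "A"
]
for s in examples:
    print(s, "->", unicodedata.normalize("NFKD", s))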
Unicode offers four normalization forms (NF)5, which either break composite characters apart (decomposition) or convert to composite characters (composition):
| Normalization Form | Type | Description | Example |
| --- | --- | --- | --- |
| NFD | Decomposition | Canonical decomposition of a string | U+00C5 is equivalent to U+0041, U+030A |
| NFKD | Decomposition | Compatibility decomposition of a string (in many cases this will yield results similar to NFD) | U+FB01 is equivalent to U+0066, U+0069 |
| NFC | Composition | Canonical composition after the canonical decomposition of a string | U+0041, U+030A is equivalent to U+00C5 |
| NFKC | Composition | Canonical composition after the compatibility decomposition of a string | U+1E9B, U+0323 is equivalent to U+1E69 |
The following example (rendered in a React component) shows the normalization forms in action6:
NFD of Å (U+212B) = Å (U+0041, U+030A)
NFKD of fi (U+FB01) = fi (U+0066, U+0069)
NFC of Å (U+0041, U+030A) = Å (U+00C5)
NFKC of ẛ̣ (U+1E9B, U+0323) = ṩ (U+1E69)
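To close the loop with the earlier failing comparisons, here is a minimal Python sketch: normalize both sides to the same form (NFC in this case) before comparing. JavaScript exposes the same functionality through String.prototype.normalize, and Java through java.text.Normalizer.

import unicodedata

single_code_point = chr(0xE7)                  # ç as a single code point
combining_characters = chr(0x63) + chr(0x327)  # c + combining cedilla

# The raw comparison still fails...
print(single_code_point == combining_characters)  # False

# ...but comparing the normalized forms (NFC or NFD, as long as both sides
# use the same one) succeeds.
print(unicodedata.normalize("NFC", single_code_point)
      == unicodedata.normalize("NFC", combining_characters))  # True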
What does this mean in practice?
The normalization forms modify the text and may result in the loss of important semantic information, so they are best treated like the typical uppercase and lowercase conversions, i.e. definitely very useful, but not always appropriate depending on the context.
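As a small illustration of that potential loss of information, the compatibility forms happily erase the distinction between a superscript and a plain digit, changing the meaning of a formula:

import unicodedata

formula = "x\u00B2 + y\u00B2"                  # "x² + y²"
print(unicodedata.normalize("NFKC", formula))  # "x2 + y2" - not the same thing!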
In the next post in this series we're going to apply normalization, plus a few other tricks for more realistic scenarios such as matching of names, so stay tuned!
Footnotes
- Recall that code points correspond to characters and: "Characters are the abstract representations of the smallest components of written language that have semantic value. They represent primarily, but not exclusively, the letters, punctuation, and other signs that constitute natural language text and technical notation". Chapter 2 of the Unicode standard offers an interesting discussion of the underlying design philosophy of the standard, as well as some notable situations where deviations from key principles were required in order to ensure backward compatibility (Section 2.3, Compatibility Characters).
- Here is a very interesting design document from around the time the Java Platform added support for characters in the SMP and beyond (requiring more than 16 bits per char).
- Malicious actors can take this one step further and craft payloads that leverage non-printable or graphically similar characters. See this technical report, in particular the section about "confusables", for further detail.
- Check Chapter 3, Section 11 of the Unicode standard for more details on normalization forms.