
3 posts tagged with "unicode"

Posts related to the Unicode standard or related experiments


Unicode audio analyzer

· 4 min read
Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

What do Unicode, audio processing and admittedly bad early-2000s Internet memes have to do with one another?

In the previous post in the deep dive into Unicode series we explored how combining characters, like diacritics, work. One interesting property of Unicode is that multiple combining characters can be applied to the same base character.

Stacked characters: ç̰̀́̂

The example above shows the letter c with several combining characters, and as we can see they stack up quite nicely. This is the basis for an early-2000s Internet meme called Zalgo text1. We can take this to the next level with a "Winamp-style" analyzer bar, rendered entirely in (Zalgo) text for an extra metal look and feel. 🤘 🤘

----------------------------------------------------------------

The reality is that this was mostly an excuse to play around with the Web Audio API2 and some modern React (I hadn't touched front-end development in a while), and there were a few learnings along the way.

Implementation and technicalities

From an implementation perspective the first challenge was to understand what the Web Audio API offers in terms of digital signal processing and how to use it. The documentation is excellent, and the gist of it is that audio operations happen inside an AudioContext, which represents an audio processing graph built from several AudioNodes linked together so that the output of one node serves as the input of the next. Because I wanted to extract the frequency domain representation of the audio signal in order to render it on screen, I used an AnalyserNode3, which doesn't modify the audio but returns data about the frequency domain, computed with a trusty old FFT4.

The following code example puts all of these concepts together:

// audioNode is a React ref pointing at the <audio> element that plays the track.
const context = new AudioContext();
const theAnalyser = context.createAnalyser();
const source = context.createMediaElementSource(audioNode.current);
// build the audio processing graph connecting the input source
// to the analyser node, and the output of the analyser to the
// output of the Audio Context.
source.connect(theAnalyser);
theAnalyser.connect(context.destination);

Another interesting learning was about the advantages of requestAnimationFrame5 (RAF) over a plain old setInterval for rendering. Since I wanted smooth, performant updates, RAF was a good fit: its callback rate tries to match the display's refresh rate, and calls are paused when the tab is hidden or in the background - meaning better performance and battery life.
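Putting the two together, here is a minimal sketch of what the render loop can look like. It reuses theAnalyser from the snippet above; renderBars is a hypothetical helper that maps the frequency bins to the stacked-character bars:

const bins = new Uint8Array(theAnalyser.frequencyBinCount);

function draw() {
  // fills bins with the current magnitude (0-255) of each frequency bin
  theAnalyser.getByteFrequencyData(bins);
  renderBars(bins);            // hypothetical: update the text-based bars on screen
  requestAnimationFrame(draw); // schedule the next frame
}

requestAnimationFrame(draw);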

Finally, why not put everything together in a nice NPM package? Since I don't usually work in the JS ecosystem it was a nice opportunity to get some hands-on experience with this. The npmjs6 documentation is very good and the setup was straightforward, especially if you've published packages in Maven Central, Artifactory or equivalent. Top marks there. You can find the package here: https://www.npmjs.com/package/@felix.bruno/zalgo-player and installation is of course super easy:

$ npm install @felix.bruno/zalgo-player

This being the JavaScript/TypeScript ecosystem, not everything was smooth sailing: I discovered that Create React App7 still doesn't support TypeScript 5, and its GitHub activity seems a bit dead, which is a bit of a bummer. After spending some time looking around, Vite8 seemed like a decent choice to set up a basic React library with properly configured TypeScript support.

In this case, since I wanted to publish only a React component and not a full-blown web application, I had to make some changes to what Vite offers out-of-the-box9, but I am quite happy with the end result. The npm module is less than 15KB uncompressed and has no dependencies (since this is a React component it can only be used in that context, and thus we don't need to ship React with the package).
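For reference, here is a minimal sketch of the kind of Vite library-mode configuration this implies - the entry path, file name and formats below are assumptions for illustration, not the project's actual setup:

// vite.config.js - a sketch of library mode with React externalized.
import { defineConfig } from "vite";

export default defineConfig({
  build: {
    lib: {
      entry: "src/index.ts", // assumed entry point
      formats: ["es"],
      fileName: "zalgo-player",
    },
    rollupOptions: {
      // don't bundle React: the host application provides it
      external: ["react", "react/jsx-runtime"],
    },
  },
});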

The code is available in Github: https://github.com/felix19350/zalgo-player

Note: in a future iteration I will work on making the component responsive, so if you are viewing this on a mobile phone it may not render very well.


Footnotes

  1. Zalgo - Wikipedia

  2. Web Audio API docs

  3. Web audio visualizations and AnalyserNode docs

  4. FFT - Fast Fourier Transform. This video provides a nice intuition for how Fourier Transforms work in general, so go watch it!

  5. requestAnimationFrame documentation

  6. npmjs documentation

  7. Create React App

  8. Vite

  9. This article was quite helpful to get me up-to-speed on the changes that I needed to make in order to publish the ZalgoPlayer component as a library.

A deep dive into unicode and string matching - II

· 8 min read
Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

In the previous entry of this series I went through a lightning tour of what Unicode is and provided some details about the various encodings that are part of the standard (UTF-8/16/32). This serves as the baseline knowledge for further exploration of how Unicode strings work and some of the interesting problems that arise in this space.

Codespace organization

Before we proceed with more practical aspects of Unicode string matching, I would like to take a brief tangent, for completeness' sake, and touch upon how code points are organized.

As we previously discussed, the Unicode standard allows for more than a million code points1 (1,114,112 to be precise). The next question is, of course: how is this code point space (codespace in Unicode parlance) organized internally? And why does that organization matter?

The codespace is not just a linear collection of code points. Characters are grouped by their attributes, such as script or writing system, and the highest-level grouping is the "plane", each of which corresponds to 64K (65,536) code points.

Plane 0, or the Basic Multilingual Plane (BMP), encodes most characters in current use (as well as some historical characters) in the first 64K code points. A nice side effect of this is that it is possible to effectively support all current languages with a fixed 16-bit character size - although forgetting that UTF-162 is a variable-length encoding can land you in trouble!
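To see how that trouble shows up in practice, here is a small sketch you can run in a browser console: characters outside the BMP take two UTF-16 code units, so JavaScript's length property, which counts code units, no longer matches the number of characters a reader perceives.

const gClef = "\u{1D11E}";                       // MUSICAL SYMBOL G CLEF, a code point outside the BMP
console.log(gClef.length);                       // 2 - two UTF-16 code units (a surrogate pair)
console.log([...gClef].length);                  // 1 - iterating by code points
console.log(gClef.codePointAt(0).toString(16));  // "1d11e"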

Beyond the BMP there are several other planes. The Supplementary Multilingual Plane (SMP, or Plane 1) encodes seldomly used characters that didn't fit into the BMP, historical scripts and pictographic symbols. Beyond Plane 1 we find the Supplementary Ideographic Plane (Plane 2) and the Tertiary Ideographic Plane (Plane 3), which encode the less frequent Chinese, Japanese and Korean (CJK) characters that don't fit in the BMP. Then there is the Supplementary Special-purpose Plane (SSP, Plane 14), used as a spillover for format control characters that don't fit in the BMP, and finally two Private Use planes (Planes 15 and 16), which are allocated for private use and expand on the private-use characters located in the BMP.

Internally, each plane is arranged into several blocks. For instance, in the BMP the area from 0x0000 to 0x00FF (the first 256 code points) matches the ISO Latin-1 and ASCII encodings for backward compatibility.

Diacritics and other "strange" markings

Okay, back to the regularly scheduled content: an aspect to consider is how Unicode deals with diacritics (and other marks and symbols). For the Latin alphabet alone this would probably be trivial (as we've seen, Unicode is compatible with the ISO 8859-1 / Latin-1 encoding), but that is far from an extensible mechanism, so Unicode introduces the concept of combining characters, which are essentially marks that are placed relative to a base character. The convention is that the combining characters are applied after the base character.

An interesting fact is that more than one combining character may be applied to a single base character. This opens the door to some very creative uses, like building an audio spectrum analyzer bar out of "stacked" combining characters. The code needs some tweaking and I will update it later:

----------------------------------------------------------------
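In the meantime, here is a minimal sketch of the stacking idea behind it (the choice of combining mark and the scaling are arbitrary):

// Build a "bar" by stacking the same combining mark (U+030D, COMBINING
// VERTICAL LINE ABOVE) on top of a base character.
function bar(base, height) {
  return base + "\u030D".repeat(height);
}

// One bar per frequency bin: louder bins (0-255) become taller stacks (0-8).
function analyzerRow(bins) {
  return [...bins].map((value) => bar("|", Math.round((value / 255) * 8))).join(" ");
}

console.log(analyzerRow([0, 64, 128, 255]));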

There are exceptions to this principle, mostly for backward-compatibility reasons, which means that there can be different but equivalent sequences.

For instance, the character ç can be represented by the single code point U+00E7, or by U+0063 (the c) followed by U+0327 (the combining cedilla).

Now this poses an interesting question: are vanilla string classes in popular programming languages aware of this when making string comparisons?

Let's start with a basic example to see how this actually works (the content below is rendered in a React component):

Encoding using a single code point: ç

Encoding using combining characters: ç

Comparison using === : False

If you want to test it yourself, you can paste the following code in your browser's debug console:

const singleCodePoint = String.fromCodePoint(0xE7);
const combiningCharacters = String.fromCodePoint(0x63, 0x327);

console.log("Single code input: " + singleCodePoint);
console.log("Combining characters: " + combiningCharacters);
console.log(singleCodePoint === combiningCharacters);

One could say this is a JavaScript quirk, but that is not the case. If you have Python installed on your system (please use Python 3) you can test the following code:

singleCodePoint = chr(0xE7)
combiningCharacters = chr(0x63) + chr(0x327)
print("Single code input: " + singleCodePoint)
print("Combining characters: " + combiningCharacters)
print(singleCodePoint == combiningCharacters)

Clearly vanilla string comparison fails for strings that are visually and semantically equivalent3, which is not good and may break applications in weird and wonderful ways (e.g. think about the effects of this in data structures like sets or maps/dictionaries).

And if you think this is exclusive to those pesky interpreted languages, well, even the trusty old compareToIgnoreCase in Java fails this test:

void main() {
    // Build both strings directly from code points.
    String singleCodePoint = new String(new int[]{0xE7}, 0, 1);
    String combiningCharacters = new String(new int[]{0x63, 0x327}, 0, 2);

    System.out.println("Single code input: " + singleCodePoint);
    System.out.println("Combining characters: " + combiningCharacters);
    System.out.println(singleCodePoint.compareToIgnoreCase(combiningCharacters) == 0);
}

To run this you can simply paste the code above into a .java file, in this case Main.java, and compile it:

$ javac --source 21 --enable-preview Main.java
$ java --enable-preview Main

Unsurprisingly at this point, the last line outputs false, meaning the two strings are not considered equal. So what can be done about this?

Normalization

Clearly comparing Unicode strings is not as straightforward as one may think, especially when dealing with strings that can be considered to be equivalent (as in the examples above). Fortunately the Unicode standard defines algorithms to create normalized forms that eliminate unwanted distinctions.

In order to understand how Unicode normalization works it's important to understand the concepts of canonical equivalence and compatibility equivalence.

Canonical equivalence: Two strings are said to be canonical equivalents if their full canonical decompositions are identical. For example:

  • Combining sequences: U+00E7 is equivalent to U+0063, U+0327
  • Ordering of combining marks: q+◌̇+◌̣ is equivalent to q+◌̣+◌̇
  • Singleton equivalence: U+212B (Angstrom Sign) is equivalent to U+00C5 (Latin Capital Letter A with Ring Above). In the normalization process singletons will be replaced.
  • Hangul & conjoining jamo

Note that language specific rules for matching and ordering may treat letters differently from the canonical equivalence (more on that in a later post).

Compatibility equivalence: Two character sequences are said to be compatibility equivalents if their full compatibility decompositions are identical. This is a weaker type of equivalence, so greater care should be taken to ensure the equivalence is appropriate. A compatibility decomposition is an algorithm that maps an input character using both the canonical mappings and the compatibility mappings found in the Unicode Character Database. For example4:

  • Font variants
  • Linebreaking differences
  • Positional forms
  • Circled variants
  • Width variants
  • Rotated variants
  • Superscripts/Subscripts
  • Squared characters
  • Fractions

Unicode offers four normalization forms (NF)5, which either break apart composite characters (decomposition) or convert to composite characters (composition):

Normalization form | Type | Description | Example
NFD | Decomposition | Canonical decomposition of a string | U+00C5 is equivalent to U+0041, U+030A
NFKD | Decomposition | Compatibility decomposition of a string (in many cases this will yield results similar to NFD) | U+FB01 is equivalent to U+0066, U+0069
NFC | Composition | Canonical composition after the canonical decomposition of a string | U+0041, U+030A is equivalent to U+00C5
NFKC | Composition | Canonical composition after the compatibility decomposition of a string | U+1E9B, U+0323 is equivalent to U+1E69

The following example (rendered in a React component) shows the normalization forms in action6:

NFD of Å (U+212B) = Å (U+0041, U+030A)

NFKD of ﬁ (U+FB01) = fi (U+0066, U+0069)

NFC of Å (U+0041, U+030A) = Å (U+00C5)

NFKC of ẛ̣ (U+1E9B, U+0323) = ṩ (U+1E69)
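In JavaScript the normalization forms are exposed through the normalize method of String, which makes it easy to fix the comparison from the earlier ç example - a small sketch:

const singleCodePoint = String.fromCodePoint(0xe7);            // "ç" as U+00E7
const combiningCharacters = String.fromCodePoint(0x63, 0x327); // "c" + combining cedilla

console.log(singleCodePoint === combiningCharacters);                                   // false
console.log(singleCodePoint.normalize("NFC") === combiningCharacters.normalize("NFC")); // true
console.log(singleCodePoint.normalize("NFD") === combiningCharacters.normalize("NFD")); // true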

What does this mean in practice?

The normalization forms modify the text and may result in the loss of important semantic information, so they are best treated like the typical uppercase and lowercase transformations, i.e. definitely very useful, but not always appropriate depending on the context.

In the next post in this series we're going to apply normalization, plus a few other tricks for more realistic scenarios such as matching of names, so stay tuned!


Footnotes

  1. Recall that code points correspond to characters and: "Characters are the abstract representations of the smallest components of written language that have semantic value. They represent primarily, but not exclusively, the letters, punctuation, and other signs that constitute natural language text and technical notation". Chapter 2 of the Unicode standard offers an interesting discussion of the underlying design philosophy of the standard, as well as some notable situations where deviations from key principles were required in order to ensure backward compatibility (section 2.3, Compatibility Characters).

  2. Here is a very interesting design document from around the time the Java Platform added support for characters in the SMP and beyond (requiring more than 16 bits per char).

  3. Malicious actors can take this one step further and craft payloads that leverage non-printable or graphically similar characters. See this technical report, in particular the section about "confusables" for further detail.

  4. See Annex 15 of the Unicode standard

  5. Check chapter 3, section 11 of the unicode standard for more details on normalization forms.

  6. Check the normalize method of String.

A deep dive into unicode and string matching - I

· 8 min read
Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

"The ecology of the distributed high-tech workplace, home, or school is profoundly impacted by the relatively unstudied infrastructure that permeates all its functions" - Susan Leigh Star

Representing, processing, sending and receiving text (also known as strings in computer-speak) is one of the most common things computers do. Text representation and manipulation, in a broad sense, has a quasi-infrastructural1 quality to it: we all use it in one form or another, and it generally works really well - so well, in fact, that we often don't pay attention to how it all works behind the scenes.

As software developers it is key to have a good grasp of how computers represent text, especially since in certain application domains even supposedly "simple" operations, like assessing whether two strings are the same, have a surprising depth to them. After all, should the string "Bruno Felix" be considered the same as the string "Bruno Félix" (note the acute "e" in the last name)? The answer, of course, is: "it depends".

But let's start from the beginning. In this series I am going to explore the Unicode standard, going a bit beyond the bare minimum every developer needs to know and into some aspects that are not widely talked about, in particular how characters with diacritics are represented and how this impacts common operations like string comparison and ordering.

The briefest history of text representations ever

There is already a lot of good material out there about how computers represent text2, which I highly recommend. For the sake of this article I'm going to speed-run through the history of how computers represent text, in order to dive in more detail into the internals of Unicode.

So, in a very abbreviated manner: computers work with numbers, so the obvious thing to do to represent text is to assign a number to each letter. This is basically true to this day. Since the initial developments in digital computing were done in the USA and the UK, English became (and still is) the lingua franca of computing. It was straightforward to come up with a mapping between every letter in the English alphabet, digits, common punctuation marks plus some control characters, and a number. This mapping is called an encoding, and probably the oldest encoding you will find out there in the wild is ASCII, which does exactly this for the English alphabet - using only 7 bits.

We can actually see this in action if we save the string Hello in ASCII in a file and dump its hexadecimal content.

$ hexdump -C hello-ascii.txt
00000000 48 65 6c 6c 6f |Hello|
00000005

Of course this was far from perfect, especially if you're not an English speaker. What about all the other languages out there in the world? And how could they be supported in a way that maintains backward compatibility with ASCII?

Since computers work in multiples of 2, and thus with 8-bit bytes, ASCII leaves one bit available. This makes the 7-bit ASCII encoding quite easy to extend by adding an additional bit (nice, as it maintains backward compatibility), doubling the number of available characters and still making everything fit in a single byte. Amazing! This is actually what vendors did, and it eventually got standardized in encodings like ISO 8859-1. Of course 256 characters are not enough to fit every character for every writing system out there, so the way this was initially approached was to have several "code pages" that map the same one-byte numbers to different alphabets (e.g. the Latin alphabet has one code page, the Greek alphabet has another). A consequence of this is that in order to make sense of a piece of text one needs additional meta-information about which code page to use: after all, character 233 will change depending on the code page (and of course this still doesn't work for alphabets with thousands of characters).

If we have the string Félix written in ISO-8859-1 (Latin) and read the same string in ISO-8859-7 (Greek) we get Fιlix, despite the fact that the bytes are exactly the same!

$ hexdump -C felix-code-page-confusion.txt
00000000 46 e9 6c 69 78 |F.lix|
00000005
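The same confusion can be reproduced in a browser console with TextDecoder, which still understands these legacy encodings - a small sketch using the exact bytes from the dump above:

const bytes = new Uint8Array([0x46, 0xe9, 0x6c, 0x69, 0x78]);
console.log(new TextDecoder("iso-8859-1").decode(bytes)); // "Félix"
console.log(new TextDecoder("iso-8859-7").decode(bytes)); // "Fιlix"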

A brief intro to Unicode and character encodings

The example above is not only an interoperability nightmare, but it still doesn't cover all languages in use, so it's not a sustainable solution to the issue of text representation. Unicode3 tries to address these issues by starting from a simple idea: each character is assigned its own number. That is, é and ι are assigned different numbers (code points in Unicode terminology). Code points are typically written as U+ followed by the hexadecimal value of the character, so for instance é is represented as U+00E9 and ι as U+03B9. The code points were also chosen in such a way that backward compatibility is maintained with ISO-8859-1 and ASCII.
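A small sketch to see this from a browser console - codePointAt returns the code point, regardless of how the string happens to be stored internally:

console.log("é".codePointAt(0).toString(16));   // "e9"  -> U+00E9
console.log("ι".codePointAt(0).toString(16));   // "3b9" -> U+03B9
console.log(String.fromCodePoint(0xe9, 0x3b9)); // "éι"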

Currently Unicode has defined code points for more than 149,186 characters, and this covers not only languages that are in active use today, but also historical scripts (e.g. Cuneiform) and fictional languages (e.g. Tengwar4) - although it is important to note that most characters in common use are encoded in the first 65,536 code points. In total Unicode allows the definition of up to 1,114,112 code points, so it is quite future-proof.

An important thing to note is that code points don't specify anything at all about how they are converted to actual bytes in memory or on disk - they are abstract ideas. This is one of the key things to keep in mind when thinking about Unicode: there is, by design, a clear difference between identifying a character, representing it in a way that a computer can process, and rendering a glyph on a screen.

The good thing about standards is that you get to choose - Unknown

So if code points are notional, abstract ideas, how can computers make use of them? This is where encodings come into the picture. The Unicode Standard offers several different options to represent characters: a 32-bit form (UTF-32), a 16-bit form (UTF-16), and an 8-bit form (UTF-8). The key idea here is that a code point (which is an integer at the end of the day) is represented by one or more code units, and UTF-32/16/8 offer different sizes for these code units.

When working with UTF-32 and UTF-16, the endianness - that is, the order in which the most and least significant bytes are stored - needs to be considered.

UTF-32 provides fixed-length code units, making it simple to process. This comes at the cost of increased memory or disk storage space per character.

$ hexdump -C hello-utf32-little-endian.txt
00000000 48 00 00 00 65 00 00 00 6c 00 00 00 6c 00 00 00 6f 00 00 00 |H...e...l...l...o...|
00000014

$ hexdump -C hello-utf32-big-endian.txt
00000000 00 00 00 48 00 00 00 65 00 00 00 6c 00 00 00 6c 00 00 00 6f |...H...e...l...l...o|
00000014

UTF-16 provides a balance between processing efficiency and storage requirements. This is because all the commonly used characters fit into a single 16-bit code unit, but it is important to keep in mind that this is still a variable length encoding (a code-point may span one or two code units). Fun fact: the JVM and the CLR use UTF-16 strings internally.

$ hexdump -C hello-utf16-little-endian.txt
00000000 48 00 65 00 6c 00 6c 00 6f 00 |H.e.l.l.o.|
0000000a

$ hexdump -C hello-utf16-big-endian.txt
00000000 00 48 00 65 00 6c 00 6c 00 6f |.H.e.l.l.o|
0000000a

Finally, UTF-8 is a byte-oriented, variable-length encoding (so be careful with the assumption that each character is one byte - that is not what the 8 in UTF-8 means!). Since it is byte oriented, and the code points have been carefully chosen, this encoding is backward compatible with ASCII (note the example below). On the other hand, a code point may be anywhere from one to four 8-bit code units long, so processing is more complex.

$ hexdump -C hello-utf8.txt
00000000 48 65 6c 6c 6f |Hello|
00000005
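The variable-length nature of UTF-8 is easy to observe with TextEncoder (which always encodes to UTF-8) - a small sketch:

const encoder = new TextEncoder();           // TextEncoder always produces UTF-8
console.log(encoder.encode("Hello").length); // 5 - one byte per ASCII character
console.log(encoder.encode("Félix").length); // 6 - "é" takes two bytes (0xc3 0xa9)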

This essentially allows programmers to trade off processing simplicity against resources (memory or storage space) and backward-compatibility requirements, according to the specific needs of their application. The following picture (directly from Chapter 2 of the Unicode Standard5) may further clarify things:

UTF-32, UTF-16 and UTF-8 and the respective code units

Hopefully this brief intro provides a good foundation as to why Unicode has become the de facto way to represent text, and clarifies the key difference between code points and encodings. It serves as a stepping stone to further explore the Unicode Standard: in the next post in this series I will dive a bit deeper into how code points are structured, the types of characters that exist, and how they can be combined and normalized.


Footnotes

  1. The Ethnography of Infrastructure

  2. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

  3. Unicode technical guide

  4. Tengwar and Unicode

  5. Unicode standard - Chapter 2