I recently got interested in encoding, and since I spent a lot of time reading about it, I thought I might as well write a summary of what I learnt, which might be useful for others.

If the whole article is too long, please read just section 4.

There are many steps on the way from a handwritten text to a final text document on a computer. In this article, I would like to explain how computers were taught to store text (with a focus on Unicode). In section 1, I describe how characters can be systematised so that computers can understand them. In section 2, the way characters are encoded (stored as zeros and ones) is discussed. Section 3 briefly goes over the ways in which one can input characters into a computer, and section 4 is the take-home message of the whole story.

Appendix A demonstrates the pitfalls of non-Unicode encodings, using LaTeX as an example.

In appendix B, you can download a simple Python script that analyses strings of characters in the Unicode system.

1 Character set/code point set

The first step is to teach the computer the characters that we want to use when typing our text. All a character set does is assign a number (an integer) to each character. Nothing more, nothing less. A character set may deliberately leave some numbers unassigned to simplify the encoding of the characters, if a particular encoding is in mind. The oldest character set still in widespread use is ASCII, which includes only the Latin alphabet, a few symbols, and some control characters. This is because it assumed that every character has to be stored in 7 bits (zeros or ones), which allows for 128 characters. When computers switched to using 8 bits (1 byte) as the basic unit of information, the maximum number of characters doubled, which resulted in a series of 8-bit encodings, most notably those specified in ISO/IEC 8859, which contains encodings for all European languages and some others (you may have heard of Latin-1, Latin-2, etc.). I started to use the word encoding here, because the standard specifies both the character set and the encoding. However, these two are different, as we shall see in the next section. These encodings were a step forward, because now you could save many languages in digital files. The problem, however, was that for each file you needed to know what its encoding was, and within each file you had to stay within one set of characters.
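To see why, here is a small Python sketch (the byte value 0xE8 is an arbitrary example): the very same byte decodes to a different character under different 8-bit encodings.

    data = b"\xe8"  # a single byte with the value 0xE8

    print(data.decode("latin-1"))    # è (ISO/IEC 8859-1, Western European)
    print(data.decode("iso8859-2"))  # č (ISO/IEC 8859-2, Central European)
    print(data.decode("cp1251"))     # и (Windows-1251, Cyrillic)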

A remedy to this problem is called Unicode. Unicode is a set of code points whose values range from 0 to 1,114,111. For comparison, if you store a number in 4 bytes (32 bits), you get 2³² options, which is 4,294,967,296. Moreover, some numbers within the Unicode range are deliberately never assigned, for various reasons (the surrogates used by UTF-16, discussed below, are one example). Nevertheless, this number of characters is entirely sufficient for all languages in the world and beyond (more recently emoji, for example).

If you want to explicitly quote a Unicode code point, the recommended format is U+E010A, where the string of letters and numbers after the plus sign is the number of the code point in the hexadecimal system.
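In Python, for example, you can produce this notation for any character with ord and hexadecimal formatting (a small illustrative snippet):

    for ch in "Aé€":
        print(f"{ch} = U+{ord(ch):04X}")
    # A = U+0041
    # é = U+00E9
    # € = U+20AC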

1.1 Unicode equivalence

Apart from code points for characters, there are code points for parts of characters, like diacritical marks. This, however, results in an ambiguity: many characters in Unicode consist of, for example, a letter with a diacritical mark and can therefore be encoded in two different ways. The letter é can be encoded as the single character é, or as the character e followed by the combining character ´ for the acute accent. Such ambiguity is undesirable; we want a unique way of representing text. That is why Unicode also defines equivalence, of two different kinds.
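A quick way to see this ambiguity in practice is Python, which works with Unicode strings natively (a small illustrative sketch):

    precomposed = "\u00e9"  # é as one code point: LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

    print(precomposed, decomposed)            # both render as é
    print(precomposed == decomposed)          # False: different code point sequences
    print(len(precomposed), len(decomposed))  # 1 2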

Canonical equivalence

The sequences are just different ways of encoding exactly the same character. For example, the unit ångström has the symbol Å, whose code point is canonically equivalent to the code point of the Swedish letter Å and to the combination of the code points for the capital letter A and the diacritical mark ˚ for ring above (which is different from the degree sign °). Similarly, if you have a letter with two diacritical marks, it does not matter in which order the combining characters follow the main one; the results are equivalent.

Compatibility equivalence

If there are different ways of writing the same piece of information, like the fraction ½ and 1⁄2 (i.e. 1 fraction-slash 2), they are deemed equivalent. Similarly, the Planck constant ℎ maps onto the letter h, and circled numbers map onto numbers (e.g. ① maps onto 1). Sometimes Unicode has decided to include code points for characters with a specific style (e.g. cursive, bold, or alternative forms, like ſ, the old form of the letter s). These are also mapped onto the regular letter. This form of equivalence is useful, for example, when searching through a document.

Normalisation forms

If you want to have texts in a consistent form, you can do one of these:

  • Canonical Decomposition (Normalization Form D = NFD)

    Decompose all characters as much as possible and order combining characters in a specific way.

  • Canonical Decomposition Followed by Canonical Composition (NFC)

Just like NFD, but followed by a composition, where combining characters are composed with the main one. Note that this is not always possible, and some characters can only be represented in the decomposed form. The process is also not always completely reversible: the Ångström symbol gets decomposed and then composed to the letter Å of the Swedish alphabet, which is the letter used for the Ångström symbol. The glyph (graphical form) for these two is supposed to be exactly the same.

  • Compatibility Decomposition (NFKD)

    Decompose all characters as much as possible using both canonical and compatibility equivalence and order combining characters.

  • Compatibility Decomposition Followed by Canonical Composition (NFKC)

    Just like NFKD, but followed by a canonical composition.

Normalising to NFC (or to NFD) should give exactly the same string of bytes for any two texts that are supposed to display exactly the same glyphs. Normalising to NFKC (or NFKD) should then give the same string of bytes for any two texts containing the same information (broadly speaking).
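In Python, all four normalisation forms are available through the standard unicodedata module. A small sketch using the Ångström and fraction examples from above:

    import unicodedata

    angstrom = "\u212b"  # Å, the ANGSTROM SIGN code point
    for form in ("NFD", "NFC", "NFKD", "NFKC"):
        result = unicodedata.normalize(form, angstrom)
        print(form, [f"U+{ord(c):04X}" for c in result])
    # NFD and NFKD give ['U+0041', 'U+030A'] (A + combining ring above),
    # NFC and NFKC give ['U+00C5'] (the precomposed Swedish letter Å)

    print(unicodedata.normalize("NFKD", "\u00bd"))  # ½ becomes 1⁄2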

1.2 Compatibility characters

The main selling point of Unicode is its universality. However, it arrived into a world full of different encodings. To ease the adoption of Unicode, many code points were included only for compatibility reasons: if a character exists in some other encoding, you do not want to lose that information when you convert to Unicode and back. Ideally, anything written directly in Unicode should avoid these. They include things like ligatures (which should be a property of the font, not of the characters), precomposed Roman numerals (this Ⅻ is one character, and should not be) and even precomposed fractions like ¼, which, if handled correctly, should look identical to 1⁄4.

In general, there are an awful lot of characters with Unicode code points that should not be used at all if the document is created in Unicode, and the Unicode Consortium itself discourages the use of compatibility characters.
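Compatibility decomposition reveals what such characters are meant to stand for; a brief sketch in Python:

    import unicodedata

    print(unicodedata.normalize("NFKC", "\u216b"))  # the Roman numeral Ⅻ becomes XII
    print(unicodedata.normalize("NFKC", "ﬁ"))       # the fi ligature becomes plain fi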

1.3 Math fonts

Unicode provides an extensive set of mathematical symbols. In addition, there are also a few whole alphabets (bold, cursive, sans-serif, etc.). Unicode is supposed to encode only plain text without formatting, and while in normal text formatting does not change the meaning of words, in mathematics it does. An example is the velocity vector 𝒗 as opposed to the scalar speed 𝑣, which is its magnitude. Similarly, upright i and e represent the imaginary unit and Euler's number, while 𝑖 and 𝑒 would be an index variable and the elementary charge. To account for these important differences in meaning, which should not be lost e.g. when copying and pasting, Unicode decided to include code points for

  • A, 𝐀, 𝐴, 𝑨 = regular, bold, italic, and bold italic serif Latin alphabet¹
  • 𝒜, 𝓐 = regular and bold script (cursive, calligraphic) Latin alphabet
  • 𝔄, 𝕬 = regular and bold fraktur Latin alphabet
  • 𝔸 = double-struck Latin alphabet
  • 𝖠, 𝗔, 𝘈, 𝘼 = regular, bold, italic and bold italic sans-serif Latin alphabet
  • 𝙰 = monospace Latin alphabet
  • ω, 𝛚, 𝜔, 𝝎 = regular, bold, italic, and bold italic serif Greek alphabet¹
  • 𝞈, 𝟂 = bold and bold italic sans-serif Greek alphabet
  • 1, 𝟏 = regular and bold digits¹
  • 𝟙 = double-struck digits
  • 𝟣, 𝟭 = regular and bold sans-serif digits
  • 𝟷 = monospace digits

This means that a mathematical formula copied and pasted between sources should preserve its meaning. However, these characters should never be used as a substitute for normal formatting. They do not behave like normal letters and numbers (even though they compatibility-decompose to them). Remember, Unicode is all about plain text, with no formatting. Unfortunately, this is often abused. Similarly, there are superscript and subscript characters in Unicode, which should be used only in the International Phonetic Alphabet (IPA).
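Since these styled letters compatibility-decompose to the ordinary ones, normalisation is exactly what strips the styling, and with it the mathematical meaning. A short sketch:

    import unicodedata

    styled = "\U0001d400\U0001d49c\U0001d538"     # 𝐀 (bold), 𝒜 (script), 𝔸 (double-struck)
    print(unicodedata.normalize("NFKC", styled))  # AAA: the distinction is gone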

1.4 Han Unification (Unihan)

An interesting obstacle in the development of Unicode was the inclusion of Han characters. These are used in traditional and simplified Chinese, Japanese, Korean, and sometimes Vietnamese. Many of them are shared among these languages and scripts, so the Unicode Consortium decided that, just as with languages using the Latin alphabet, each character would be encoded only once. Unfortunately, the characters have developed differently in different countries. Where new or genuinely different characters (in terms of meaning) were created, they received separate code points. Where the same character merely acquired a different graphical representation, the same code point is used. Sometimes a variation sequence (i.e. an additional code point following the character's code point) can be used to distinguish these.

This has both advantages and disadvantages. It means that if you take a Japanese text and display it with a Chinese font, a Japanese reader could have problems reading some of the characters. On the other hand, if you quote Chinese in a Japanese text, you would probably write it using a Japanese font anyway. And you can simply copy and paste between Chinese and Japanese texts, displaying each text in a different font.

On websites, one can specify the language tag, which may lead to the correct font being selected (but may not, depending on browser settings).

2 Encoding

Once we have all the required characters assigned to numbers, the next question is: how do we store those numbers using zeros and ones in a computer? You might want to say: “What do you mean, how do we store them? Are not computers storing numbers all the time?” Of course they are, but it is never that easy.

2.1 UTF-32

The most straightforward way of storing Unicode text is called UTF-32. Any Unicode code point fits in four bytes (32 bits of memory), so you store each code point in four bytes and stack them one after another. Algorithmically, this is the easiest thing to do. However, it is quite wasteful in terms of memory. ASCII characters fit in one byte, and the characters of virtually all living languages (the basic multilingual plane) fit into two bytes. Therefore, you may not want to waste four bytes on every character when half of that will almost certainly do.
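You can see the cost with Python's built-in codecs (a small sketch; utf-32-be is the big-endian variant without a byte order mark):

    text = "Aé€🙂"  # four code points of very different sizes
    encoded = text.encode("utf-32-be")

    print(len(text), "code points ->", len(encoded), "bytes")  # 4 -> 16
    print(encoded.hex(" ", 4))  # 00000041 000000e9 000020ac 0001f642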

2.2 UTF-16

Since two bytes are enough for almost all code points, you could choose that as the basic unit of storage. But then you encounter a problem: what if a code point does not fit into two bytes? You have to tell the computer in which pair of bytes a code point starts and in which it continues. You do that by sacrificing the first 6 bits of each pair of bytes of such a code point. If you have a code point whose number in binary fits in two bytes:

xxxxxxxx  xxxxxxxx 

you encode it exactly like that. However, if you have a code point that does not fit into two bytes, you first subtract 10000 (hexadecimal) from its number, which leaves a 20-bit value:

xxxxxxxxxx yyyyyyyyyy

You encode it as:

110110xx xxxxxxxx 110111yy yyyyyyyy

These two units are called a surrogate pair. This works because the two-byte values beginning with 110110 or 110111 (U+D800 to U+DFFF) are among the code points that Unicode deliberately leaves unassigned.
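Here is a sketch of that transformation in Python, checked against the built-in UTF-16 codec (the helper name to_surrogate_pair is mine):

    def to_surrogate_pair(cp: int) -> tuple[int, int]:
        """Encode a code point above U+FFFF as a UTF-16 surrogate pair."""
        v = cp - 0x10000            # the remaining 20-bit value
        high = 0xD800 | (v >> 10)   # 110110xx xxxxxxxx
        low = 0xDC00 | (v & 0x3FF)  # 110111yy yyyyyyyy
        return high, low

    high, low = to_surrogate_pair(0x1F642)       # 🙂 SLIGHTLY SMILING FACE
    print(f"U+{high:04X} U+{low:04X}")           # U+D83D U+DE42
    print("🙂".encode("utf-16-be").hex(" ", 2))  # d83d de42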

2.3 UTF-8

The most common way of encoding Unicode is, however, UTF-8. It encodes all the characters it can in a single byte. If a character does not fit, it uses two bytes; once those are full, it uses three; and when even that is not sufficient, four. The first byte of each encoded code point therefore needs to say how many bytes the code point will occupy, and we also need to mark the bytes that do not begin a new code point. Based on the length of your code point in binary, you use one of the following patterns to encode it:

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
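As a sanity check, a hand-rolled encoder following these bit patterns reproduces what Python's built-in codec does (a sketch; the helper name utf8_encode is mine):

    def utf8_encode(cp: int) -> bytes:
        """Encode a single code point according to the patterns above."""
        if cp < 0x80:     # 0xxxxxxx
            return bytes([cp])
        if cp < 0x800:    # 110xxxxx 10xxxxxx
            return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
        if cp < 0x10000:  # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                      0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

    for ch in "Aé€🙂":  # 1-, 2-, 3- and 4-byte examples
        assert utf8_encode(ord(ch)) == ch.encode("utf-8")
        print(ch, utf8_encode(ord(ch)).hex(" "))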

This has many advantages.

  • ASCII compatibility

This was a huge advantage in the past, and it still matters. If you have a Latin-based text, it is very likely encoded in one of the ASCII-compatible encodings, so it will not be a complete mess even if you do not get the encoding exactly right.

  • Space efficiency

If your text is mostly or entirely ASCII, you use around one byte per character. If you use only code points up to U+07FF, you use at most two bytes per character. Then there is a region (up to U+FFFF) where UTF-16 would still use just two bytes while UTF-8 uses three, which makes it less efficient. However, in most cases today we are not very concerned with the size of plain text documents.

  • Endianness

In UTF-16 and UTF-32, you are asking the computer to store numbers that cannot fit in one byte. E.g.:

    110110xx xxxxxxxx | 110111yy yyyyyyyy
    

are two numbers, each of which occupies two bytes. Unfortunately, there are two ways in which computers deal with this, little-endian and big-endian, which store the bytes of a number in opposite orders. This means that, depending on the kind of system, you get a different binary string, and to decode it correctly you need to know which kind of system created it. Sometimes a byte order mark (BOM) is prepended to the text to help the computer decide. UTF-8 does not have these problems, by construction.
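    A quick demonstration with Python's codecs (note that the plain utf-16 codec prepends a BOM in the platform's byte order):

    text = "Ahoj"
    print(text.encode("utf-16-le").hex(" "))  # 41 00 68 00 6f 00 6a 00
    print(text.encode("utf-16-be").hex(" "))  # 00 41 00 68 00 6f 00 6a
    print(text.encode("utf-16").hex(" "))     # begins ff fe (or fe ff): the BOM
    print(text.encode("utf-8").hex(" "))      # 41 68 6f 6a, no byte order involved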

For these and other reasons, UTF-8 has become by far the most dominant encoding today. It has been the recommended encoding since HTML5, and over 95% of websites use it.

3 Typing Unicode characters into a computer

The first two steps are done. The computer can represent any thinkable character and store it in a binary file. However, our communication with computers is currently limited almost exclusively to what we can type on a keyboard or click on with a mouse. How can you possibly type the 143,859 different Unicode characters (plus many more combinations) into a computer? Most people type on a keyboard that is based on the typewriter and can usually type ASCII characters plus some accented characters depending on the country. For non-Latin alphabetic languages, keyboards typically offer ASCII complemented with the other alphabet. There are a few solutions:

  • Keyboard layouts and modifier keys

I constantly switch between the English and Czech keyboard layouts. For coding, the English layout gives the best access to the required characters, but if I want to write in Czech with all its accents, I must use the Czech layout. In addition, you can access some characters using the Shift, Control, and Alt modifier keys. But even with only two layouts, I already have trouble typing non-alphanumeric characters. Thus, this is not a viable solution if you want to cover more than a few languages and common symbols.

  • Character palettes

In many operating systems and text editors, you can open a palette of Unicode characters, click and scroll through the collection (perhaps even search by character name) until you find the desired character, and then copy and paste it. This can easily cover all of Unicode, but looking up characters is typically quite slow, even when they are intelligently grouped. Searching for the character online and copying it is essentially the same approach. Useful websites for this are:

    • Compart Unicode website

A more technical, but also more informative, table of Unicode characters.

    • Unicode Character Table

More informal, and it encourages abuse of Unicode (e.g. using characters incorrectly just to achieve “fancy” effects, or encouraging the use of compatibility characters).

  • Typing in the code point number

The method of last resort is typing in the number of the respective code point directly. Most operating systems and/or text editors have a key shortcut that can be followed by the number of the code point (in hexadecimal or decimal). The obvious problem is that one would have to remember the code point numbers of all the characters one wishes to use.

  • Macros

You can define macros that consist only of characters you can directly type on a keyboard (probably ASCII) but are easy to remember. (La)TeX went down this path, and it is perhaps one of the reasons why it is so popular with people who need to type equations. For example, to type the integral sign ∫ in LaTeX, you type \int; for the summation sign ∑, you type \sum. Most of these are quite intuitive, and you can type them as easily as you type words. This, however, means that whatever you type needs to be post-processed by the computer to create the document. It also means that what you type is not as legible as if the symbols were in place from the very beginning. Nevertheless, due to its simplicity and ease of input, LaTeX is widely used in mathematics and the physical sciences, and its equation syntax can be used even in editors like Microsoft Word and Apple Pages.

  • Transcription + palette

For Asian scripts with many characters, there are typically standard ways of transcribing them into the Latin alphabet, which users can type on a keyboard. Since the transcription may not be lossless, the user may be offered a selection of matching characters. This input system is halfway between macros and palettes: each character is assigned a Latin character string, as with macros, but the Unicode character is then inserted immediately into the document upon selection from a palette.

4 Moral of the story

It is one of the great victories of humanity (alongside e.g. the SI unit system, International Atomic Time, and the internet) to have a universal code point set for all the characters in the world. It means that if you give me a sequence of zeros and ones, I will almost certainly be able to decode it and display it correctly without further knowledge. The take-home messages are:

  • Use UTF-8

By default, in any document or piece of text that you create or work with, use Unicode with the UTF-8 encoding. Yes, there are occasions when it is not the best encoding to use, but if you ever need to do such specialist work, you will know what to do (or seek help). This rule is so clear-cut that it is surprising how many things still use other encodings where UTF-8 would work well.

  • Use only the Unicode characters that you actually need

A standard is worthless if nobody uses it. For this reason, Unicode added loads of compatibility characters, to be compatible with existing encodings. At the same time, the Unicode Consortium discourages their use in newly created texts. Therefore, be careful when copy-pasting text from around the web. For basically every symbol, you can find obscure versions that exist only because they were encoded somewhere else: subscripts, superscripts, half- and full-width forms, parenthesised, encircled, and many other forms. As an example, here are all the question marks found in Unicode: ?,❓,❔,﹖,?,︖. All but the first one are obscure or compatibility characters. Using them just risks that the font will not support them, and all you end up with is an empty square instead of your character. All styling of text should be done using your editor or a markup language, so do not do things like ᏗᎴᏗᎷ ᎮᏒᏗᎴᏗ or 𝒜𝒹𝒶𝓂 𝒫𝓇𝒶𝒹𝒶, which should be achieved by selecting an appropriate font for normal characters.
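    If you are unsure what a suspicious character actually is, Python's unicodedata module can name it (a small sketch):

    import unicodedata

    for ch in "?❓﹖":
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # U+003F QUESTION MARK
    # U+2753 BLACK QUESTION MARK ORNAMENT
    # U+FE56 SMALL QUESTION MARK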

Appendix A: The LaTeX encoding hell

A good example of why Unicode (ideally UTF-8) should be adopted is how LaTeX handles encodings. LaTeX is a document typesetting system that I lovingly use all the time. It is based on an older, but more general, typesetting system called TeX. The way LaTeX works is that you have a text file with the contents of your document plus some additional code that tells the computer which parts are headings, sections, etc. I will not go into details, but the system as a whole has many advantages, which is why people use it. However, TeX was created in 1978 and LaTeX followed in 1984. Unicode originated only in 1991, and it was not immediately clear that it would become the world standard. And it was only in 1993 that PDF was released. This should give you an idea of how ancient TeX and LaTeX are.

Thus, in those prehistoric days of computing, a powerful typesetting program was born. Only 8-bit (and at the very beginning only 7-bit) encodings were used. Computational power and storage were very precious. Under these circumstances, it made perfect sense to do things the way they were done.

A.1 LaTeX workflow

LaTeX first reads an input file. To do this correctly, it needs to know what encoding was used to create it. (This part was easily adapted to UTF-8.) Then it translates all characters into LICR (the LaTeX Internal Character Representation), where every character can be described with ASCII characters; e.g. ř becomes \v r. Then LaTeX does all its magic to lay out the characters and images on the page. Next, however, it must encode each character so that the computer finds the correct glyph in the selected font. In the days of 8-bit encodings, only 256 different characters could be encoded within one encoding (and in reality even fewer). This has significant consequences.

Firstly, if your document contains characters that do not fit in one encoding, you have to change the encoding halfway through your input file, possibly many times back and forth. This poses no problem for LaTeX, but it does for the user, because in the source file you have to intentionally type the wrong characters, which will map onto the correct ones in the other encoding. In the final document, however, everything will be fine. This was the reason for LICR: once everything was translated to LICR, the developers did not have to care about the various encodings users may have used, and they could make everything as simple and portable as possible.

Secondly, a font defines a glyph for each encoded character. Then LaTeX chooses the output encoding, and for each character it looks up the appropriate glyph and places it on the page. However, if you can encode only that many characters in a single encoding and each font maps glyphs onto a given encoding, you need many separate fonts to cover a reasonable set of characters.

In addition, even if two different fonts have the same glyph, if they are saved using different encodings, the same glyph may be in different positions in each font, which makes changing fonts difficult.

A.2 Piling things up

TeX was created by Donald Knuth, a computer scientist and mathematician, to typeset his own books. All he needed was the Latin alphabet, maybe a few accents, and mathematical symbols. So he created a set of fonts and encodings covering his needs: one for normal text and a few for mathematics. TeX would know when to change from one encoding to another. (Knuth actually created a programming language called METAFONT in order to produce his fonts. It is amazing how many of the things we use today were created in the old days single-handedly by programmers for their own needs. Other examples include Linux and Git, created by Linus Torvalds.)

As TeX and later LaTeX got more popular, people started needing more characters. So they were added to LICR, and more and more encodings were invented. And whenever you wanted to change the font of your document, you needed a family of fonts covering all the encodings that you used. If one were designing a similar system today, all input could be done in Unicode, LICR could also be Unicode (perhaps an extension of it), and the output would definitely be Unicode, since all modern fonts use it. How simple that would be.

The original system is actually still partially in place today. The reason is that in community projects like TeX and LaTeX, compatibility is crucial. A company making software for businesses and institutions can more easily force its customers to switch to a new version. In a community project, where a few core developers (still working in their free time, even today!) maintain the core and the rest is contributed by users, a new version that is not quickly adopted by the majority of users dooms the project. That is why TeX and LaTeX evolve extremely slowly and carefully; there can be more than 10 or even 20 years between major versions. This stability allows other projects to build on top of them.

A.3 Unicode and LaTeX

Luckily, XeTeX and LuaTeX (and XeLaTeX and LuaLaTeX) are modern versions of TeX and LaTeX that support Unicode and Unicode-based fonts (OpenType and TrueType). However, they have not (yet) superseded the originals, because many people depend on LaTeX, and XeTeX and LuaTeX are not fully compatible with it. This led to a dichotomy, where not all packages available for LaTeX are available for the others, but it is getting better with time. Today, you can quite safely use basically any font and any Unicode character with these Unicode-aware versions of LaTeX.

Appendix B: Playing with Unicode

To explore Unicode characters, I wrote a simple Python script that takes a string of characters as an argument and applies all the normalisations to each character of the string. Feel free to download it here and give it a try. Below is sample output for the string “ř dž Å ℓ ℏ ¼ ℃ Ω K 鰳”.
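For the curious, here is a minimal sketch of what the core of such a script can look like, using the standard unicodedata module:

    import sys
    import unicodedata

    def describe(s: str) -> None:
        """Print the basic Unicode properties of every code point in s."""
        for ch in s:
            print(f"  Character:          {ch}")
            print(f"  Code point:         U+{ord(ch):04X} ({ord(ch)})")
            print(f"  Name:               {unicodedata.name(ch, 'UNKNOWN')}")
            print(f"  Category:           {unicodedata.category(ch)}")
            print(f"  Combining property: {unicodedata.combining(ch)}")

    for ch in sys.argv[1]:
        print("*" * 60)
        print("Original character")
        describe(ch)
        for form in ("NFD", "NFC", "NFKD", "NFKC"):
            print(form)
            normalized = unicodedata.normalize(form, ch)
            if normalized == ch:
                print("  Identical to the original code point")
            else:
                describe(normalized)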

*******************************************************************
*******************************************************************
Original character
----------------------------------------------------------
                              Character: ř
                          Code point in  
                            hexadecimal: 0x159
                                decimal: 345
                                   Name: LATIN SMALL LETTER R WITH CARON
                               Category: Ll
                      Combining property 0

NFD (Canonical decomposition)
----------------------------------------------------------
                              Character: r
                          Code point in  
                            hexadecimal: 0x72
                                decimal: 114
                                   Name: LATIN SMALL LETTER R
                               Category: Ll
                      Combining property 0

                              Character:  ̌
                          Code point in  
                            hexadecimal: 0x30c
                                decimal: 780
                                   Name: COMBINING CARON
                               Category: Mn
                      Combining property 230


NFC (Canonical decomposition followed by composition)
----------------------------------------------------------
    Identical to the original code point 

NFKD (Compatibility decomposition)
----------------------------------------------------------
                        Identical to NFD 

NFKC (Compatibility decomposition followed by composition)
----------------------------------------------------------
    Identical to the original code point 

*******************************************************************
*******************************************************************
Original character
----------------------------------------------------------
                              Character: dž
                          Code point in  
                            hexadecimal: 0x1c6
                                decimal: 454
                                   Name: LATIN SMALL LETTER DZ WITH CARON
                               Category: Ll
                      Combining property 0

NFD (Canonical decomposition)
----------------------------------------------------------
    Identical to the original code point 

NFC (Canonical decomposition followed by composition)
----------------------------------------------------------
    Identical to the original code point 

NFKD (Compatibility decomposition)
----------------------------------------------------------
                              Character: d
                          Code point in  
                            hexadecimal: 0x64
                                decimal: 100
                                   Name: LATIN SMALL LETTER D
                               Category: Ll
                      Combining property 0

                              Character: z
                          Code point in  
                            hexadecimal: 0x7a
                                decimal: 122
                                   Name: LATIN SMALL LETTER Z
                               Category: Ll
                      Combining property 0

                              Character:  ̌
                          Code point in  
                            hexadecimal: 0x30c
                                decimal: 780
                                   Name: COMBINING CARON
                               Category: Mn
                      Combining property 230


NFKC (Compatibility decomposition followed by composition)
----------------------------------------------------------
                              Character: dž
                          Code point in  
                            hexadecimal: 0x1c6
                                decimal: 454
                                   Name: LATIN SMALL LETTER DZ WITH CARON
                               Category: Ll
                      Combining property 0


*******************************************************************
*******************************************************************
Original character
----------------------------------------------------------
                              Character: Å
                          Code point in  
                            hexadecimal: 0x212b
                                decimal: 8491
                                   Name: ANGSTROM SIGN
                               Category: Lu
                      Combining property 0

NFD (Canonical decomposition)
----------------------------------------------------------
                              Character: A
                          Code point in  
                            hexadecimal: 0x41
                                decimal: 65
                                   Name: LATIN CAPITAL LETTER A
                               Category: Lu
                      Combining property 0

                              Character:  ̊
                          Code point in  
                            hexadecimal: 0x30a
                                decimal: 778
                                   Name: COMBINING RING ABOVE
                               Category: Mn
                      Combining property 230


NFC (Canonical decomposition followed by composition)
----------------------------------------------------------
                              Character: Å
                          Code point in  
                            hexadecimal: 0xc5
                                decimal: 197
                                   Name: LATIN CAPITAL LETTER A WITH RING ABOVE
                               Category: Lu
                      Combining property 0


NFKD (Compatibility decomposition)
----------------------------------------------------------
                        Identical to NFD 

NFKC (Compatibility decomposition followed by composition)
----------------------------------------------------------
                        Identical to NFC 

*******************************************************************
*******************************************************************
Original character
----------------------------------------------------------
                              Character: ℓ
                          Code point in  
                            hexadecimal: 0x2113
                                decimal: 8467
                                   Name: SCRIPT SMALL L
                               Category: Ll
                      Combining property 0

NFD (Canonical decomposition)
----------------------------------------------------------
    Identical to the original code point 

NFC (Canonical decomposition followed by composition)
----------------------------------------------------------
    Identical to the original code point 

NFKD (Compatibility decomposition)
----------------------------------------------------------
                              Character: l
                          Code point in  
                            hexadecimal: 0x6c
                                decimal: 108
                                   Name: LATIN SMALL LETTER L
                               Category: Ll
                      Combining property 0


NFKC (Compatibility decomposition followed by composition)
----------------------------------------------------------
                       Identical to NFKD 

*******************************************************************
*******************************************************************
Original character
----------------------------------------------------------
                              Character: ℏ
                          Code point in  
                            hexadecimal: 0x210f
                                decimal: 8463
                                   Name: PLANCK CONSTANT OVER TWO PI
                               Category: Ll
                      Combining property 0

NFD (Canonical decomposition)
----------------------------------------------------------
    Identical to the original code point 

NFC (Canonical decomposition followed by composition)
----------------------------------------------------------
    Identical to the original code point 

NFKD (Compatibility decomposition)
----------------------------------------------------------
                              Character: ħ
                          Code point in  
                            hexadecimal: 0x127
                                decimal: 295
                                   Name: LATIN SMALL LETTER H WITH STROKE
                               Category: Ll
                      Combining property 0


NFKC (Compatibility decomposition followed by composition)
----------------------------------------------------------
                       Identical to NFKD 

*******************************************************************
*******************************************************************
Original character
----------------------------------------------------------
                              Character: ¼
                          Code point in  
                            hexadecimal: 0xbc
                                decimal: 188
                                   Name: VULGAR FRACTION ONE QUARTER
                               Category: No
                      Combining property 0

NFD (Canonical decomposition)
----------------------------------------------------------
    Identical to the original code point 

NFC (Canonical decomposition followed by composition)
----------------------------------------------------------
    Identical to the original code point 

NFKD (Compatibility decomposition)
----------------------------------------------------------
                              Character: 1
                          Code point in  
                            hexadecimal: 0x31
                                decimal: 49
                                   Name: DIGIT ONE
                               Category: Nd
                      Combining property 0

                              Character: ⁄
                          Code point in  
                            hexadecimal: 0x2044
                                decimal: 8260
                                   Name: FRACTION SLASH
                               Category: Sm
                      Combining property 0

                              Character: 4
                          Code point in  
                            hexadecimal: 0x34
                                decimal: 52
                                   Name: DIGIT FOUR
                               Category: Nd
                      Combining property 0


NFKC (Compatibility decomposition followed by composition)
----------------------------------------------------------
                       Identical to NFKD 

*******************************************************************
*******************************************************************
Original character
----------------------------------------------------------
                              Character: ℃
                          Code point in  
                            hexadecimal: 0x2103
                                decimal: 8451
                                   Name: DEGREE CELSIUS
                               Category: So
                      Combining property 0

NFD (Canonical decomposition)
----------------------------------------------------------
    Identical to the original code point 

NFC (Canonical decomposition followed by composition)
----------------------------------------------------------
    Identical to the original code point 

NFKD (Compatibility decomposition)
----------------------------------------------------------
                              Character: °
                          Code point in  
                            hexadecimal: 0xb0
                                decimal: 176
                                   Name: DEGREE SIGN
                               Category: So
                      Combining property 0

                              Character: C
                          Code point in  
                            hexadecimal: 0x43
                                decimal: 67
                                   Name: LATIN CAPITAL LETTER C
                               Category: Lu
                      Combining property 0


NFKC (Compatibility decomposition followed by composition)
----------------------------------------------------------
                       Identical to NFKD 

*******************************************************************
*******************************************************************
Original character
----------------------------------------------------------
                              Character: Ω
                          Code point in  
                            hexadecimal: 0x2126
                                decimal: 8486
                                   Name: OHM SIGN
                               Category: Lu
                      Combining property 0

NFD (Canonical decomposition)
----------------------------------------------------------
                              Character: Ω
                          Code point in  
                            hexadecimal: 0x3a9
                                decimal: 937
                                   Name: GREEK CAPITAL LETTER OMEGA
                               Category: Lu
                      Combining property 0


NFC (Canonical decomposition followed by composition)
----------------------------------------------------------
                        Identical to NFD 

NFKD (Compatibility decomposition)
----------------------------------------------------------
                        Identical to NFD 

NFKC (Compatibility decomposition followed by composition)
----------------------------------------------------------
                        Identical to NFD 

*******************************************************************
*******************************************************************
Original character
----------------------------------------------------------
                              Character: K
                          Code point in  
                            hexadecimal: 0x212a
                                decimal: 8490
                                   Name: KELVIN SIGN
                               Category: Lu
                      Combining property 0

NFD (Canonical decomposition)
----------------------------------------------------------
                              Character: K
                          Code point in  
                            hexadecimal: 0x4b
                                decimal: 75
                                   Name: LATIN CAPITAL LETTER K
                               Category: Lu
                      Combining property 0


NFC (Canonical decomposition followed by composition)
----------------------------------------------------------
                        Identical to NFD 

NFKD (Compatibility decomposition)
----------------------------------------------------------
                        Identical to NFD 

NFKC (Compatibility decomposition followed by composition)
----------------------------------------------------------
                        Identical to NFD 

*******************************************************************
*******************************************************************
Original character
----------------------------------------------------------
                              Character: 鰳
                          Code point in  
                            hexadecimal: 0x9c33
                                decimal: 39987
                                   Name: CJK UNIFIED IDEOGRAPH-9C33
                               Category: Lo
                      Combining property 0

NFD (Canonical decomposition)
----------------------------------------------------------
    Identical to the original code point 

NFC (Canonical decomposition followed by composition)
----------------------------------------------------------
    Identical to the original code point 

NFKD (Compatibility decomposition)
----------------------------------------------------------
    Identical to the original code point 

NFKC (Compatibility decomposition followed by composition)
----------------------------------------------------------
    Identical to the original code point 

Footnotes

  1. For regular upright serif characters, the normal ASCII code points are used. For regular Greek, the ordinary code points for Greek text are used. This means that a serif font must be used by default to display these properly. Since this page is typeset in a sans-serif font, I had to manually specify the serif font family for the upright serif Latin alphabet and regular serif digits.