[An encoded] string only makes sense when we know what encoding it uses; otherwise, we cannot interpret it correctly.

First, what does it mean to encode or decode a string? Encoding converts the abstraction into bytes; decoding converts the bytes back into the abstraction. Usually we see the bytes as an array of numbers in base 10 or hex, where each number represents one underlying byte. A string is an abstraction that lives at a higher level than the zeros and ones. For instance, we might have a JavaScript string, so that when we are working in JavaScript we can ignore the fact that JavaScript stores the string in memory as zeros and ones. That's what makes abstraction useful: it hides the context-specific details so we need not worry about them. When we encode a string, we surface those underlying details: running encode gets us the zeros and ones that represent the string in memory. Running decode on those zeros and ones gets us the abstraction back for easier use at a higher level, such as JavaScript programming.
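
To make that concrete, here's a minimal sketch of the round trip in JavaScript using the standard TextEncoder/TextDecoder APIs (TextEncoder always produces UTF-8; the specific string is just an example):

```js
const encoder = new TextEncoder();            // encodes a JS string to UTF-8 bytes
const decoder = new TextDecoder("utf-8");     // decodes UTF-8 bytes back to a JS string

const text = "hi";                            // the abstraction: a JavaScript string
const bytes = encoder.encode(text);           // the underlying bytes
console.log(bytes);                           // Uint8Array [ 104, 105 ]

const roundTripped = decoder.decode(bytes);   // back to the abstraction
console.log(roundTripped === text);           // true
```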

What follows are notes on Joel Spolsky's blog post on character encodings.

ASCII represents "every character" using a number between 32 and 127 and uses the rest of a byte's range (8 bits give 0 to 255) for control characters (below 32) and "special" characters (128 and above).

The ANSI standard standardized the characters assigned to the numbers 0 to 127 and created "code pages" that specified different ways to handle the numbers from 128 to 255.

At the point in history when ASCII ruled, a character was generally considered to be a byte. In other words, each ASCII character maps to some 8-bit number.

character -> number between 0 and 255 -> 8 bits in memory

e.g. 

A -> 65 -> 0100 0001
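
We can still see that old-style mapping from JavaScript; a quick sketch (charCodeAt returns the numeric code for the character, and toString(2) shows its bits):

```js
const ch = "A";
const code = ch.charCodeAt(0);                  // 65
const bits = code.toString(2).padStart(8, "0"); // "01000001"
console.log(code, bits);
```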

Enter Unicode where things go like this instead:

character -> code point -> some bits in memory

e.g. 

A -> U+0041 -> ...
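
A small sketch of the character-to-code-point step in JavaScript; the bytes-in-memory step is deliberately left out here, since it depends on the encoding, as discussed next:

```js
// codePointAt gives the Unicode code point as a number; format it as U+XXXX.
const toCodePoint = (ch) =>
  "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");

console.log(toCodePoint("A")); // "U+0041"
console.log(toCodePoint("€")); // "U+20AC" -- works beyond ASCII too
```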

Those "some bits" in Unicode depend on the Unicode encoding. There are hundreds of Unicode encodings. Here are three:

  • UCS-2 / UTF-16 high-endian
  • UCS-2 / UTF-16 low-endian
  • UTF-8

Those encodings and others like them can store any Unicode code point. When some other encoding (e.g. ASCII) cannot correctly represent a Unicode code point, we instead see question marks, boxes, or the replacement character (�).
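
Relatedly, when a decoder meets bytes that are not valid in the declared encoding, it substitutes the replacement character U+FFFD, which is where that "�" comes from. A quick sketch (the byte values here are just an example):

```js
// 0xFF can never appear in valid UTF-8, so the decoder substitutes U+FFFD for it.
const badBytes = new Uint8Array([0x68, 0x69, 0xff]); // "hi" plus an invalid byte
console.log(new TextDecoder("utf-8").decode(badBytes)); // "hi�"
```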

How does UTF-8 encode more than 256 characters when it only has 8 bits to work with? Well, UTF-8 uses more than 8 bits when it needs to. Joel writes: "In UTF-8, every code point from 0-127 is stored in a single byte [and] code points 128 and above are stored using 2, 3, in fact, up to 6 bytes." That allows UTF-8 to be compatible with the first 128 ASCII characters.
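
A short sketch of that variable-width behavior (TextEncoder always produces UTF-8; the characters are just examples):

```js
const enc = new TextEncoder();
console.log(enc.encode("A"));  // Uint8Array [ 65 ]                 -> 1 byte, same as ASCII
console.log(enc.encode("é"));  // Uint8Array [ 195, 169 ]           -> 2 bytes
console.log(enc.encode("€"));  // Uint8Array [ 226, 130, 172 ]      -> 3 bytes
console.log(enc.encode("😀")); // Uint8Array [ 240, 159, 152, 128 ] -> 4 bytes
```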

Some Other Useful References