Richard Suchenwirth 2001-02-28 - After a delightful debugging chat in the Tcl chatroom, I was moved to write down how I think about UTF-8 analysis (cf. Unicode and UTF-8; see also UTF-8 history).
I imagine a UTF-8 string as a railroad. It operates single-unit railcars (one-byte ASCII characters, recognizable by their highest bit being 0) and trains (sequences of two or more bytes that together form a character). Each train consists of exactly one locomotive (you see I'm European) and one or more trailers. The locomotive indicates the length of the train, including itself, by its highest bits: a consecutive row of 1's, one per byte in the train, followed by a single 0 bit. Examples:
0xxxxxxx : I'm a railcar, just a single unit
110xxxxx : I'm leading a train of length 2
The xxxxx bits are used for other purposes (locomotives carry some freight too ;-)
Trailers indicate that they are trailers by the initial bit sequence 10. This way, they can't be mistaken for railcars or locomotives. E.g.
10yyyyyy : I'm a trailer
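The three kinds of bytes can be told apart with two bit masks. The following is a minimal Python sketch of that test; the names classify and train_length are made up for this illustration, and the code does not check whether a locomotive announces a length that real UTF-8 allows.

```python
def classify(byte: int) -> str:
    """Name the role of one byte, using the railroad terms from above."""
    if byte & 0b10000000 == 0:
        return "railcar"       # 0xxxxxxx: a single ASCII byte
    if byte & 0b11000000 == 0b10000000:
        return "trailer"       # 10yyyyyy: follows a locomotive
    return "locomotive"        # 11......: leads a train


def train_length(locomotive: int) -> int:
    """Count the leading 1 bits of a locomotive byte."""
    length = 0
    mask = 0b10000000
    while locomotive & mask:
        length += 1
        mask >>= 1
    return length


for b in (0x41, 0xC3, 0xA4):
    print(f"{b:08b} {classify(b)}")
```

For the byte C3 (11000011), train_length counts two leading 1 bits, so it leads a train of length 2.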
The freight of the train is in the x's and y's. In the concrete case from the chat, a C program reported having received the bytes C3 and A4. Written in binary, that's
11000011 10100100
Now for clarity we delimit the indicators with parens:
(110)00011 (10)100100
and can just remove them:
00011 100100
Joined together, that is 00011100100; dropping the leading zeros leaves
11100100 => E4, the iso8859-1 value for German ä ("a umlaut").
The generalized rule for the indicator of each byte is "those bits from the highest (leftmost) down, up to and including the first zero bit".
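The generalized rule translates fairly directly into code. Here is a hedged Python sketch that strips the indicator from each byte and couples the remaining freight bits together; freight and decode_train are hypothetical names, and the sketch assumes it is handed exactly one well-formed railcar or train, without validating it.

```python
def freight(byte: int) -> tuple:
    """Return (freight value, number of freight bits) for one byte.

    The indicator is every bit from the top down to and including the
    first 0 bit; the freight is whatever remains below it.
    """
    width = 7
    while width >= 0 and byte & (1 << width):
        width -= 1             # skip the leading 1 bits
    if width < 0:
        raise ValueError("byte FF has no 0 bit, hence no indicator")
    # width is now the position of the first 0 bit, which equals the
    # number of freight bits below it.
    return byte & ((1 << width) - 1), width


def decode_train(train: bytes) -> int:
    """Join the freight of all bytes into one code point."""
    value = 0
    for byte in train:
        payload, width = freight(byte)
        value = (value << width) | payload
    return value


print(f"{decode_train(bytes([0xC3, 0xA4])):X}")  # prints E4
```

The same function handles a lone railcar, since a byte like 41 has a one-bit indicator (the leading 0) and seven bits of freight.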
Now going the other way. In orthodox UTF-8, a NUL byte (\x00) is represented by a NUL byte. Plain enough. But in Tcl we sometimes want NUL bytes inside "binary" strings (e.g. image data) without them terminating the string, as a real NUL byte does. To represent a NUL byte without any physical NUL bytes, we treat it like a character above ASCII, which must be at least two bytes long:
(110)00000 (10)000000 => C0 80
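Building such a two-byte train is the reverse of stripping the indicators: split the value into its upper five and lower six bits and put the indicators back on. A small Python sketch, with two_byte_train as a name invented for this example; it illustrates the bit layout only and is not how Tcl itself is implemented.

```python
def two_byte_train(code_point: int) -> bytes:
    """Pack a value of at most 11 bits into a locomotive and one trailer."""
    if not 0 <= code_point < 0x800:
        raise ValueError("value does not fit into 5 + 6 freight bits")
    locomotive = 0b11000000 | (code_point >> 6)        # (110)xxxxx
    trailer = 0b10000000 | (code_point & 0b00111111)   # (10)yyyyyy
    return bytes([locomotive, trailer])


print(two_byte_train(0x00).hex())  # prints c080
print(two_byte_train(0xE4).hex())  # prints c3a4
```

The second call reproduces the C3 A4 pair from the ä example, which is a quick check that the packing matches the decoding done by hand above.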
Whoops. Took us a while, but now we can read UTF-8, bit by bit.
andrewsh 2010-03-12 - Please note that the 0xc0 0x80 sequence is illegal in the "Real" UTF-8: [1], [2]
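A strict decoder shows the difference. As an illustration only, using Python's built-in UTF-8 codec rather than anything from Tcl, the helper below (is_strict_utf8 is a made-up name) reports whether a byte string is accepted:

```python
def is_strict_utf8(data: bytes) -> bool:
    """Report whether Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_strict_utf8(b"\x00"))      # a plain NUL byte: True
print(is_strict_utf8(b"\xc3\xa4"))  # the two-byte ä: True
print(is_strict_utf8(b"\xc0\x80"))  # the two-byte NUL: False
```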