UniCipher
2023-01-03
The UTF-8 Unicode standard is brilliant for at least two reasons.
First, it uses variable-length encoding. With a regular fixed-length encoding, every character takes up the same N number of bytes. But with variable-length, we use only one or two bytes for "common" characters, and more bytes for less common characters (common in this case is admittedly us/euro-centric). Thus, a string will likely be full of common characters taking up less space than they "normally" would, while only having a few uncommon characters taking up more space than they "normally" would. In practice, this means that UTF-8 strings use less space than they would if every character required the same number of bytes. The algorithm for doing this is actually pretty straightforward and easy to understand.
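To make that concrete, here's a tiny Rust snippet (Rust is also the language I use later in the post) that asks each character how many bytes its UTF-8 encoding needs:

```rust
fn main() {
    // Each char knows how many bytes its UTF-8 encoding takes:
    // 'a' is 1, 'é' is 2, '€' is 3, and the hieroglyph is 4.
    for c in ['a', 'é', '€', '𓀀'] {
        println!("{c}: {} byte(s)", c.len_utf8());
    }

    // An ASCII-heavy string stays compact even with a few "wide" characters mixed in.
    let s = "café costs 5€";
    println!("{} chars, {} bytes", s.chars().count(), s.len());
}
```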
The second brilliant thing about UTF-8 is that it's backwards compatible with the ASCII encoding. In other words, any file that is valid ASCII is also valid UTF-8. ASCII was the dominant (though not only) string encoding scheme at the time UTF-8 arrived on the scene. It seems reasonable to argue that this backwards compatibility was a large contributor to UTF-8's success.
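That claim is easy to check in Rust: hand it pure-ASCII bytes and they're already valid UTF-8, byte for byte.

```rust
fn main() {
    // A pure-ASCII byte string is already valid UTF-8, unchanged.
    let ascii_bytes: &[u8] = b"plain old ASCII";
    let as_utf8 = std::str::from_utf8(ascii_bytes).expect("ASCII is always valid UTF-8");
    assert_eq!(as_utf8.as_bytes(), ascii_bytes);
    println!("{as_utf8}");
}
```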
In this post, we're going to take advantage of (abuse?) the former with a nod to the latter. But first, a quick digression.
It came to me in a dream
The other night I had what can only be described as the weirdest and most technically specific dream I've ever had. In the part that I remember, I was viewing some text written in Egyptian hieroglyphs. However, I "realized" that hidden within the text was actually a secret message. I intuited (the way one does in a dream) that the text was UTF-8 encoded, hence the hieroglyphs. But I also intuited that the actual message was composed of just ASCII characters. If I looked at the bytes backing each hieroglyph, threw out the bits used for specifying the character byte length, and interpreted what was left as two ASCII characters, that would give me the real message.
THIS WAS SUPER WEIRD. I can't even begin to imagine how such a dream came to me. The algorithm in particular was so specific. And who has dreams about UTF-8? In the past, I've woken up with the solution to a hard problem I'd been thinking about before bed (this is actually a pretty common phenomenon). But I've never woken up with the answer to a question I didn't ask, where the specifics were laid out explicitly in dream-form.
The best explanation I can come up with is:
- I've been learning Rust recently. It makes a distinction between language-native strings (always UTF-8) and "OsStrings" (encoded in whatever is native to the operating system you're running on). The programmer is forced to think about converting between the two when dealing with/manipulating file names, but not about the details of the encodings.
- A few weeks ago I was setting up Nerd Fonts on my new laptop and I came across a blog post about installing them for Kitty. The post mentioned hieroglyphs.
I feel like that can hardly explain such a specific dream, though. Honestly, I'm a little spooked by the whole thing. Nevertheless, it came to me in a dream, so how can I say no? Let's code it up.
Writing a cipher in Rust
In my dream, each hieroglyph was a two-byte UTF-8 character and therefore I could fit two ASCII characters in it. In my waking state I realized that's not quite right. ASCII characters are 7 bits. I had thought, in my dream, that each byte in a UTF-8 character only used one bit for dealing with variable-length encoding. So two bytes should have given me 14 bits to work with.
This is actually incorrect. In UTF-8, it is true that a 1-byte character needs only one bit to indicate that its length is 1 byte. But as we start needing more bytes, we lose more bits to the overhead of tracking the character's byte length. For a two-byte UTF-8 character, the first three bits of the first byte and the first two bits of the second byte are needed for the length encoding scheme. This leaves us with only 11 bits, not enough to encode two ASCII characters. We actually need to go to three-byte characters, giving us 16 total bits to play with. This table from Wikipedia is really handy for visualizing all this. Basically, the x's are bits we can use for our own purposes.
| First Code Point | Last Code Point | Byte 0 | Byte 1 | Byte 2 | Byte 3 | Total Code Points |
|---|---|---|---|---|---|---|
| U+0000 | U+007F | 0xxxxxxx | | | | 128 |
| U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | | 1920 |
| U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | | 61440 |
| U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 1048576 |
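You can see the scheme directly in the bytes Rust produces. For example, '€' is U+20AC, which needs the three-byte form from the table:

```rust
fn main() {
    // '€' is U+20AC, a three-byte UTF-8 character.
    let bytes = "€".as_bytes();
    // Prints [e2, 82, ac], i.e. 11100010 10000010 10101100:
    // the 1110/10/10 prefixes are the length overhead, and the
    // remaining x bits spell out the code point 0x20AC.
    println!("{bytes:02x?}");
}
```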
With this knowledge in hand, we can create an algorithm for encoding two ASCII characters into a single UTF-8 character.
- Take our two ASCII characters, call them C0 and C1. We put the most significant bit of each in the first byte of our output UTF-8 character.
- We then put the lower 6 bits of C0 in the second byte of our UTF-8 character.
- Lastly, we'll put the lower 6 bits of C1 in the third byte of our UTF-8 character.
To give a concrete example, if 'A' and 'B' are the most significant bits of C0 and C1 respectively, and 'a' and 'b' are the lower six bits of C0 and C1 respectively, then we have two ASCII characters that look like:
C0: 0Aaaaaaa
C1: 0Bbbbbbb
We then proceed to encode them in a 3-byte UTF-8 character thusly:
UTF-8 Character: 111000AB 10aaaaaa 10bbbbbb
The core of this algorithm turns out to be fairly straightforward. It's just some pretty simple bit manipulation. And decoding is just a matter of reversing the process. We end up using one more bit to indicate if there's actually only one ASCII character encoded. This is needed for the case where we're encoding an ASCII string that has an odd number of characters.
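Here's a minimal Rust sketch of the bit-packing just described. To keep it short, it skips the extra odd-length flag bit (so it only handles full pairs), and it builds the character from the resulting code point rather than writing raw bytes, so pairs whose packed value falls below U+0800 would serialize to fewer than three bytes; the real UniCipher code presumably deals with those details differently.

```rust
/// Pack two 7-bit ASCII bytes into one Unicode scalar value using the
/// layout described above: 111000AB 10aaaaaa 10bbbbbb, i.e. the code
/// point 0b00AB_aaaaaa_bbbbbb.
fn encode_pair(c0: u8, c1: u8) -> char {
    assert!(c0 < 128 && c1 < 128, "inputs must be ASCII");
    let high = (((c0 >> 6) as u32) << 13) | (((c1 >> 6) as u32) << 12);
    let low = (((c0 & 0x3F) as u32) << 6) | ((c1 & 0x3F) as u32);
    // Safe to unwrap: 0..=0x3FFF contains no surrogate code points.
    char::from_u32(high | low).unwrap()
}

/// Reverse the process: pull the two ASCII bytes back out of the code point.
fn decode_pair(ch: char) -> (u8, u8) {
    let cp = ch as u32;
    let c0 = ((((cp >> 13) & 0x1) << 6) | ((cp >> 6) & 0x3F)) as u8;
    let c1 = ((((cp >> 12) & 0x1) << 6) | (cp & 0x3F)) as u8;
    (c0, c1)
}

fn main() {
    let encoded = encode_pair(b'H', b'i');
    println!("'Hi' -> {encoded}"); // a single CJK-looking character
    assert_eq!(decode_pair(encoded), (b'H', b'i'));
}
```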
The cool thing here is our output UTF-8 character is valid. If we write out this Unicode to a file, the file should render without issue. It's almost like steganography, except I'd argue the text we output is gibberish, so not quite.
It's also worth noting that because there are 128 ASCII characters, when encoding two of them there are 128² possible combinations. In addition, we have 128 more combinations due to the fact that we need to be able to encode just a single character in the case of an odd-length ASCII string. This means we could end up generating a total of 128² + 128 (16512) possible Unicode code points.
UniCipher
Putting this all together, you get my new program (abomination?): UniCipher.
Using the --encrypt and --decrypt arguments, one can encrypt and decrypt strings of ASCII characters.
The program also accepts input from stdin or a file and can output to a file as well.
Two ciphers are available for the program. The "standard" cipher does exactly what I've described above. Given the use of 3 bytes, most of our generated Unicode characters end up somewhere on the Basic Multilingual Plane, the majority of which are CJK characters.
The second cipher, "extended", uses 4 bytes instead of 3. The Unicode space covered by 4-byte UTF-8 characters is big, and not all of it is assigned. While this cipher generates valid Unicode code points, our particular algorithm ends up generating code points on the Supplementary Multilingual Plane, most of which is empty. So it's not as "good" as the standard cipher (for whatever "good" means in this context 😛). But if we pick our ASCII characters just right...
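The post above doesn't spell out the extended layout, so here's one hypothetical way a 4-byte packing could work; this is my own sketch, not necessarily what UniCipher does: offset the two 7-bit values into plane 1, so every output is a Supplementary Multilingual Plane code point and Rust serializes it as four UTF-8 bytes.

```rust
// A hypothetical "extended"-style packing (an assumption, not UniCipher's
// actual layout): shift two ASCII bytes into plane 1, the SMP.
fn encode_pair_smp(c0: u8, c1: u8) -> char {
    assert!(c0 < 128 && c1 < 128, "inputs must be ASCII");
    // 0x10000 is the start of the SMP; 14 bits of payload keeps us
    // inside U+10000..=U+13FFF, comfortably within the plane.
    let cp = 0x10000 + ((c0 as u32) << 7) + c1 as u32;
    char::from_u32(cp).unwrap()
}

fn decode_pair_smp(ch: char) -> (u8, u8) {
    let payload = ch as u32 - 0x10000;
    (((payload >> 7) & 0x7F) as u8, (payload & 0x7F) as u8)
}

fn main() {
    // Pad odd-length input with a space, purely to keep the sketch short.
    let encoded: String = "hi mom"
        .as_bytes()
        .chunks(2)
        .map(|pair| encode_pair_smp(pair[0], *pair.get(1).unwrap_or(&b' ')))
        .collect();
    println!("{encoded}"); // mostly unassigned code points

    let decoded: String = encoded
        .chars()
        .flat_map(|ch| {
            let (a, b) = decode_pair_smp(ch);
            [a as char, b as char]
        })
        .collect();
    println!("{decoded}");
}
```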