I gave a lightning talk at pdxruby recently. I was trying to explain the gotchas, but I was live coding in pry and there wasn’t enough time to land some nice, succinct take-aways. My bigger point was something like “our industry seems to keep forgetting certain things”. This is not to say Yer Doin It Wrong. I just think it’s interesting that some things keep coming up because they are very rare:
- How to generate an SSL cert
- Encoding and utf-8
- Database salts
- HTTP and the RFCs - I’ve personally forgotten or misremembered something from each of these
Even if you’ve done it many times, you haven’t done it recently (like just now), so we all forget. This theme is interesting! Different teams, people, states and projects … some common patterns, maybe? With these hard subjects, I often come across as saying “wrong!” and that’s not what I’m trying to do. I just want to point out where the key things are, so you remember where to look, google some more, or trigger your memory.
So, this encoding thing. Ruby 2.x changed lots of things. First, your source file is utf-8. Your strings are utf-8 by default. There’s more to it than that, but it’s all pretty much utf-8 now. There’s also no iconv in the stdlib anymore. It’s just `.encode` off the String class (we’ll get to that in a second).
Your Encoding Friends
Open up pry (if you don’t have pry, `gem install pry`). It’s all you’ll need.
If you do `ls Encoding`, you’ll see a list of encodings that Ruby supports. You get this for free in every process. You don’t need to do anything special. You’ll notice that `"".encoding` returns `#<Encoding:UTF-8>`. That inspected `Encoding:UTF-8` bit is coming from that list.
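It’s all free to poke at:

```ruby
# pry's `ls` command lists what lives under a constant
ls Encoding        # Encoding::ASCII_8BIT, Encoding::UTF_8, Encoding::Shift_JIS, ...
"".encoding        # => #<Encoding:UTF-8>
Encoding.list.size # how many encodings this Ruby ships with
```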
There are also shorthand versions of these encoding names that you can use, but I like using the constants where I can because they’re namespaced under `Encoding`, so it’s more intention-revealing. So let’s write a file as utf-8 so I can explain the shorthand thing.
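Something like this (the file name is just for play):

```ruby
# 'w:utf-8' -- the part after the colon is the shorthand encoding name,
# which sets the external encoding for the write
File.open('awesome.txt', 'w:utf-8') { |f| f.write('awesome') }
```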
This is pretty straight-forward. It creates a file with “awesome” in it, encoded in utf-8. You can’t say `'w:latin-1'` here. Latin-1 is another name for iso-8859-1, but `latin-1` doesn’t work as an encoding name in the file-writing mode.
You can write the same text under a few different encodings and the bytes come out exactly the same. There’s a historical reason for this. EBCDIC begat ASCII begat ANSI (sort of) begat Unicode. All along the way, the lowest bytes stayed backwards compatible.
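You can check this in pry; for ASCII-range text the bytes are identical in all three encodings:

```ruby
"awesome".encode(Encoding::US_ASCII).bytes    # => [97, 119, 101, 115, 111, 109, 101]
"awesome".encode(Encoding::ISO_8859_1).bytes  # => same bytes
"awesome".encode(Encoding::UTF_8).bytes       # => same bytes again
```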
This is also why English-speaking programmers are surprised by encoding errors: you can get away with a lot by sticking with these low-order bytes and remaining ignorant (slightly strong word, but intended in its opportunity sense). It’s only when “weird” data comes in that we have to think about encoding, right?
Here’s another friend. If you do `Encoding::BINARY.to_s` you’ll get ‘ASCII-8BIT’. This is the same as saying “I don’t know”. It’s not the same as `Encoding::ASCII`. You can tell because its `.to_s` says ‘US-ASCII’. So `.to_s` can be handy here.
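In pry:

```ruby
Encoding::BINARY.to_s  # => "ASCII-8BIT" -- really "raw bytes, encoding unknown"
Encoding::ASCII.to_s   # => "US-ASCII"
```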
There is a method called `.encode`. This takes the place of `Iconv` in the stdlib. It works just like the unix command `iconv`: it takes bytes in one encoding and converts them into another. This isn’t the same as `.force_encoding`, as we’ll see in a second.
Now this is where culture/language trickiness comes in.
Lucky
All these things are the same bytes because we (sort of) got lucky in our history: where ASCII came from (A is for American) and how computer keyboards and alphabets work. Someone had a good counter-argument to this statement at the meetup and I agree. What I mean is, some of this is a bit culturally sensitive and complicated.
What I really mean is:
- English works well on a keyboard
- Keyboards are the fastest input device
- ASCII was invented by English speakers
- UTF-8 is backwards-compatible with ASCII
- English was invented before the computer
So, world, I’m sorry (empathy not apology).
What Encoding Is
Take this string `"\x20"`. It’s a space character. If you look at `man ascii` you’ll see that 20 is “ “ in ASCII (that table is in hex; it’s decimal 32). You might recognize this from `%20` in URLs. The `\x` bit means hex, and URL encoding is hex too, so the 20 in `"\x20"` is the same 20 as in `%20`. If I pick something higher in the codepage like `"\xC3"`, things are going to get weird. “\xC3” by itself isn’t valid utf-8. And that’s fine until I try to do something with it. If I print it, it’s nothing. Puts just gives me the normal newline.
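In pry:

```ruby
"\x20"                  # => " "
"\xC3".valid_encoding?  # => false -- not valid utf-8 on its own
puts "\xC3"             # prints nothing visible, just the newline
```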
If I combine it with `\x20`, that’s not valid. ASCII space sits at the very start of the UTF-8 codepage (the low bytes), so it can’t be the continuation byte that `\xC3` says should come next. I can’t just make up stuff. Or maybe I can and get lucky. But in this case, it prints the unknown utf-8 symbol: <?>. If I try something else, a different error message shows up (the exact messages vary by Ruby version):
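```ruby
s = "\xC3\x20"
s.valid_encoding?           # => false
puts s                      # the terminal shows the unknown-character symbol
s.encode(Encoding::UTF_16)  # Encoding::InvalidByteSequenceError
s =~ /Y/                    # ArgumentError: invalid byte sequence in UTF-8
```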
Not that this can’t be done. If I use something that definitely fits in the ASCII range (low bytes), everything is fine by implicit coincidence.
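For example:

```ruby
"a\x20b"                  # => "a b" -- \x20 is a plain ASCII space
"a\x20b".valid_encoding?  # => true
```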
So what’s going on? Let’s look at this new string “YAY”. Its bytes are [89, 65, 89]. So 89 is what in hex … um … piece of paper: `89.to_s(16)` says “59”. Right. So “YAY” is the byte sequence `"\x59\x41\x59"`. We can take this, force it to ASCII or to utf-8, and get “YAY” back either way, because ASCII fits inside the beginning of utf-8.
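In pry, the whole dance looks something like this:

```ruby
"YAY".bytes                               # => [89, 65, 89]
89.to_s(16)                               # => "59"
"\x59\x41\x59"                            # => "YAY"
"YAY".force_encoding(Encoding::US_ASCII)  # => "YAY"
"YAY".force_encoding(Encoding::UTF_8)     # => "YAY" -- same bytes either way
```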
We could do this all day and not flip a bit. `force_encoding` just isn’t modifying the byte sequence, and the byte sequence is really what the data is.
So that’s the happy path with ASCII. It just sort of luckily works because of history and other things that are complicated. The more complicated path involves a few things: first, what happens when Ruby loses track of the encoding it knows about, and then what happens when non-ASCII things start happening.
This is the Korean word for wizard: 마법사. I don’t know Korean btw. It’s just an easy alphabet and I think it’s neat.
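In pry:

```ruby
word = "마법사"  # "wizard" in Korean
word.chars      # => ["마", "법", "사"] -- three characters
word.bytes      # => [235, 167, 136, 235, 178, 149, 236, 130, 172] -- nine bytes
```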
Nothing in `.bytes` is going to be over 255 because bytes are 8-bit. You’ll never, ever see `.bytes` return anything over 255. So what’s the deal? Why are there more bytes than characters? Is it because Korean has more letters inside each of those characters? No, that guess doesn’t make sense when I do this with a single “character”:
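```ruby
"사".chars  # => ["사"] -- one character
"사".bytes  # => [236, 130, 172] -- three bytes
```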
It’s because utf-8 is a variable-width encoding. ASCII fits in 1 byte per character. If we encode this to `Encoding::UTF_16`, it has four bytes. What we think of as a letter is irrelevant. It’s bytes and codepoints in an encoding scheme. ASCII/English just happens to be lucky at the top of the number chart.
So let’s turn that single character into utf-16 (Java’s default).
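In pry:

```ruby
"사".encode(Encoding::UTF_16).bytes  # => [254, 255, 192, 172] -- a BOM plus two bytes
```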
But that doesn’t mean we should. And … if we force this the wrong way, we’ll have a bad time. Ruby won’t change the bytes if you do `.force_encoding`. But it will if you `.encode`, as you can see. It depends on what you are trying to do.
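Side by side:

```ruby
"사".dup.force_encoding(Encoding::UTF_16).bytes  # => [236, 130, 172] -- label changed, bytes didn't
"사".encode(Encoding::UTF_16).bytes              # => [254, 255, 192, 172] -- bytes actually converted
```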
Next, I’m going to show what you can do with all of this.
Data Corruption
Let’s take a more practical example. Let’s say a file was written in the wrong encoding. This could be a database backup file that you really care about. You could use iconv but let’s play in pry because it’s more fun and interactive.
Let’s set up the failure scenario.
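Here’s one way to manufacture the mess (the file name and sample text are just for play):

```ruby
data = "résumé"                             # 1. the data starts life as utf-8
File.open('backup.txt', 'w:utf-8') do |f|
  # 2. forced to latin1, then written through a utf-8 file handle,
  #    so Ruby transcodes the mislabeled bytes on the way out
  f.write(data.dup.force_encoding(Encoding::ISO_8859_1))
end
import = File.read('backup.txt', encoding: Encoding::ISO_8859_1)  # 3. read as a latin1 file
```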
If you just try to `.force_encoding` it, that’s not going to work:
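```ruby
# with the setup above
import.force_encoding(Encoding::UTF_8)  # => "rÃ©sumÃ©" -- valid utf-8, but still mojibake
```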
Interestingly, `.force_encoding` sticks: it mutates the string in place. So let’s try again, knowing the path that the data took. We can reverse it:
- First the data was utf-8.
- Then it was forced to be latin1 but it’s in a utf-8 file.
- Then it was read as a latin1 file.
Since the read happened in Ruby-land, we can `force_encoding` away the file-reading mistake. Now it’s a utf-8 string that was forced to latin1 in mistake 2, so we just have to re-encode those bytes back to latin1. Finally, it was utf-8 in mistake 1. We can `force_encoding` for that last step too, because those bytes were never written externally or re-encoded; they were only relabeled.
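Putting those steps into code:

```ruby
fixed = import
  .force_encoding(Encoding::UTF_8)  # undo mistake 3: the file really held utf-8 bytes
  .encode(Encoding::ISO_8859_1)     # undo mistake 2: recover the bytes that were forced
  .force_encoding(Encoding::UTF_8)  # undo mistake 1: those bytes were utf-8 all along
fixed  # => "résumé"
```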
You can do it as one big line and play with this. Just make sure to check the encoding of your play variables. The variable `import` is now utf-8 (the `force_encoding` stuck), so weird things will happen if you think it’s still latin1. Re-read the file with `readlines` to reset your playtime.
UTF-8 Doesn’t Just Solve Everything
Base64 encodes to ASCII, so you’ll have very similar problems to the ones above.
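For example, a round trip through Base64 comes back labeled “I don’t know”:

```ruby
require 'base64'

round_tripped = Base64.decode64(Base64.encode64("사"))
round_tripped.encoding                         # => #<Encoding:ASCII-8BIT> -- "I don't know"
round_tripped == "사"                          # => false! same bytes, incompatible labels
round_tripped.force_encoding(Encoding::UTF_8)  # => "사"
```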
Conclusion
Encoding is hard. It comes up a lot. I forget what I have learned.
I hope this is a beacon to myself and others about some lessons and tricks.
Playing with this stuff now might save you stress later when something real pops up. I’ve seen backups be useless and then saved with `iconv` tricks, and Ruby’s `encode` method is the same thing.