name: open class: center, middle # Encoding is .highlight-danger[Hard] ## it's not you --- # Argument from Authority * I am not expert level - whatever that is * Not _"Yer Doing It Wrong"_, not a reaction * Just funny that Encoding is one of those forever problems --- # You Get .highlight-medium[Nothing] * Encoding problems will still happen even after this talk * ^^ Why? --- # Encoding is always required. Encoding is .highlight-ok[always there]. * vim - `set encoding=utf-8; set fileencoding=utf-8` * iTerm - `Terminal --> Character Encoding` * your shell - `export LC_LOCALE` * tmux - `set-window-option -g utf8 on` * Language strings and IO - Ruby / ObjC * Database defaults, table defaults * Database client defaults * HTTP payloads * Payloads modified by our users' browser * Files --- # Encoding happens. There's no such thing as plain text. ```ruby File.open('/tmp/awesome.txt', 'w') {|f| file.puts "awesome" } # same File.open('/tmp/awesome.txt', 'w:utf-8') {|f| file.puts "awesome" } ``` --- # What is Encoding > Humans --> "hi"
> Computers --> Bytes Whyyyy thoughhhhhh????? * Why can't we just have text stored directly? * Good question! --- class: split-30 # Super Fast History .column[!["ascii character map](images/ascii.png)] .column[ASCII mapped characters straight to numbers. Those numbers could be stored in 7 bits. Eventually people realized you could store extra stuff in 127+ because you had an extra bit.] --- # Standards Replaced Other Standards * EBCDIC (punchards era) begat ASCII. * ASCII begat IBM Code Pages * IBM Code Pages begat ANSI standard * Move to code pages for higher bit characters. --- # Unicode `U+0041` Abstraction of number and encoding. Unicode represents "A" to a code point. A code point doesn't have an encoding. Unicode is 16-bit but that's not why it can store so many characters. Unicode defines the code points and has many planes. An encoding then later makes decisions about byte length. --- # Byte Abstraction ``` H E L L O U+0048 U+0065 U+006C U+006C U+006F (unicode code points) 00 48 00 65 00 6C 00 6C 00 6F (little endian) 48 00 65 00 6C 00 6C 00 6F 00 (big endian) ``` If we just map code points to bytes we already have two ways to store bytes (CPUs can be different!). Instead put a byte-order mark in front: `FE FF 00 48` Bytes abstracted from letter. 48 doesn't mean A. It means a code point which then later means A. --- # Abstraction Means Decoupled ```sh # This is why in Ruby, you have unicode literals \xNNN - hex - codepage coupled \uNNN - unicode ``` --- # UTF-8 Now we can talk about UTF-8. Variable length byte size. Lucky coincidence that ASCII haps to characters below 127. Weee! English speakers don't have to think. This is also why we are surprised. --- # Exploring Encoding with Ruby ```ruby # You can see that 8bit is the same as binary. Encoding.aliases["BINARY"] # => "ASCII-8BIT" # You can get a list of encodings like this: pry> ls Encoding constants: ANSI_X3_4_1968 CP869 IBM862 KOI8_U US_ASCII ASCII CP874 IBM863 MacCentEuro UTF8_DOCOMO # These encodings also have short names. pry> Encoding::ISO8859_1.names => ["ISO-8859-1", "ISO8859-1"] ``` --- ```ruby # Module names in Ruby cannot have - in the name so standard # encodings with - in the name are snake cased. "asdf".force_encoding(Encoding::UTF_8) # The short names can be used in place of the constant names "asdf".force_encoding("utf-8") # Upper and lower case doesn't matter. "asdf".force_encoding("UTF-8") # Although you have to do something like this. File.open("hi.txt", "w:#{Encoding::UTF_8}") {|f| f.puts "hi" } # Which is less idiomatic than this: File.open("hi.txt", "w:UTF-8") {|f| f.puts "hi" } ``` --- # XXD ```ruby File.open("spaces.txt", "w") {|file| file.puts " " } ``` xxd is your friend! ```sh # $ /tmp > xxd spaces.txt # 00000000: 200a # 0a is the newline. pry> File.open("spaces.txt", "w") {|file| file.print "\u0020" } # $ file spaces.txt # spaces.txt: very short file (no magic) # $ /tmp > xxd spaces.txt # 00000000: 20 ``` --- ```ruby File.open("spaces.txt", "w") {|file| file.print "\n\n\n" } ``` ```sh # xxd spaces.txt # 00000000: 0a0a 0a ... ``` A more real example. You can see that the bytes are in groups. That's what that space after `0a0a` is. It's a space in spaces. ```sh 001f3f70: 3a20 4174 7461 6368 696e 6720 4950 7636 : Attaching IPv6 001f3f80: 206f 6e20 6177 646c 300a 4d6f 6e20 4a61 on awdl0.Mon Ja 001f3f90: 6e20 2034 2031 303a 3237 3a33 322e 3032 n 4 10:27:32.02 001f3fa0: 3920 496e 666f 3a20 3c61 6972 706f 7274 9 Info:
PRIORITY 001f3fc0: 4c4f 434b 2041 4444 4544 205b 636c 6965 LOCK ADDED [clie 001f3fd0: 6e74 3d61 6972 706f 7274 642c 2074 7970 nt=airportd, typ 001f3fe0: 653d 342c 2069 6e74 6572 6661 6365 3d65 e=4, interface=e 001f3ff0: 6e30 2c20 7072 696f 7269 7479 3d37 5d0a n0, priority=7]. ``` --- class: center,middle,inverse space in spaces --- class: center,middle,inverse good band name --- ```ruby pry> File.open("space.txt", "w:utf-8") {|f| f.print "\u0020" } ``` ```sh $ /tmp > xxd space.txt 00000000: 20 ``` ```ruby pry> " ".eql? "\u0020" => true ``` --- ```ruby pry> File.open("space.txt", "w:iso-8859-1") {|f| f.print "\u0020" } ``` ```sh $ /tmp > xxd space.txt 00000000: 20 ``` ```ruby pry> File.open("space.txt", "w:us-ascii") {|f| f.print "\u0020" } ``` ```sh $ /tmp > xxd space.txt 00000000: 20 ``` --- # Ruby Literals ``` \uNNNN Unicode codepoint U+NNNN \xNN Character with hexidecimal value NN \nNN Character with octal value NN ``` ```ruby # So now we can do this: pry> File.open("space.txt", "w:us-ascii") {|f| f.print "\x20" } ``` ```sh $ /tmp > xxd space.txt 00000000: 20 ``` --- ```ruby pry> File.open("hello.txt", "w:us-ascii") {|f| f.print "\x48\x65" } ``` ```sh /tmp > xxd hello.txt 00000000: 4865 He ``` And so we could continue with this until we got: ```sh /tmp > xxd hello.txt 00000000: 4865 6c6c 6f20 576f 726c 64 Hello World ``` --- # Let's Write Some Bytes I know you guys do this a lot for shareholder value. ```ruby # some random bytes pry> 209.to_s(16) => d1 pry> 209.chr => "\xD1" pry> puts 209.chr � ``` --- # Let's Try to Write Whatever That D1 Is ```ruby File.open("rus.txt", "w:koi8-r") do |file| file.print "\xD1" end Encoding::InvalidByteSequenceError: incomplete "\xD1" on UTF-8 ``` We need to tell ruby that these bytes aren't UTF-8. ```ruby File.open("rus.txt", "w:koi8-r") do |file| file.print "\xD1".force_encoding("koi8-r") end ``` To actually see it, you need to change your terminal to koi8. ![russian text displayed correctly](images/ruskie.png) --- # Phew, Review * strings have implicit encoding * strings are bytes * some bytes can't convert --- # Startup Idea! An API For Your Pants! ```ruby class PantsController < ApplicationController skip_before_filter :verify_authenticity_token, :only => [:create] respond_to :json def index respond_to do |format| format.json { render json: Pant.all } end end def create render text: request.body.to_s.encoding # => ASCII-8BIT end end ``` --- # People Post Their Pants! ``` $ curl -X POST -H 'Content-Type:application/json' -d '{"title":"Nice Pants", "size":38}' http://localhost:3000/pants ASCII-8BIT $ curl -X POST -H "Content-Type:application/json;charset=UTF-8" -H "Accept-Charset:UTF-8" -d '{"title":"Nice Pants", "size":38}' http://localhost:3000/pants ASCII-8BIT ``` ASCII-8BIT?? What? --- class: center,middle,inverse # .highlight-medium[WTF] ![that's horse hockey](images/horse_s.png) --- # Rails is Right No matter what you do, it's going to come across as binary. It's what http spec says. It's undefined. So you can force it if you want. But know what constraints you are introducing. --- # Data Corruption ```ruby # Let’s set up the failure scenario. wizard = "마법사" wizard.bytes File.open("/tmp/mysql-backup.sql", "w:UTF-8") {|file| file.puts wizard.force_encoding('iso-8859-1') } import = File.open("/tmp/mysql-backup.sql", encoding:Encoding::ISO_8859_1).readlines.first => "\xC3\xAB\xC2\xA7\xC2\x88\xC3\xAB\xC2\xB2\xC2\x95\xC3\xAC\xC2\x82\xC2\xAC\n" import.force_encoding('utf-8') # nope import.force_encoding('utf-8'). # undo the wrong file read encode('iso-8859-1'). # undo the file write force_encoding('utf-8') # undo the force in the file.puts block => "바나나\n" ``` --- # UTF-8 Is No Silver Bullet ```ruby require 'base64' encoded = Base64.encode64 'bacon is great' => "YmFjb24gaXMgZ3JlYXQ=\n" decoded = Base64.decode64(encoded) => "bacon is great" # Yay for ascii? ``` ## Wait a minute ... ```ruby encoded = Base64.encode64 '마법사' => "66eI67KV7IKs\n" decoded = Base64.decode64(encoded) decoded.force_encoding('utf-8') # The bytes didn't change, so force_encoding is correct here '마법사'.bytes ``` --- # The End * [Daniel Miessler's blog post](https://danielmiessler.com/study/encoding/) * [Encoding in Ruby and Everywhere](http://squarism.com/2015/07/08/encoding-in-ruby-and-everywhere/)