If a system’s glitches can be compared to fish, I want to tell you about my white whale.
A while back, I was working on a system feature that read some XML from the filesystem, XSLT’d it into HTML, and served it up to a browser. The XML contained a bunch of characters from the higher Unicode ranges (i.e., code points above 255), and wouldn’t you know, when viewed in a browser, these characters showed up as garbled data. Not “The Box” (that ugly little placeholder used when a font doesn’t contain a glyph for a given code point), but usually one to three seemingly random characters that had nothing to do with the character that was supposed to be displayed.
Classic encoding problem.
For the uninitiated in character encodings, let me fill you in real quick. Disks store bytes, not characters. A byte is a numeric value between 0 and 255. To store characters on disk, a convention is used to map byte values to characters. In the early days of computing, we kept things simple and said that a file could contain no more than 256 distinct characters. Lately, we’ve taken to storing over 60,000 distinct characters. How do we represent that many values with just a byte?
Actually, that depends. An exceedingly large number of conventions exist for mapping more than 256 characters to bytes. What all of these schemes have in common is that multiple bytes are used to represent a single character. Two bytes, used together, can represent 65,536 distinct characters; with three bytes, bump that up to over 16 million.
And therein lies the rub. Files don’t indicate the encoding used within them. Indeed, there’s no guarantee that a file stores character values at all. The user must know what to expect within the file, and if it’s character data, they must know what encoding was used to store it.
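To make that concrete, here’s a small sketch. The byte values and charset names are just illustrative picks, not anything from the original system — the point is that the same two bytes mean entirely different things under two different conventions:

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    public static void main(String[] args) {
        // 0xC3 0xA9 is the UTF-8 byte sequence for 'é' (code point 233).
        byte[] bytes = { (byte) 0xC3, (byte) 0xA9 };

        // Decoded with the convention that produced them: one character.
        System.out.println(new String(bytes, StandardCharsets.UTF_8).length());

        // Decoded with ISO-8859-1, where every byte is its own character,
        // the same two bytes become the two junk characters "Ã©".
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1).length());
    }
}
```

Same bytes, two different answers — and nothing in the bytes themselves tells you which answer is right.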
Back to the story. I knew it was an encoding glitch; multiple characters showing up in place of one is a classic symptom (the character was represented by multiple bytes, but something was treating each byte as a character of its own). I immediately assumed that the browser or the servlet (or the web framework on top of it) was to blame. I spent a lot of time educating myself on how encodings work over the web. I threw hours at the problem here and there and came up empty-handed each time.
And then, whilst reading through some of the backend code, I saw this innocuous little line:
Document document = new SAXBuilder().build(new FileReader(file));
See the problem? Look again. Notice the FileReader? I’m such an idiot. Here’s the deal. XML files can contain any of thousands of different Unicode characters and can use any of a number of different encodings to map those characters to bytes. The encoding used on a particular XML document is indicated in the prolog, such as:
<?xml version="1.1" encoding="UTF-8"?>
I don’t really use XML 1.1; I just put that in to piss off Elliotte. 😉 Note the encoding. Now, back to our Reader. Readers in Java are nice because they handle converting bytes into characters automatically. But in order to do that, they have to know what encoding was used on the bytes they’re being handed. If you don’t specify an encoding, a Reader will use the operating system’s default encoding.
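You can watch this bite in a few lines. This sketch simulates the unlucky platform default by naming ISO-8859-1 explicitly (your actual default comes from Charset.defaultCharset(), so the file name and charsets here are just for illustration):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReaderDemo {
    public static void main(String[] args) throws IOException {
        // Write "café" to a temp file as UTF-8; the 'é' lands on disk
        // as the two bytes 0xC3 0xA9.
        Path file = Files.createTempFile("reader-demo", ".txt");
        Files.write(file, "café".getBytes(StandardCharsets.UTF_8));

        // Decoded with the charset the file was actually written in:
        // four characters come back.
        System.out.println(read(file, StandardCharsets.UTF_8).length());

        // Decoded with a mismatched charset (standing in for an unlucky
        // OS default): the two-byte 'é' becomes two garbage characters.
        System.out.println(read(file, StandardCharsets.ISO_8859_1).length());

        Files.delete(file);
    }

    // A FileReader is essentially an InputStreamReader built with the
    // platform default charset; here we name the charset ourselves.
    static String read(Path file, Charset cs) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file.toFile()), cs))) {
            return in.readLine();
        }
    }
}
```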
Ahhh, and there’s our problem. PCs, Macs, *nix boxes: they all use different default encodings, and they ain’t UTF-8 (actually, on some *nixes it might be, I dunno). My XML files were UTF-8 encoded. So when I used a Reader to parse my XML file, my characters were being garbled before the XML parser ever saw them.
This is the code I should have written:
Document document = new SAXBuilder().build(new FileInputStream(file));
If you hand an XML parser bytes, which are the currency of InputStreams, the parser handles converting those bytes to characters itself, and uses the encoding declared in the XML prolog to configure that process. If you hand it characters, it’s stuck with those characters and can’t affect the decoding process one whit, since the decoding has already happened a level beneath it.
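JDOM isn’t special here; the JDK’s built-in DOM parser behaves the same way. A minimal sketch using javax.xml.parsers (with an in-memory byte stream standing in for the file, so it’s self-contained):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class PrologDemo {
    public static void main(String[] args) throws Exception {
        // The prolog declares UTF-8, and the bytes really are UTF-8.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><word>café</word>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);

        // Handed raw bytes, the parser reads the prolog and picks the
        // right decoder on its own -- no Reader in the way.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));

        // "café" survives intact: four characters.
        System.out.println(doc.getDocumentElement().getTextContent().length());
    }
}
```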
It turns out this is a rather insidious bug. Because most encodings agree on the characters assigned to byte values 0-127 (the ASCII standard was that pervasive), and because those are by far the most common characters for most folks here in the United States, you can go a long way with a character encoding bug like this and never know any different. But the day you add a higher-valued character… weird things happen.
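That’s easy to verify: for text in the ASCII range, a “right” encoding and a “wrong” one produce byte-for-byte identical output, so the mismatch has nothing to trip on.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiDemo {
    public static void main(String[] args) {
        // Plain ASCII text: every byte value is below 128.
        String text = "plain old ASCII";

        // UTF-8 and ISO-8859-1 encode this range identically, so a
        // mismatched Reader never gets caught -- until a higher
        // character shows up.
        byte[] utf8   = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(utf8, latin1));
    }
}
```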
Learn from me. Spare yourself the pain of wrestling with this one yourself. Make me feel my time was well spent. Never, ever use a Reader to parse an XML file. There’s already a great system for letting the parser handle the decoding; let it.