Sunday, February 18, 2007

Myth #6 - Web page authors, this one's fer you


For goodness' sake, how many myths are there? The answer is, as many as I have heard. But truly, there are fewer than 15*, so read on! This is one that keeps popping up like a bad penny:
"ISO-8859-1 is the standard encoding for HTML."
Sooooo, does that mean all those Web pages in Japanese and Chinese are a bunch of standard-violating hacks? No, of course not. It is perfectly legal to use any charset in a Web page, but it should be declared. Why? Because ISO-8859-1 is only a default, not a requirement: HTTP 1.1 names it as the charset to assume for text content that arrives without a charset parameter. (HTML 4.0 itself actually tells browsers not to count on any default.) That means if you don't declare the charset of your page, a browser (or any other HTML interpreter) has to guess, and ISO-8859-1 is a guess it can justify.
Now, admittedly, in practice browsers make other assumptions. Typically you set a preference for a default charset (or character encoding, if you prefer). This is sometimes set based on the localization you install; for example, a Russian version of the browser may set the default charset to "KOI8-R". But the point is that assumptions will be made unless you declare the charset in your document. And declaring it is very straightforward. Just put a META tag as the first tag in the HEAD section, like so:
<META HTTP-EQUIV="Content-type" CONTENT="text/html; charset=utf-8">
Simple, right? Oh yes, and I sneaked in a better charset to "default" to - UTF-8. UTF-8 is an encoding of Unicode, nearly universally supported, covering most of the living languages of the world. Use it and all your cares will be over - uh oh, see Myth #4.
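If you want to see what "assumptions will be made" looks like in practice, here is a minimal Python sketch. It is not how any real browser works (their sniffing rules are far more elaborate); the names `decode_page` and `META_CHARSET` are made up for this illustration. It decodes the same bytes three ways: with a declared charset, with an assumed ISO-8859-1 default, and with an assumed KOI8-R default.

```python
import re

# Hypothetical pattern for this sketch: find "charset=..." near the top of a page.
META_CHARSET = re.compile(rb"charset\s*=\s*[\"']?([\w-]+)", re.IGNORECASE)

def decode_page(raw: bytes, assumed_default: str = "iso-8859-1") -> str:
    """Decode raw HTML bytes with the declared charset, else fall back to a guess."""
    match = META_CHARSET.search(raw[:1024])
    charset = match.group(1).decode("ascii") if match else assumed_default
    return raw.decode(charset, errors="replace")

# The same UTF-8 bytes, once with a declaration in front and once without.
body = "<p>café</p>".encode("utf-8")
declared = b'<META HTTP-EQUIV="Content-type" CONTENT="text/html; charset=utf-8">' + body
undeclared = body

as_utf8 = decode_page(declared)      # charset found: the text ends with "<p>café</p>"
as_latin1 = decode_page(undeclared)  # guessed ISO-8859-1: "<p>cafÃ©</p>"
as_koi8r = decode_page(undeclared, assumed_default="koi8-r")  # a different wrong guess
```

The bytes never change; only the guess does. Declaring the charset takes the guessing out of the picture, which is the whole point of the META tag.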
* were fewer than 15