I decided to give this a blog post of its own.
Time for another revelation: Charsets are really a pain in the ass. Correction: Multiple charsets are really a pain in the ass.
As I've mostly written single scripts for myself until now, it hasn't really hit me before. I guess my subconscious has always been aware of the problem, and naturally I've run into glitches when transferring stuff from one server to another, fixing them one by one without giving it much thought. Only now have I started to realize the utter, shall we say, idiocy of it all.
To give a brief and very redundant recap, charsets (or character sets) are (mostly) sets of mappings between combinations of zeros and ones (bits) and symbols. Each charset has its own idea of which combination of bits represents each symbol. The de facto charset in the HTML realm is called ISO-8859-1, and all documents are assumed to be in that charset unless specified otherwise.
Figuratively speaking, the website throws the client a heap of bits, saying that those contain whatever you should read on a website. It also says which set of mappings you should use to interpret those zeros and ones to get the intended text.
So, it's easy to see that if you change either one, the contents of that website change. If you alter the heap of bits but still interpret it with the same charset, you get different characters wherever the bits have changed. If you change the charset, you potentially change the whole text, most probably into garbage.
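To make that concrete, here is a minimal sketch in Python (not LightFrame code, and the sample word is just something I picked for illustration). It decodes the same heap of bits with two different charsets, and then changes a single byte while keeping the charset fixed.

```python
# One piece of text, written with an escape so this example does not
# depend on the charset of the file it is saved in.
original = "caf\u00e9"  # "café"

# The "heap of bits" a server would send if the page is stored as UTF-8.
heap_of_bits = original.encode("utf-8")  # b'caf\xc3\xa9'

# Same bits, two different sets of mappings.
read_as_utf8 = heap_of_bits.decode("utf-8")         # "café"
read_as_latin1 = heap_of_bits.decode("iso-8859-1")  # "cafÃ©"

# Same mapping, one changed byte: only that one character changes.
latin1_bits = original.encode("iso-8859-1")         # b'caf\xe9'
altered_bits = latin1_bits[:-1] + b"\xe8"
read_altered = altered_bits.decode("iso-8859-1")    # "cafè"
```

The plain ASCII letters survive the wrong charset, which is why this kind of breakage tends to go unnoticed until somebody types an accented character.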
Set aside the obvious problems, like knowing which charset a document is in and telling that to the web server. There's one great problem that really could wreak havoc: if you blindly combine documents of different charsets into one, it's next to impossible to get sensible text out of the result. It might sound like a far-fetched scenario, but it's not all that obscure.
There are two prevalent charsets in web development, ISO-8859-1 and UTF-8, and they are incompatible for everything beyond plain ASCII. Now, imagine that I want to be future-proof and write all my code in UTF-8, as it covers far more characters than ISO-8859-1. But then I get another developer who writes all his code in the old standard, ISO-8859-1. It takes only one include statement to mix the two in a single page, and that's it.
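The include scenario can be sketched like this, again in Python and with made-up variable names. Two byte strings holding the same word in different charsets get glued together, the way a page and an included file would be, and the result is then read with each charset in turn.

```python
word = "caf\u00e9"  # "café", escaped so the file's own charset does not matter

my_page = word.encode("utf-8")             # my code, saved as UTF-8
their_include = word.encode("iso-8859-1")  # the other developer's file

# What the browser receives after the include: one heap of mixed bits.
combined = my_page + b" | " + their_include

# Read strictly as UTF-8, the ISO-8859-1 part is not even valid input.
try:
    combined.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Read as ISO-8859-1, nothing fails, but my half turns into garbage.
read_as_latin1 = combined.decode("iso-8859-1")  # "cafÃ© | café"
```

Whichever charset the page declares, one half comes out wrong, and no single declaration can fix both.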
Containing the character set is the obvious answer. Let's say that LightFrame forces everyone to use UTF-8. Ok, that's settled. Then, when LF ships, a user has ISO-8859-1 set, and all his templates look funny. Or vice versa: we use ISO-8859-1, and a user has UTF-8. Either way, it's screwed. If, miraculously, we reach a consensus on that, along comes a site visitor whose browser says "nuh-uh, I can't/won't read your character encoding". What then?
One solution would be to screw it all and have the user keep tabs on this stuff. If it doesn't work, it's not my problem. But I wouldn't feel happy about that. LightFrame tries to be the magic wand that fixes the tedious. LF has abstracted SQL, and it even has O/R mapping on top of that. It will have user authorization and authentication. In a nutshell, LF is designed to make the developer's life easier. Fixing charsets would definitely make a web developer's life easier.
So, now I'm forced to incorporate text normalization, somehow. Something that would make everything click. Let everyone make a mess of their code, and still have everything look as it was intended from the start. There are no silver bullets. LF requires only a minimal feature set on the server side, while supporting a much larger one. PHP does provide two libraries for this, so an "if you have it, we support it; otherwise, you're on your own" approach might be a possibility. But then again, some servers might have one, but not the other. *sigh* I need a lie-down.
I hear that PHP6 is going to take a stab at fixing this, and I tip my hat to that. But it won't help me; perhaps on the contrary. PHP5 barely has a foothold, as PHP4 is still the prevalent major PHP version out there, so there's no chance that PHP6 gets adopted as soon as I'd like. Not that I would know when PHP6 is scheduled for release.
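One naive shape such normalization could take, sketched in Python rather than PHP, is below. The function name `normalize_to_utf8` is hypothetical, and the whole thing rests on a loud assumption: every input is either UTF-8 or ISO-8859-1. It tries the strict charset first and falls back to the permissive one.

```python
def normalize_to_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as UTF-8 (hypothetical helper).

    Assumes the input is either UTF-8 or ISO-8859-1 and nothing else.
    """
    try:
        # UTF-8 is strict: most ISO-8859-1 text with accents fails here.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this never fails.
        text = raw.decode("iso-8859-1")
    return text.encode("utf-8")


word = "caf\u00e9"  # "café"
chunks = [word.encode("utf-8"), word.encode("iso-8859-1")]

# Normalize each piece before combining, not after.
page = b" | ".join(normalize_to_utf8(chunk) for chunk in chunks)
page_text = page.decode("utf-8")  # "café | café"
```

This is a guess, not detection. ISO-8859-1 bytes that happen to form valid UTF-8 get misread, and it only works if each piece is normalized separately before anything is combined, so it's no silver bullet either.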
Monday, April 21, 2008