We Don't Need Encoding!

OK, I needed the title to be funny.  The truth is, I've had 3 encoding issues thrown my way recently.  One was an internal bug and, sadly, the other two were of my own doing.  One was in Wolfclock: I found a case where I didn't specify the encoding (which can cause problems internationally, as I found out from some users).  The other was on my own site: I couldn't figure out what the problem was with my serialized data ... and, not surprisingly, it was the encoding.

Let's face it: in developing non-international software, developers seldom worry about encoding and of course never worry about internationalization.  You'll get up to speed pretty quickly as a developer at Microsoft, though.

Speaking as a developer who grew up in the U.S., we could more or less assume things would work.  I remember coding on a PC-XT and the days of ASCII.  Ah, ASCII.  Because so much has been kept backward compatible with it, you can just about get away without knowing much about encoding these days -- even if the encoding is technically wrong, more often than not you'll get the intended result.  Kind of scary, really.
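To make the "technically wrong but still works" point concrete, here is a small sketch in Python (not .NET; the sample strings are my own).  Pure ASCII bytes decode to the same text under several common encodings, which is why a wrong guess often goes unnoticed until a non-ASCII character shows up.

```python
# Illustration only: the sample strings and encoding list are arbitrary choices.
ascii_bytes = "Hello, world".encode("ascii")

# The same 7-bit bytes decode identically under each of these encodings.
decoded = {
    name: ascii_bytes.decode(name)
    for name in ("ascii", "utf-8", "latin-1", "cp1252")
}

# One non-ASCII character is enough to expose a wrong guess.
non_ascii_bytes = "caf\u00e9".encode("utf-8")    # "café" stored as UTF-8
as_latin1 = non_ascii_bytes.decode("latin-1")    # misread as Latin-1

print(decoded)
print(as_latin1)
```

The second print shows the familiar two-character garbage where the accented letter should be.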

If you develop code, HTML, XML, or anything else, you must read Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)."  Whew, that was a mouthful.  As Joel points out in his article, it's not that hard. 

As far as .NET is concerned, a string (System.String) is essentially an array of char, where each char is a 2-byte UTF-16 code unit.  Unicode characters beyond the first 65,536 code points don't fit in 2 bytes, so they are stored as surrogate pairs: two chars per character.
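A rough way to see the code units for yourself, sketched in Python rather than .NET.  Python strings are sequences of code points, not UTF-16, so the helper below (utf16_code_units, a name I made up) encodes to UTF-16 explicitly to show the 2-byte units a UTF-16 string would hold.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

bmp_char = "\u00e9"         # e with acute accent, within the first 65,536 code points
astral_char = "\U0001F600"  # an emoji, beyond U+FFFF

bmp_units = utf16_code_units(bmp_char)
astral_units = utf16_code_units(astral_char)

print([hex(u) for u in bmp_units])     # one code unit
print([hex(u) for u in astral_units])  # two code units: a surrogate pair
```

The emoji is a single character but takes two code units, so anything that counts code units will report a length of 2 for it.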

The System.Text.Encoding class is your best friend when it comes to specifying character encoding (you can write your own derivative, but I'd imagine this is rarely needed).  Where developers often go wrong (IMHO) is not specifying the encoding for streams.  I wonder if people think, "Hmmm.  It's optional.  I'm not quite sure so let's just not specify one and hope it works out."  One tool had an issue because -- aside from implementation curiosities -- data encoded as UTF-16 was streamed to a file.  No encoding was specified so the default encoding -- UTF-8 -- was used.  This can (and did) go unnoticed for over a year.  That's the danger of just hoping for the best -- it may not be obvious, even at runtime, that something's askew.
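Here is a minimal sketch of that kind of mismatch, in Python rather than .NET and with a made-up file name and sample string.  It is not the code from the tool in question, just the same failure pattern: text written as UTF-16 and read back as UTF-8 without any error being raised.

```python
import os
import tempfile

text = "encoding"

# Write the text as UTF-16 (little-endian, no byte order mark).
path = os.path.join(tempfile.mkdtemp(), "data.txt")  # hypothetical file
with open(path, "w", encoding="utf-16-le") as f:
    f.write(text)

# Read it back as UTF-8.  For ASCII-range text this raises no exception.
with open(path, "r", encoding="utf-8") as f:
    wrong = f.read()

# Read it back with the encoding it was actually written in.
with open(path, "r", encoding="utf-16-le") as f:
    right = f.read()

print(repr(wrong))  # each letter is followed by a NUL character
print(repr(right))
```

Since NUL characters can be invisible in some viewers, the wrong result can look fine at a glance, which fits with a bug like this going unnoticed for a long time.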

I think it's OK to use ASCII (not even extended ASCII -- I'm talking 0-127) as long as you know you're using it.  I'd highly recommend always specifying the encoding even when it's not required (on streams, for example) -- it makes the code much easier to follow and the intention is obvious.
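As a sketch of what "knowing you're using ASCII" can look like (Python again, and write_ascii is a hypothetical helper of mine): name the encoding explicitly, and anything outside 0-127 fails loudly instead of being silently mangled.

```python
import io

def write_ascii(stream, text):
    """Write text to a binary stream as strict 7-bit ASCII.

    Raises UnicodeEncodeError if text contains anything outside 0-127.
    """
    stream.write(text.encode("ascii"))

buffer = io.BytesIO()
write_ascii(buffer, "plain text")

try:
    write_ascii(buffer, "caf\u00e9")  # contains a non-ASCII character
    failed = False
except UnicodeEncodeError:
    failed = True

print(buffer.getvalue())
print(failed)
```

The explicit "ascii" also tells the next reader of the code what the author intended.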

Thanks for a great article, Joel.  I admit I thought Unicode had a 65k character limit!

Comments (2) -

Michael K. Campbell
10/25/2005 5:48:00 PM #

I'll have to go check that article out.

I seriously always find myself WISHING that I could find a seriously killer resource on encoding/streams/and all that goodness.

For some reason they didn't teach me much about it in college (it never
came up in Arabic or in any of my History of the Crusades classes,
etc.) Luckily I've had a decent amount of exposure to localization
xml/html/and even encryption so I get some of the basics -- it's just
not a second language for me yet...

One reason devs may not specify an encoding: what's the possibility
that many of them don't know what it is, get that 'deer in the
headlights' feeling, and just skip it if they can still compile? I'm
wagering that explains SOME of it...

James Byrd
12/8/2005 4:36:29 PM #

I'm guessing that most people don't think about encoding because it is never treated as an issue when you are learning how to code .NET.

When I took the Guerrilla .NET class from DevelopMentor, I don't remember them saying anything about it, but there was enough information coming my way that I could have missed it.

Just for the heck of it, I looked through three of my .NET programming books to see how important they considered the topic. I did not find even one reference to encoding in the contents or index of Essential ASP.NET, Essential .NET (vol 1), or Programming C#. I can only conclude that understanding encoding is not essential for working with .NET! ;)
