Friday, November 11, 2011

How Rosie O'Donnell and Microsoft Conspired Against Me

It seems that in order to get to the bottom of a problem, and really understand what is going on, you have to be willing to continue to dig.  When I got to the bottom of my problem with Rosie O'Donnell, it turned out that Microsoft had assisted in a silent search and replace conspiracy.

I'm one of the creators and maintainers of the open source DVD Store database test kit and application.  Dave Jaffe and I created it for internal tests and then decided to enhance it a bit more and release it as open source in about 2005.  If you like to do database testing, you should really try it out.

As part of the data for this test database we created a program that generates fake titles and actors for the DVDs in our test database.  We take real names and then mix them up and recombine them to create all of our fake names.  We have a bunch of first names - TOM, ROSIE, BRAD - and we have a bunch of last names - CRUISE, O'DONNELL, PITT.  These are then randomly combined to create lots of names.  So we would have ROSIE CRUISE, BRAD O'DONNELL, and TOM PITT for example.

This same set of data has been in use with the DVD Store for at least 7 years, but this week I found out something new -  Not all apostrophe's are created equal.

When editing a data load file, with vi on Linux, that had our fictional actors names in it, I noticed that instead of an apostrophe we had the following O<92>DONNELL.  This was consistent throughout the file every time O'DONNELL was supposed to appear.

I looked at the same file on Windows in notepad and it appeared correctly - O'DONNELL.  The strange this was that if you used the keyboard to type a single quote (the key just to the right of the semicolon key) that the character it produced looked slightly different than the on next to the O in O'DONNELL in the file.

Why was Rosie O'Donnell causing these problems with an apostrophe that wasn't an apostrophe?

It turns out that it is Microsoft's fault.  It is a known issue for those that are into these types of things.  Microsoft will replace an entered single quote character with a "Microsoft Apostrophe" that is coded as a 92.  Microsoft will then display this as something that looks more like a real apostrophe, but at the cost of putting a non-defined character code into your text.  Here is a quote from wikipedia on the subject:

"Microsoft Windows CP1252 (sometimes incorrectly called ANSI or ISO-Latin) contains the typographic apostrophe at 0x92. Due to "smart quotes" in Microsoft software converting the ASCII apostrophe to this value, other software makers have been forced to adopt this as a de facto convention. For instance the HTML 5 standard specifies that this value is interpreted as CP1252. Some earlier non-Microsoft browsers would display a '?' for this and make web pages composed with Microsoft software somewhat hard to read."

It turns out that we initially entered the names for this data in Microsoft Word, where the silent search and replace of our apostrophe's conspired with Rosie O'Donnell against me. 

Be careful out there intrepid geeks - not all apostrophe's are created equal.

Todd

1 comment:

Jignesh Shah said...

Truly when I first hit that problem last year with PostgreSQL, it boggled me for a while. Even after changing the database from SQL_ASCII to UTF8 it did not help. Using "more" or "less" also did not help. Only when I vi'ed the whole data file and saw the different colored <0x92> is when I realized something was wrong out there.. Strange on how other databases accept that character for locales where it it not valid.