Be careful when switching MySQL to UTF-8

Earlier this week, I noticed a couple strange issues on this blog and sent an email to the roller-user mailing list. I figured both issues were caused by my upgrade to Roller 5. Basically, my tag cloud wasn't working and I noticed a bunch of blog entries that had truncated data. I'd provide links to the truncated posts, but I believe I've fixed them, so links would be useless. This post is to make others aware of something I wasn't: be careful when switching MySQL to UTF-8.

The first issue, 404s from my tag cloud links, was something my theme was missing. It seems the tagIndex action is new in Roller 5 and required for tag clouds to work. To fix this issue, I had to add the following XML to my theme's theme.xml file.

<template action="tagsIndex">
    <name>TagsIndex</name>
    <description>Tag index page</description>
        <link></link>
    <navbar>false</navbar>
    <hidden>true</hidden>
    <templateLanguage>velocity</templateLanguage>
    <contentType>text/html</contentType>
    <contentsFile>Weblog.vm</contentsFile>
</template>

Since I wanted to replicate Roller 4's behavior, pointing the contentsFile to Weblog.vm worked just fine. The nice thing is I can always change it to another page and customize it to show more information about the selected tag.

The 2nd issue, data truncation, was a bit trickier. I thought it might've been something Roller did when upgrading my database from Roller 4 to 5. I didn't suspect upgrading from MySQL 3 to 5 would cause it. From my previous upgrade post:

At this point, I figured my database might be slightly hosed, but since it was simply creating tables, I was probably OK. I restarted Tomcat and left the old version in place while I waited for a MySQL 5 database instance from my hosting provider, KGB Internet. Once I got the new instance, I imported my backed-up database, ran the upgrade script and everything worked just peachy.

Keith at KGB looked into my issue and thought the problem was the charset. My old MySQL 3.x database used latin1 while my MySQL 5.x database uses UTF-8. The symptom looked familiar:

Be careful when switching to UTF-8. Once you have converted your data, any program/webapp that uses the database will have to check that the data they are sending to the database is valid UTF-8. If it isn't then MySQL will silently truncate the data after the invalid part, which can cause all sorts of problems.

Luckily, I had a backup of my pre-upgrade database and was able to convert and recover everything successfully with a little iconv, perl and numerous mysqldump and mysql import commands. Of course, it's possible there's still some jacked entries and comments. If you noticed any truncation, please let me know and I'll get them fixed.

Posted in Roller at Apr 28 2011, 08:11:40 PM MDT 2 Comments
Comments:

Hey Matt, The JDBC driver won't truncate those characters, you'll get data truncation exceptions, because it runs with the database connection in sql_mode=strict_trans_tables. You could do the same thing when you restore a dump through "mysql" as well, to avoid truncations that are silent. -Mark

Posted by Mark Matthews on April 29, 2011 at 08:16 AM MDT #

I'm afraid your diagnostic is completely wrong Just read the MySQL doc, you'll learn that your app can still be in latin 1 or whatever encoding, even if the MySQL database stores text in UTF-8. You just have to configure your client and your server. Read about "SET NAMES", it should be useful too you. As suggested, learning about strict modes could also be a good idea.

Posted by Rogojine on May 01, 2011 at 05:53 PM MDT #

Post a Comment:
  • HTML Syntax: Allowed