Mon 20 Nov 2006
UTF-8, a Unicode encoding, is probably already the most used character encoding for new web applications, except maybe for Asia. The most popular open source database is MySQL. (But don’t miss the most advanced open source database, which I prefer.)
What do you need to do to have your database and web application be all UTF-8? MySQL offers a lot of places to configure character set (character encoding) and character collation (used for sorting and comparing text).
The most important rule is: Don’t rely on server configuration – you may not control it on the server you application will be running. When you write a web application that needs to run on a variety of operating systems and Linux distributions, all with their own default database configuration, you must make as little assumptions as possible about the system it will run on and its configuration. You may have configured your MySQL server to run your application correctly, but you may not have permission or the opportunity to reconfigure the MySQL server your application will run on. Fortunately you can specify all character set configuration in MySQL in places that you control: in the SQL scripts and the source code of your web application.
One side of the problem is the data in the tables. If you control the CREATE DATABASE statement, you should specify the character set there:
CREATE DATABASE webapp DEFAULT CHARACTER SET utf8;
On some web hosts, web applications have to use the one database that came with the hosting. The database has already been created for you. In that case you have to specify the character set in the individual CREATE TABLE statements:
CREATE TABLE gadgets ( name VARCHAR(255) PRIMARY KEY, rating INT ) DEFAULT CHARACTER SET utf8;
It doesn’t hurt to specify the character set in both locations. (Except you’re violating the DRY Principle.) Now MySQL knows that the data in your tables is in UTF-8.
Communication with the database
The other side of the problem is the data that comes from and gets sent to the client. MySQL offers a lot of features here; you can have different character sets at almost every stage of data processing. To be all UTF-8, issue the following statement just after you’ve made the connection to the database server:
SET NAMES utf8;
This sets the character_set_client, character_set_connection and character_set_results variables to utf8. See below for the meaning of each of these variables.
Communication with the database also concerns SQL files you read with the MySQL command line client, or upload with phpMyAdmin. Put the statement at the top of every SQL file, like this:
SET NAMES utf8; INSERT INTO TABLE gadgets (name, rating) VALUES ('iPod', 45);
If you’re talking to mysql from a command line that doesn’t understand UTF-8 (likely in Windows and older Linuxes), use the following statement to tell MySQL which character set you’re using on the client side:
SET CHARACTER SET cp1250;
This sets the character_set_client and character_set_results variables to cp1250. Upon arriving on the server, your data will be converted from CP1250 to UTF-8. Results returned to you will be converted from UTF-8 to CP1250.
Which character set do you need for the command line?
- Modern Linux (Fedora, Ubuntu since 5.04, recent SuSE, recent Mandriva) for all languages. No need to issue a SET CHARACTER SET command here.
- Western European Windows and Linux. English, Spanish, German, French, …
- Central European Windows. Polish, Czech, Slovak, Hungarian, Slovene, Croatian, Romanian, Albanian.
- Central European Linux. Polish, Czech, Slovak, Hungarian, Slovene, Croatian, Romanian, Albanian.
- Cyrillic Windows. Russian, Ukrainian, Bulgarian, Belorussian, …
- Russian Linux.
- Ukrainian Linux.
- Arabic Windows.
- Baltic Windows. Estonian, Latvian, Lithuanian.
- Baltic Linux. Estonian, Latvian, Lithuanian.
- Turkish Linux.
Character set and collation variables
What do these character_set_* variables mean?
Variables concerning communication
- This informs MySQL about the character set data from the client is encoded in.
- This is the character set MySQL converts incoming data to. It converts from character_set_client.
- This is the character set MySQL converts outgoing data to. It converts from the character set specified for the data in the database / tables / columns.
Variables concerning data storage
- The character set used for new databases if none is specified in the CREATE DATABASE statement. It can be set from the server configuration file, on the command line for mysqld and interactively in a database session.
- The character set used for new tables if none is specified in the CREATE TABLE statement. It is set to character_set_server by default.
- The character set for meta-data: database, table and column names. Its value is always utf8.
Please leave a comment if this information was helpful to you. Also don’t hesitate to ask questions concerning the above. I’m not the best writer and some things are probably phrased incomprehensibly.
Last but not least I’m always interested in other topics concerning web programming that I could write about. What are you most interested in? What information do you need most?