UTF-8, a Unicode encoding, is probably already the most used character encoding for new web applications, except maybe for Asia. The most popular open source database is MySQL. (But don’t miss the most advanced open source database, which I prefer.)

What do you need to do to have your database and web application be all UTF-8? MySQL offers a lot of places to configure character set (character encoding) and character collation (used for sorting and comparing text).

The most important rule is: Don’t rely on server configuration – you may not control it on the server you application will be running. When you write a web application that needs to run on a variety of operating systems and Linux distributions, all with their own default database configuration, you must make as little assumptions as possible about the system it will run on and its configuration. You may have configured your MySQL server to run your application correctly, but you may not have permission or the opportunity to reconfigure the MySQL server your application will run on. Fortunately you can specify all character set configuration in MySQL in places that you control: in the SQL scripts and the source code of your web application.

Stored data

One side of the problem is the data in the tables. If you control the CREATE DATABASE statement, you should specify the character set there:

CREATE DATABASE webapp
        DEFAULT CHARACTER SET utf8;

On some web hosts, web applications have to use the one database that came with the hosting. The database has already been created for you. In that case you have to specify the character set in the individual CREATE TABLE statements:

CREATE TABLE gadgets (
    name VARCHAR(255) PRIMARY KEY,
    rating INT
) DEFAULT CHARACTER SET utf8;

It doesn’t hurt to specify the character set in both locations. (Except you’re violating the DRY Principle.) Now MySQL knows that the data in your tables is in UTF-8.

Communication with the database

The other side of the problem is the data that comes from and gets sent to the client. MySQL offers a lot of features here; you can have different character sets at almost every stage of data processing. To be all UTF-8, issue the following statement just after you’ve made the connection to the database server:

SET NAMES utf8;

This sets the character_set_client, character_set_connection and character_set_results variables to utf8. See below for the meaning of each of these variables.

Communication with the database also concerns SQL files you read with the MySQL command line client, or upload with phpMyAdmin. Put the statement at the top of every SQL file, like this:

SET NAMES utf8;
INSERT INTO TABLE gadgets (name, rating) VALUES ('iPod', 45);

If you’re talking to mysql from a command line that doesn’t understand UTF-8 (likely in Windows and older Linuxes), use the following statement to tell MySQL which character set you’re using on the client side:

SET CHARACTER SET cp1250;

This sets the character_set_client and character_set_results variables to cp1250. Upon arriving on the server, your data will be converted from CP1250 to UTF-8. Results returned to you will be converted from UTF-8 to CP1250.

Which character set do you need for the command line?

utf8
Modern Linux (Fedora, Ubuntu since 5.04, recent SuSE, recent Mandriva) for all languages. No need to issue a SET CHARACTER SET command here.
latin1
Western European Windows and Linux. English, Spanish, German, French, …
cp1250
Central European Windows. Polish, Czech, Slovak, Hungarian, Slovene, Croatian, Romanian, Albanian.
latin2
Central European Linux. Polish, Czech, Slovak, Hungarian, Slovene, Croatian, Romanian, Albanian.
cp1251
Cyrillic Windows. Russian, Ukrainian, Bulgarian, Belorussian, …
koi8r
Russian Linux.
koi8u
Ukrainian Linux.
cp1256
Arabic Windows.
cp1257
Baltic Windows. Estonian, Latvian, Lithuanian.
latin7
Baltic Linux. Estonian, Latvian, Lithuanian.
latin5
Turkish Linux.

Character set and collation variables

What do these character_set_* variables mean?

Variables concerning communication

character_set_client
This informs MySQL about the character set data from the client is encoded in.
character_set_connection
This is the character set MySQL converts incoming data to. It converts from character_set_client.
character_set_result
This is the character set MySQL converts outgoing data to. It converts from the character set specified for the data in the database / tables / columns.

Variables concerning data storage

character_set_server
The character set used for new databases if none is specified in the CREATE DATABASE statement. It can be set from the server configuration file, on the command line for mysqld and interactively in a database session.
character_set_database
The character set used for new tables if none is specified in the CREATE TABLE statement. It is set to character_set_server by default.
character_set_system
The character set for meta-data: database, table and column names. Its value is always utf8.

Feedback

Please leave a comment if this information was helpful to you. Also don’t hesitate to ask questions concerning the above. I’m not the best writer and some things are probably phrased incomprehensibly.

Last but not least I’m always interested in other topics concerning web programming that I could write about. What are you most interested in? What information do you need most?

Links