Ok, as I promised, here is how to map CapsLock to Escape in Xorg. This is especially useful for folks who use vim, as CapsLock is on the home row and Escape is not. In fact, Escape is so far away that you need really long fingers to hit it without moving your hand from its usual position.

Paths and line numbers are given for Ubuntu 7.04 Feisty Fawn, so if you’re on another distribution or another version of Ubuntu, your mileage may vary.

What we’ll do is add an XkbOption to the keyboard configuration files. First, add the following at the end of /usr/share/X11/xkb/symbols/capslock:

partial hidden modifier_keys
xkb_symbols "escape" {
    key <CAPS> {        [       Escape  ]       };
    key <ESC>  {        [       None    ]       };
};

This defines an option that maps the CAPS and ESC key codes to the Escape and None symbols respectively. If you don’t want to disable the original Escape key, leave the corresponding line out. Now let’s give it an option name: insert the following line into /usr/share/X11/xkb/rules/base, somewhere around line 810, where the other capslock lines are:

caps:escape          =       +capslock(escape)

Now let’s add that option to the keyboard configuration in /etc/X11/xorg.conf: Find the InputDevice section for the keyboard and add the following line to the section:

        Option          "XkbOptions"    "caps:escape"

Now restart X by logging out and pressing Ctrl-Alt-Backspace at the login prompt. That’s it. CapsLock is gone, and Escape has taken its place.

I wasn’t successful in convincing the XKeyboardConfig maintainer Sergej Udaltsov to accept a patch for this, but I have only tried once so far :)

Although I’m from Germany, I live in Novosibirsk at the moment. Novosibirsk is in Russia, so I listen to Russian music. The player I use is muine. Unfortunately, the artist and title information is displayed as garbled characters.

The reason is that Windows software that adds meta tags to music files uses the default Russian 8-bit encoding, CP1251. All ID3 versions except for the newest ones only allow ISO-8859-1 as the tag encoding. So muine, following the standard, interprets the tags as ISO-8859-1. Let’s change that.

I’m using Ubuntu 6.10. Let’s have a look at the muine sources:

~/src/deb$ sudo apt-get install build-essential
~/src/deb$ apt-get source muine
...
dpkg-source: extracting muine in muine-0.8.5
dpkg-source: unpacking muine_0.8.5.orig.tar.gz
dpkg-source: applying ./muine_0.8.5-1ubuntu4.diff.gz
~/src/deb$ cd muine-0.8.5/
~/src/deb/muine-0.8.5$ ls src/
...
AddWindowEntry.cs             DndUtils.cs      Metadata.cs          SkipToWindow.cs
...

The file Metadata.cs looks like it’s responsible for the ID3 tags. Searching it for title shows the following lines:

                // Properties :: Title (get;)
                [DllImport ("libmuine")]
                private static extern IntPtr metadata_get_title (IntPtr metadata);

DllImport tells the runtime that the function declared on the next line is implemented in a native shared library. Here that function is metadata_get_title, and the library is libmuine. Let’s look at that.

~/src/deb/muine-0.8.5$ ls libmuine/
...
gsequence.c  metadata.c   player-gst-0.8.c  rb-cell-renderer-pixbuf.c
...

Searching for title in metadata.c gives us the following line:

        metadata->title = get_mp3_comment_value (tag, ID3_FRAME_TITLE, 0);

Which leads us to get_mp3_comment_value. Let’s look at its definition:

get_mp3_comment_value (struct id3_tag *tag,
                       const char *field_name,
                       int index)
{
...
        frame = id3_tag_findframe (tag, field_name, 0);
...
        field = id3_frame_field (frame, 1);
...
        ucs4 = id3_field_getstrings (field, index);
...
        utf8 = id3_ucs4_utf8duplicate (ucs4);
...
}

get_mp3_comment_value calls a lot of functions whose names start with id3. These functions are not defined in metadata.c. They aren’t defined anywhere in the muine source code:

~/src/deb/muine-0.8.5$ grep -r id3_field_getstrings .
./libmuine/metadata.c:  latin1 = id3_ucs4_latin1duplicate (id3_field_getstrings (field, 0));
./libmuine/metadata.c:  ucs4 = id3_field_getstrings (field, index);

Only calls to that function. In metadata.c there’s an include statement that includes id3tag.h. Looks like what we need. Let’s download the source for the corresponding library:

~/src/deb/muine-0.8.5$ apt-cache search id3tag
libid3tag0 - ID3 tag reading library from the MAD project
libid3tag0-dev - ID3 tag reading library from the MAD project
mp3rename - Rename mp3 files based on id3tags
somaplayer - player audio for the soma suite
~/src/deb/muine-0.8.5$ cd ..
~/src/deb$ apt-get source libid3tag0
...
dpkg-source: extracting libid3tag in libid3tag-0.15.1b
dpkg-source: unpacking libid3tag_0.15.1b.orig.tar.gz
dpkg-source: applying ./libid3tag_0.15.1b-8.diff.gz
~/src/deb$ cd libid3tag-0.15.1b/
~/src/deb/libid3tag-0.15.1b$ grep -r id3_field_getstrings .
...
./field.c:id3_ucs4_t const *id3_field_getstrings(union id3_field const *field,
...

The function is defined in field.c. It accesses an array, stringlist. That array is filled in the function id3_field_parse. This function calls another function, id3_parse_string, which extracts the string values of a field.

~/src/deb/libid3tag-0.15.1b$ grep -r id3_parse_string *
parse.c:id3_ucs4_t *id3_parse_string(id3_byte_t const **ptr, id3_length_t length,
parse.h:id3_ucs4_t *id3_parse_string(id3_byte_t const **, id3_length_t,

This function is defined in parse.c. For ISO-8859-1 fields it calls id3_latin1_deserialize.

~/src/deb/libid3tag-0.15.1b$ grep -r id3_latin1_deserialize *
latin1.c:id3_ucs4_t *id3_latin1_deserialize(id3_byte_t const **ptr, id3_length_t length)
...

id3_latin1_deserialize is defined in latin1.c. It calls id3_latin1_decode to convert the latin1 string to UCS-4, which in turn calls id3_latin1_decodechar to convert a single character. We’re there: we have found the place we have to change:

/*
 * NAME:        latin1->decodechar()
 * DESCRIPTION: decode a (single) latin1 char into a single ucs4 char
 */
id3_length_t id3_latin1_decodechar(id3_latin1_t const *latin1,
                                   id3_ucs4_t *ucs4)
{
  *ucs4 = *latin1;

  return 1;
}

The function is very simple: ISO-8859-1 is a subset of Unicode, so only a direct assignment is needed. For CP1251 things are different. Looking at the Wikipedia page for CP1251, we see that the letters of the Russian alphabet start at 0xC0 with the upper case letters, followed by the lower case letters up to 0xFF. (The one exception is Ё/ё, which sits apart at 0xA8 and 0xB8.) Using gnome-character-map, we find that the corresponding Unicode code points are U+0410 through U+044F and that the letters are in the same order. Very convenient. Let’s change the function to return the correct Unicode values for the CP1251 letters 0xC0 through 0xFF:

id3_length_t id3_latin1_decodechar(id3_latin1_t const *latin1,
				   id3_ucs4_t *ucs4)
{
  if (*latin1 >= 0xc0)
    *ucs4 = 0x410 + (*latin1 - 0xc0);
  else
    *ucs4 = *latin1;

  return 1;
}

The Unicode encoding used here, UCS-4, just packs the Unicode code point into a 32-bit integer, so we can assign the Unicode value directly. Now on to compiling the changed libid3tag.

~/src/deb/libid3tag-0.15.1b$ sudo apt-get build-dep libid3tag0
...
~/src/deb/libid3tag-0.15.1b$ sudo apt-get install fakeroot
...
~/src/deb/libid3tag-0.15.1b$ fakeroot dpkg-buildpackage -uc -us
...
~/src/deb/libid3tag-0.15.1b$ sudo dpkg -i ../libid3tag0_0.15.1b-8_i386.deb
...

Now let’s delete the muine song database so that it re-reads the metadata.

~/src/deb/libid3tag-0.15.1b$ rm ~/.gnome2/muine/*

Start muine and import the music file:

Победа! :)

Update: Added install of build-essential at the beginning.

In short: Before you switch between branches, delete all files that are ignored by Subversion. If you don’t, you will receive errors like these:

svn: Won't delete locally modified directory '.'
svn: Left locally modified or unversioned files

svn status looks weird after such a failed switch:

markus@markus:/var/www/community$ svn status
!      .
    S  app
!   S  admin
!      admin/tmp
?      admin/tmp/cache/models
!      admin/tmp/cache
    S  sql
...

What’s going on? Let’s assume that I checked out a branch and want to switch to the trunk. In the branch, a directory admin/tmp/cache/models was added and its svn:ignore property was set to *, that is, all files inside that directory are ignored. Since then, files have appeared inside this directory. I issue the command to switch to the trunk and get the above error message. That means that there were some files inside models that are ignored by Subversion. The switch operation deleted the .svn directory in admin/tmp/cache/models but didn’t remove the directory itself, because it won’t delete files that it doesn’t control, specifically the ignored files in admin/tmp/cache/models. How do you get out of that situation?

First, switch back to where you were, i.e. to the branch. This will give you the following error:

...
svn: Failed to add directory 'admin/tmp/cache/models': object of the same name already exists

svn status looks much better now:

markus@markus:/var/www/community$ svn status
!      .
!      admin
!      admin/tmp
?      admin/tmp/cache/models
!      admin/tmp/cache

Now delete the unversioned models directory and switch again to the branch:

...
A    admin/tmp/cache/models
...

Now delete all ignored files by hand. To find out which ignored files exist, use svn status --no-ignore. Note that ignored files interfere with the switch only if they are in a directory that exists only in the branch. Now the switch to the trunk should work without a glitch.

UTF-8, a Unicode encoding, is probably already the most used character encoding for new web applications, except maybe in Asia. The most popular open source database is MySQL. (But don’t miss PostgreSQL, the most advanced open source database, which I prefer.)

What do you need to do to have your database and web application be all UTF-8? MySQL offers a lot of places to configure the character set (character encoding) and the collation (used for sorting and comparing text).

The most important rule is: Don’t rely on server configuration. You may not control it on the server your application will be running on. When you write a web application that needs to run on a variety of operating systems and Linux distributions, all with their own default database configuration, you must make as few assumptions as possible about the system it will run on and its configuration. You may have configured your own MySQL server to run your application correctly, but you may not have the permission or the opportunity to reconfigure the MySQL server your application will eventually run on. Fortunately, you can specify all character set configuration in MySQL in places that you control: in the SQL scripts and the source code of your web application.

Stored data

One side of the problem is the data in the tables. If you control the CREATE DATABASE statement, you should specify the character set there:

CREATE DATABASE webapp
        DEFAULT CHARACTER SET utf8;

On some web hosts, web applications have to use the one database that came with the hosting. The database has already been created for you. In that case you have to specify the character set in the individual CREATE TABLE statements:

CREATE TABLE gadgets (
    name VARCHAR(255) PRIMARY KEY,
    rating INT
) DEFAULT CHARACTER SET utf8;

It doesn’t hurt to specify the character set in both locations. (Except that you’re violating the DRY principle.) Now MySQL knows that the data in your tables is in UTF-8.

Communication with the database

The other side of the problem is the data that comes from and gets sent to the client. MySQL offers a lot of features here; you can have different character sets at almost every stage of data processing. To be all UTF-8, issue the following statement just after you’ve made the connection to the database server:

SET NAMES utf8;

This sets the character_set_client, character_set_connection and character_set_results variables to utf8. See below for the meaning of each of these variables.

Communication with the database also concerns SQL files you read with the MySQL command line client, or upload with phpMyAdmin. Put the statement at the top of every SQL file, like this:

SET NAMES utf8;
INSERT INTO gadgets (name, rating) VALUES ('iPod', 45);

If you’re talking to mysql from a command line that doesn’t understand UTF-8 (likely in Windows and older Linuxes), use the following statement to tell MySQL which character set you’re using on the client side:

SET CHARACTER SET cp1250;

This sets the character_set_client and character_set_results variables to cp1250 (and character_set_connection to the value of character_set_database). When your data arrives at the server, it is converted from CP1250 to UTF-8. Results returned to you are converted from UTF-8 to CP1250.

Which character set do you need for the command line?

utf8: Modern Linux (Fedora, Ubuntu since 5.04, recent SuSE, recent Mandriva) for all languages. No need to issue a SET CHARACTER SET command here.
latin1: Western European Windows and Linux. English, Spanish, German, French, …
cp1250: Central European Windows. Polish, Czech, Slovak, Hungarian, Slovene, Croatian, Romanian, Albanian.
latin2: Central European Linux. Polish, Czech, Slovak, Hungarian, Slovene, Croatian, Romanian, Albanian.
cp1251: Cyrillic Windows. Russian, Ukrainian, Bulgarian, Belarusian, …
koi8r: Russian Linux.
koi8u: Ukrainian Linux.
cp1256: Arabic Windows.
cp1257: Baltic Windows. Estonian, Latvian, Lithuanian.
latin7: Baltic Linux. Estonian, Latvian, Lithuanian.
latin5: Turkish Linux.

Character set and collation variables

What do these character_set_* variables mean?

Variables concerning communication

character_set_client: This tells MySQL which character set the data coming from the client is encoded in.
character_set_connection: This is the character set MySQL converts incoming data to. It converts from character_set_client.
character_set_results: This is the character set MySQL converts outgoing data to. It converts from the character set specified for the data in the database / tables / columns.

Variables concerning data storage

character_set_server: The character set used for new databases if none is specified in the CREATE DATABASE statement. It can be set from the server configuration file, on the command line for mysqld and interactively in a database session.
character_set_database: The character set used for new tables if none is specified in the CREATE TABLE statement. It is set to character_set_server by default.
character_set_system: The character set for meta-data: database, table and column names. Its value is always utf8.

Feedback

Please leave a comment if this information was helpful to you. Also don’t hesitate to ask questions concerning the above. I’m not the best writer and some things are probably phrased incomprehensibly.

Last but not least I’m always interested in other topics concerning web programming that I could write about. What are you most interested in? What information do you need most?
