Acko.net

Proposal for Implementing Unicode in PHP

2005-06-03T00:00:00+02:00

On the Drupal team, I am known as an encoding nut: whenever there's an encoding issue or a question about Unicode, people tend to knock on my door. Usually any fix or answer from me comes with a lot of cursing at the unfortunate inquirer about how "PHP is horrible when it comes to string handling" and how it seems that "the entire PHP dev team has its head planted firmly into the ground when it comes to Unicode".

To which the reply is, more often than not: "Why don't you fix it yourself?".

Well, I'm not a PHP language developer. To be honest I have no interest or time for becoming one. But I do know a lot about encodings and Unicode, so I decided to write this article describing the problem and possible solutions. That way, maybe others can take some of these ideas and put them into practice. At the very least, it should answer a lot of questions that people have about Unicode and PHP.

Right now, the message from the PHP developers seems to be that "PHP supports Unicode, but some assembly is required". In fact, it is a lot worse. Please, read on.

About encodings and Unicode

First, I recommend that anyone reading this article read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky. It is an excellent introduction to Unicode and encodings in general. Note also that the article was written in 2003 and specifically mentions PHP's Unicode support being hopeless. It is now two years later and the situation has not changed much.

The only important thing about Unicode which isn't explained in Joel's article is that Unicode is in fact more than just a big table which maps characters to numbers: it is also a set of character properties, recommendations and algorithms on how those characters should be used. And this is why Unicode needs (and deserves!) much more attention than any other character set.

What is the current situation?

As far as PHP is concerned at the moment, a character consists of 8 bits and a string is a series of characters. This is good enough for legacy 8-bit encodings (like the common ISO-8859-1 or Latin-1 encoding used in Western Europe), but does not cater to more complicated encodings.
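
To make the byte-versus-character distinction concrete, here is a minimal sketch. The helper count_utf8_chars() is hypothetical (it is not a PHP built-in) and assumes its input is valid UTF-8.

```php
<?php
// "v\xC3\xA3lue" is the word "vãlue" encoded as UTF-8: 5 characters, 6 bytes.
$text = "v\xC3\xA3lue";

// strlen() counts bytes, not characters.
$byteLength = strlen($text);

// Hypothetical helper: count characters by skipping UTF-8 continuation
// bytes, which always have the bit pattern 10xxxxxx.
// Assumes valid UTF-8 input; performs no validation.
function count_utf8_chars($bytes) {
    $count = 0;
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        if ((ord($bytes[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

$charLength = count_utf8_chars($text);
echo "$byteLength bytes, $charLength characters\n";
```

The two numbers only agree for pure ASCII text, which is why byte-oriented string functions appear to work until the first accented letter shows up.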

To accommodate those, the multibyte string extension (Mbstring) can be used. This extension was originally developed for handling Japanese encodings, but it has now been extended to support many more encodings, including the Unicode Transformation Formats (like the popular UTF-8). Mbstring provides encoding-aware versions of many of PHP's string functions (substr(), strlen(), ereg(), ...). Through a feature called overloading, you can tell PHP to always use the Mbstring version of a function if there is one.

Aside from Mbstring, there are a few other libraries and extensions which may be used to provide encoding- and Unicode-related services, like Imap, Iconv or GNU Recode.

What problems are there with the current approach?

PHP itself still doesn't know anything about encodings or Unicode. Aside from function calls, there are other ways of interacting with strings in PHP. For example, there is the {} operator for selecting characters from strings, as if they were arrays. And like in most programming languages, you can define strings in code with the familiar quote syntax. But all of these methods work with literal bytes, not with actual encoded characters.

PHP source code itself must be encoded in an ASCII-compatible encoding and there is no way to use Unicode codepoints directly. If you want to store a character in a variable, you either have to use a short string of bytes (the encoded representation of the character) or an integer representing the character's Unicode codepoint. But converting between a codepoint and its encoded representation requires ugly work-arounds and wrappers, as PHP itself provides no easy mechanism for doing this.
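
As an illustration of the kind of wrapper this forces you to write, here is a rough sketch of converting between a codepoint and its UTF-8 representation by hand. Both function names are hypothetical, and the decoder assumes well-formed input.

```php
<?php
// Hypothetical helper: encode one Unicode codepoint as UTF-8 bytes.
function codepoint_to_utf8($cp) {
    if ($cp < 0x80) {
        return chr($cp);                               // 0xxxxxxx
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6))                  // 110xxxxx
             . chr(0x80 | ($cp & 0x3F));               // 10xxxxxx
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))                 // 1110xxxx
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))                     // 11110xxx
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: decode the first UTF-8 character in $bytes.
// Assumes well-formed, non-empty input; performs no validation.
function utf8_to_codepoint($bytes) {
    $lead = ord($bytes[0]);
    if ($lead < 0x80) {
        return $lead;
    }
    if ($lead < 0xE0) {
        $cp = $lead & 0x1F;
        $extra = 1;
    } elseif ($lead < 0xF0) {
        $cp = $lead & 0x0F;
        $extra = 2;
    } else {
        $cp = $lead & 0x07;
        $extra = 3;
    }
    // Each continuation byte contributes its low 6 bits.
    for ($i = 1; $i <= $extra; $i++) {
        $cp = ($cp << 6) | (ord($bytes[$i]) & 0x3F);
    }
    return $cp;
}

echo bin2hex(codepoint_to_utf8(0xE3)), "\n";           // U+00E3 is ã
echo dechex(utf8_to_codepoint("\xE2\x82\xAC")), "\n";  // the euro sign
```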

PHP does not guarantee anything about the local setup as far as encoding support goes. All the actual encoding functionality is located in libraries or extensions which may not be present on the average PHP install or which may be outdated. This makes it very difficult to make Unicode-compatible PHP programs work everywhere. One of PHP's assets is its large install base, yet the large majority of those installs are completely unsuited for Unicode work. At the time of writing this article, the latest PHP (5.0.4) still does not enable the Mbstring extension by default.

A trickier example: in Drupal 4.6.0 we depend on the Perl-compatible Regular Expression Library's support for Unicode and UTF-8. This was supposedly present since PHP 4.1 (exception: since PHP 4.2.3 on Windows). But actual testing shows that it took until PHP 4.3.3 for this library to know how to deal correctly with UTF-8 and the full Unicode range. But even now, PHP still has the ability to use the system-provided PCRE library, which can still be compiled without UTF-8 support. This can result in unsupported installs even for those using the latest PHP version.
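
One way to cope is to probe for working UTF-8 support at runtime instead of trusting the version number. The sketch below is only an illustration of that idea; pcre_supports_utf8() is a hypothetical name, not the check Drupal actually ships.

```php
<?php
// Hypothetical helper: report whether preg_* can handle UTF-8 patterns.
// With working /u support, "." matches the whole two-byte character ã,
// so the anchored pattern matches. Without it, preg_match() fails and
// the @ operator suppresses the resulting warning.
function pcre_supports_utf8() {
    return @preg_match('/^.$/u', "\xC3\xA3") === 1;
}

if (pcre_supports_utf8()) {
    echo "PCRE UTF-8 support looks usable\n";
} else {
    echo "PCRE UTF-8 support is missing or broken\n";
}
```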

When you use Mbstring overloading, you can no longer easily work with strings of binary data. Mbstring overloading sounds nice in theory, as it gives you smarter string functions for free without having to adapt your code. However, this feature denies a basic fact: text strings are fundamentally different from binary data. If this sounds strange to you, consider this:
- Binary data requires no meta-information about its encoding and can be passed around freely. Operations on two byte arrays are guaranteed to work. Text, on the other hand, is always encoded in a particular way. Text operations can only work if the encoding is known and verified to be the same for all operands involved.
- Binary data can contain arbitrary bits, while most text encodings have a much more limited syntax. Take a look at UTF-8's bit patterns for example. However, even plain US-ASCII text has historically had the limitation that it may not contain the NULL character.
- Binary data has no intrinsic semantic meaning, while text does. Many operations (like case conversion) only make sense on text, while other operations become much more complicated (e.g. text sorting needs to take local conventions into account). Specifically, there are a lot of Unicode algorithms for advanced text processing (e.g. the Bidirectional Algorithm for handling text with mixed writing directions).

Because text has been 8-bit encoded for a long time, a lot of programmers don't think twice about using text functions for dealing with binary data and vice-versa. But this assumption is no longer valid today.
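
The difference between arbitrary bytes and encoded text is easy to demonstrate. This sketch assumes a PCRE build with UTF-8 support, where the /u modifier rejects malformed input; is_valid_utf8() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: true if $bytes is well-formed UTF-8.
// Relies on preg_match() failing (returning false) when the subject
// is not valid UTF-8 and the /u modifier is used.
function is_valid_utf8($bytes) {
    return preg_match('//u', $bytes) === 1;
}

$text   = "v\xC3\xA3lue";        // "vãlue" as UTF-8
$binary = "\x89PNG\r\n\x1A\n";   // the first bytes of a PNG file

var_dump(is_valid_utf8($text));   // well-formed text
var_dump(is_valid_utf8($binary)); // 0x89 is a continuation byte with no lead byte
```

Any byte sequence is valid binary data, but only some byte sequences are valid text in a given encoding, which is why the two cannot share one set of functions safely.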

If Mbstring overloading is enabled and a PHP programmer wants to perform operations on binary data, (s)he has to temporarily trick PHP into using a simple 8-bit encoding (like ISO-8859-1). Quite possibly, locale settings have to be changed back and forth as well. This results in bloated, complicated code.

PHP's string functions don't form a clean, consistent API. There is no consistent naming convention (e.g. substr(), str_replace(), convert_cyr_string(), parse_str(), sprintf(), ...).

There are also a bunch of hodge-podge functions which are only useful in very specific situations and/or which are tied to a particular encoding (e.g. utf8_encode()) or locale (e.g. ucfirst()).

Finally, though some functions take an encoding argument to allow for some encoding support, this is rare and inconsistent. For example, while the htmlentities() function supports several encodings, the utility function get_html_translation_table() which fetches its translation table does not.

PHP's locale mechanism is completely platform-dependent and offers no guarantees. The locale identifiers passed to setlocale() differ completely between Windows and Unix platforms, but even between similar Unix platforms there is no guarantee of which locales are available. The dependency of PHP on system locales also means that you are restricted to whatever encodings the system locales are available in.

PHP's XML parser is notorious for violating the specifications when it comes to encodings. In today's web, XML is everywhere in the form of XHTML, RSS feeds, OPML, etc. Being able to parse XML correctly is essential to any PHP application. A significant portion of the XML specification talks about encodings and how to deal with them, but PHP does not implement them correctly.

For example, if an XML document starts with a UTF-8 signature (in the form of the byte-order mark), PHP5's parser will die if it is told the document is in UTF-8 encoding. Similar simple, but critical bugs have had to be worked around by PHP programmers in the past. Before PHP5, absolutely no encoding autodetection was present in the XML parser: this had to be done by the code invoking the parser.
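
A typical work-around for the byte-order mark problem is to strip the signature before handing the document to the parser. A minimal sketch, with strip_utf8_bom() as a hypothetical helper name:

```php
<?php
// Hypothetical helper: drop a leading UTF-8 signature (bytes EF BB BF).
function strip_utf8_bom($data) {
    if (substr($data, 0, 3) === "\xEF\xBB\xBF") {
        // The cast guards against older PHP versions returning false
        // when nothing follows the signature.
        return (string) substr($data, 3);
    }
    return $data;
}

$raw   = "\xEF\xBB\xBF<doc>v\xC3\xA3lue</doc>";
$clean = strip_utf8_bom($raw);
echo strlen($raw) - strlen($clean), " bytes removed\n";
```

The cleaned string can then be passed to xml_parse() as usual.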

Mbstring is a pragmatic library, not a fully featured Unicode solution. Example limitations include not being able to specify characters beyond U+FFFF for some functions (e.g. mb_substitute_character()) or the way mb_strwidth() seems to be hardcoded for Japanese only (there are no zero widths for combining accents?).

All of these problems together mean that it is very hard at the moment to write PHP software which can support encodings and Unicode. Even worse, if this software has to run on a typical PHP install, then you can forget about implementing anything more than simple pass-through behaviour as far as text is concerned.

Proposed solution

Unfortunately, PHP is very hot on backwards compatibility, so significant changes to the existing string API are pretty much out of the question. New types and APIs need to be introduced which offer a complete, consistent and flexible solution for dealing with encodings and Unicode.

PHP needs a new Unicode text string type which is separate from the classic byte string. This type, let's call it ustring, would represent a string of Unicode text.

Internally, it would be stored using one of the UTFs. In the interests of internal processing efficiency, UTF-16 is probably the best choice, but UTF-8 can be considered as well as it is the most popular UTF on the web today. In that case, outputting UTF-8 could be done without any conversion. On the other hand, the complicated bit patterns and variability of UTF-8 mean that it is harder to find character boundaries and such. Looking at how languages like Perl and Python approach this is a good idea. After all, they've had Unicode strings for quite some time.

To distinguish ustrings from plain strings when defined, a syntax similar to C could be introduced, for example U"This is a Unicode string". This syntax would support \u####, \U######## and \x{#..} notation for defining characters by codepoint inside the string.

Using the {} operator on a ustring would return ints, not chars. To reduce confusion, perhaps a uchar type could be introduced specifically for handling Unicode codepoints. As the Unicode codespace is only 21 bits wide, there would be subtle differences between uchar and int, though both would probably be stored as 32-bit.

For backwards compatibility, plain quoted strings would remain used for byte strings, although it might be interesting to define a B"This is a byte string" notation, while providing a configurable option for choosing which type of string is assumed when there is no prefix. As Unicode usage would become more widespread, it would be nice to not have to litter your code with U's everywhere.

Though the internal encoding would be fixed to one of the UTFs, the external encoding might vary (and would be configurable through an API). When casting a ustring to a string, a conversion would take place from the internal encoding to the external one, and vice-versa. It remains to be seen which type takes precedence when both are mixed together (e.g. $string = U"Unicode" . "Bytes").

PHP needs a new Unicode string API. This API would contain a selection of functions from both the plain String API as well as the Mbstring API, but would have a simpler and more logical naming convention. For example, all ustring functions could start with ustr_. Each of these would accept a ustring where the current ones accept a plain string.

External APIs, like the PCRE library, could choose whether to accept string, ustring or both. For example for PCRE, it makes sense to replace the PHP-proprietary /u modifier with a simple string type check instead.
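
To give a feel for the ustr_ naming convention, here is a userland sketch of two such functions. These are hypothetical stand-ins, not an existing PHP API: they operate on ordinary byte strings that are assumed to hold UTF-8, and they lean on PCRE's /u modifier rather than a real ustring type.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
function ustr_chars($utf8) {
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        // preg_split() fails when the input is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $chars;
}

// Hypothetical: length in characters rather than bytes.
function ustr_length($utf8) {
    return count(ustr_chars($utf8));
}

// Hypothetical: substring by character offsets rather than byte offsets.
function ustr_substr($utf8, $start, $length = null) {
    return implode('', array_slice(ustr_chars($utf8), $start, $length));
}

$word = "v\xC3\xA3lue";                 // "vãlue"
echo ustr_length($word), "\n";          // counts characters
echo bin2hex(ustr_substr($word, 1, 1)), "\n"; // the bytes of ã
```

A native implementation would of course avoid re-splitting the string on every call; the point here is only the shape of the API.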

PHP needs to ensure that a baseline set of encoding-related functions is always available. I believe the Iconv extension has been standard since PHP5, but things like complete UTF-8 support in PCRE are important too. This allows programmers to write their code in a straightforward fashion without having to check for a gazillion exceptions or exotic configurations.

PHP needs an independent locale library across all platforms. This ensures consistent handling of locales and no longer limits PHP to what the platform supports. The International Components for Unicode (ICU) are an excellent candidate.

The choice to limit this new string functionality to Unicode strings might seem elitist: after all, the idea of Unicode is not to get rid of other encodings, but merely to ensure compatibility. Non-Unicode encodings will keep fulfilling an important role in the years to come. On the other hand, as Unicode is guaranteed to be a perfect intermediate format, it makes sense to use it for internal string handling. It limits the functionality that has to be dealt with and creates a common baseline to work with.

Finally, as the original String and Mbstring APIs would not be altered by these changes, programmers would be free to use the 'old school' way of dealing with strings. They would simply not be able to take advantage of the cleaner API and consistent locales.

PHP, Unicode and ostriches.

2005-03-25T00:00:00+01:00

Update: I've written a follow-up post that describes how I would like PHP's encoding support to be.

As the resident encoding geek on the Drupal team, it's usually my job to make sure Drupal handles encodings and Unicode correctly. I don't mind doing this, but PHP doesn't exactly make it easy. With the new search.module for Drupal 4.6 being Unicode-aware, this has become very obvious, as we've had to bump up the minimum required version of PHP to 4.3.3. The UTF-8 support in the Perl-compatible regular expressions in PHP 4.3.2 and earlier is completely broken. And now I've had a bug report about someone on PHP 4.3.8 who still had problems getting it to work.

I don't know why exactly, but as far as encodings go PHP is still in the Stone Age. This is odd, as you'd expect a web-oriented scripting language to have excellent support for sharing and exchanging textual information. There is a multi-byte string extension available, but it's not available on 90% of PHP hosts out there, and it's more of a black-box library anyway: it does not present your strings to you as Unicode character codepoints, but still as an array of bytes. Furthermore, if you actually enable the mbstring overrides, you lose the ability to work with bytes at will. Apparently, the PHP team still hasn't figured out that bytes and characters are not the same. The other extensions which deal with encodings (iconv, recode) are also unavailable on the majority of PHP installs out there.

This means that if you want to make a PHP application which supports any language and runs on the average PHP host out there, there's only one option: use UTF-8 internally, and write your own functions for string truncation, email header encoding, validation, etc. Using UTF-8 ensures that you only have one encoding to worry about, and because it's Unicode it is guaranteed to be able to represent any language. Of course, you will no longer be able to do something as simple as upper/lowercasing a string, as these PHP functions don't handle UTF-8 at all.
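
String truncation is a good example of what "write your own functions" means in practice. The sketch below is a hypothetical helper, not Drupal's actual code: it cuts a string to a byte limit without splitting a multi-byte character, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: return at most $maxBytes bytes of $utf8,
// never ending in the middle of a character.
// Assumes valid UTF-8 input and a non-negative $maxBytes.
function truncate_utf8_bytes($utf8, $maxBytes) {
    if (strlen($utf8) <= $maxBytes) {
        return $utf8;
    }
    $end = $maxBytes;
    // If the first byte that would be dropped is a continuation byte
    // (10xxxxxx), the cut falls inside a character: back up until the
    // cut sits just before that character's lead byte.
    while ($end > 0 && (ord($utf8[$end]) & 0xC0) === 0x80) {
        $end--;
    }
    return substr($utf8, 0, $end);
}

$word = "v\xC3\xA3lue";                           // "vãlue", 6 bytes
echo bin2hex(truncate_utf8_bytes($word, 2)), "\n"; // a plain substr() would split ã here
echo bin2hex(truncate_utf8_bytes($word, 3)), "\n";
```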

What PHP needs is Unicode string support in the core, along with a good library of useful functions for handling the very large Unicode character range efficiently. ASP, Perl, Python, Java all have it... for me, it's the only thing that would've made PHP5 worth upgrading to.

It's as if the entire PHP team has stuck their head in the ground, hoping that all this Unicode stuff will somehow blow over. It won't.

Sprankle Character Map

2004-12-23T00:00:00+01:00

It hit me a while ago that entering characters which are not available on your keyboard or through your IME is much too complicated. Usually it involves opening up some character map, scrolling through hundreds of symbols to find the one you need and copy/pasting it into the application of your choice.

Not very handy. Enter Sprankle Character Map. The idea is to hit a special key combination when typing (WIN + S for Sprankle) which pops up a character map where you are typing. You then type a symbol to find similar characters and choose one from the list using either numbers or arrows + space. Here's how it looks.

This is just a prototype, but it demonstrates the idea nicely and it's actually pretty usable. Certainly better than firing up a full character map every time.

Notes:

Sprankle is a Unicode-application and only runs on Windows 2000/XP.
The map appears on top of the current text field. For large, multi-line text fields this is far from ideal. It would be better to have it appear at the current caret position.
Sprankle doesn't work on Mozilla Firefox (or other applications that do special keyboard processing). If anyone has an idea on how to fix this, please tell.
It might be better to implement Sprankle as a real IME so it integrates completely with the text field. I have no idea how to do this, but I'm sure MSDN has some documentation about it. The downside would be that it might not work in combination with existing IMEs (e.g. for Japanese).
Many of the symbols in the character set are not present in most fonts. Sprankle currently looks for Arial Unicode MS, the universal font that comes with XP and Office.
It might be cool to make a JavaScript version of this, so it can be integrated on websites with CMSes like Drupal.
You can customize Sprankle's character sets by editing sprankle.txt (UTF-16LE encoded). Right now it covers most of the Latin characters, basic Greek plus some math symbols.

Download Sprankle (source + win32 binary).

UFPDF: Unicode/UTF-8 extension for FPDF

2004-09-01T00:00:00+02:00

Note: I wrote UFPDF as an experiment, not as a finished product. If you have problems using it, don't bug me for support. Patches are welcome, but I don't have much time to maintain this.

FPDF is a PHP class for generating PDF files on-the-fly. Unfortunately it does not support Unicode. So I've coded UFPDF, an extension of FPDF which accepts input in UTF-8.

Only TrueType fonts are supported for now. To embed .TTF files, you need to extract the font metrics and build the required tables using the provided utilities (see README.txt). Included is a modified version of TTF2PT1 which extracts the Unicode glyph info.

UFPDF works the same as FPDF, except that all text is in UTF-8, so consult the FPDF documentation for usage.

Download UFPDF Example PDF

UTF-8 conversion support for mIRC

2004-07-13T00:00:00+02:00

mIRC's lack of UTF-8 support has been an issue for quite some time. The author promised to 'look at it', but in the meantime, chatting in UTF-8 is not possible. This is problematic for any language that uses more than the occasional accented letter.

So I decided to make a temporary fix myself. The result is a flexible conversion mechanism between UTF-8 and the ANSI codepages. The user sees and types regular ANSI characters, but all data which is sent to and received from the IRC server is UTF-8 encoded. You are still limited to one ANSI codepage though: making mIRC support real Unicode is not possible without an mIRC rewrite.

The script performs a real UTF-8 encoding/decoding, so unlike a simple 'find and replace' approach, characters which do not fit into the current codepage are indicated as such.

I included conversion tables for all of the Windows ANSI codepages:

1250 (ANSI - Central Europe)
1251 (ANSI - Cyrillic)
1252 (ANSI - Western Europe / Latin I)
1253 (ANSI - Greek)
1254 (ANSI - Turkish)
1255 (ANSI - Hebrew)
1256 (ANSI - Arabic)
1257 (ANSI - Baltic)
1258 (ANSI/OEM - Viet Nam)

There is also a little utility (with source) for generating conversion tables for more codepages.

For instructions on how to use it, check the top of the utf-8.mrc file. You can download the script here (19 KB).

Important: This script is provided as-is without any guarantees. Use it if you like it, but don't bug me if you can't get it to work. If you find bugs, feel free to report them, but try to give a little more information than just 'it doesn't work'.

My ideal text editor

2004-02-19T00:00:00+01:00

On the recommendation of a certain evil Norwegian, I gave EditPad Pro a whirl. Took me 10 minutes to remove it again.

Am I too picky? Maybe. Here's what I want from a text-editor (in no particular order):

Runs on Windows 2000. Vent your anti-Microsoft anger somewhere else, I use Windows every day and I'm not likely to switch any time soon.
Native Unicode and UTF-8 support. This is 2004. Unicode has been around for ages, and I see no reason why I should occupy myself with encoding issues. I deal with multiple languages, so Unicode is the only logical choice. Unicode compatibility is no longer a problem thanks to the Microsoft Layer for Unicode (from now on I will shoot everyone who refers to a byte as a 'character'). Note: automatic conversion between Unicode and the current ANSI codepage doesn't cut it (that's what EditPad Pro seems to do).
IME-friendly, with bonus points for an integrated IME. Sometimes I type Japanese, and it requires indirect input and conversion of typed characters. Certain editors I've encountered do weird things that prevent the IME from doing its job, so that's why I mention it explicitly.
Advanced editing for web-development. I do a lot of HTML, CSS, PHP, SQL and JavaScript, so anything that can make coding easier is a plus. The least I want is syntax highlighting, but intelligent auto-completion, validation, previewing and other visual cues are very handy too.
Good user-interface. This one shouldn't really be necessary to mention, but so many programs seem to miss the point here: a program should be easy to use. I'm not going to go into specifics; there are a lot of good references on the subject. Because I'm picky as hell, reconfigurable toolbars, panels and hotkeys score well too. Don't confuse this item with the next one, which is:
Nice to look at. I don't need menus that whiz by, flashy windows with skins or other novelty visual effects, but that doesn't mean my applications can be butt-ugly. Things such as proper spacing and margins, aesthetic proportions and contemporary looks are big pluses.

I don't think these are such crazy demands, so if anyone who has suffered through this rant up to now knows a program which satisfies these conditions, please post a link here ;).

Update: I've settled for Notepad2 for now. It's a small, functional, neat editor and it's open-source too.

Drupal XML weirdness

2003-12-14T00:00:00+01:00

So I was looking at Drupal's import page and noticed non-ASCII characters looked quite botched. Some source viewing revealed the input had apparently been UTF-8 encoded twice (that is UTF'd, then assumed to be ISO-8859-1 and UTF'd again).

The source was an XML feed in UTF8 which looked perfectly fine. I went over import.module and couldn't see any specific UTF8 encoding. Some testing revealed that PHP's XML parser was the culprit:

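<?php
// The XML markup in this string was lost; <data> is a placeholder root element.
$xmlfile = "<data>UTF8 v\xC3\xA3lue</data>";

// Print each chunk of character data the parser hands back.
function handler_data($parser, $data) {
  print "Data: $data\r\n";
}

$xml_parser = xml_parser_create();
xml_set_character_data_handler($xml_parser, "handler_data");
// Ask for character data to be delivered as UTF-8.
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, "utf-8");
xml_parse($xml_parser, $xmlfile, 1);
xml_parser_free($xml_parser);
?>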

The input XML contains the word vãlue and is in UTF8 format. The output encoding is specified as UTF8 as well, so you would expect PHP to print out the value unchanged (i.e. with 2 bytes for the ã character). Not so... PHP incorrectly treats the input as ISO-8859-1 and re-UTF's the input, resulting in 4 bytes for the ã character.
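
The arithmetic behind those 4 bytes can be reproduced without the XML parser. In this sketch, latin1_to_utf8() is a hypothetical helper that does what the parser appears to do: treat every input byte as one ISO-8859-1 character and encode it as UTF-8.

```php
<?php
// Hypothetical helper: encode bytes as UTF-8 on the assumption that
// they are ISO-8859-1, where every byte is exactly one character.
function latin1_to_utf8($bytes) {
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);
        $out .= ($b < 0x80)
            ? chr($b)                                           // ASCII stays as-is
            : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));  // 2-byte sequence
    }
    return $out;
}

$once  = "\xC3\xA3";               // ã, correctly encoded as UTF-8
$twice = latin1_to_utf8($once);    // the same bytes, wrongly encoded again

echo bin2hex($once), " -> ", bin2hex($twice), "\n";
```

Each of the two original bytes is above 0x7F, so each one turns into its own two-byte sequence, giving four bytes in total.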

This is strange because PHP claims to support UTF-8 source encoding.