Acko.net
25 Mar 2005

PHP, Unicode and ostriches.

Update: I've written a follow-up post that describes how I would like PHP's encoding support to be.

As the resident encoding geek on the Drupal team, it's usually my job to make sure Drupal handles encodings and Unicode correctly. I don't mind doing this, but PHP doesn't exactly make it easy. This became very obvious with the new, Unicode-aware search.module for Drupal 4.6: we've had to raise the minimum required version of PHP to 4.3.3, because the UTF-8 support in the Perl-compatible regular expressions in PHP 4.3.2 and earlier is completely broken. And now I've had a bug report from someone on PHP 4.3.8 who still had problems getting it to work.

I don't know exactly why, but as far as encodings go, PHP is still in the Stone Age. This is odd, as you'd expect a web-oriented scripting language to have excellent support for sharing and exchanging textual information. There is a multi-byte string extension (mbstring), but it's not available on 90% of PHP hosts out there, and it's more of a black-box library anyway: it does not present your strings to you as Unicode code points, but still as an array of bytes. Furthermore, if you actually enable the mbstring overrides, you lose the ability to work with bytes at will. Apparently, the PHP team still hasn't figured out that bytes and characters are not the same. The other extensions that deal with encodings (iconv, recode) are also unavailable on the majority of PHP installs out there.
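To make the byte/character distinction concrete, here is a minimal sketch that needs no extensions. The helper name utf8_codepoint_count is made up for this example, and it assumes its input is already valid UTF-8.

```php
<?php
// "café": the "é" is U+00E9, which UTF-8 encodes as two bytes (0xC3 0xA9).
$text = "caf\xC3\xA9";

// strlen() counts bytes, not characters.
$byteCount = strlen($text); // 5

// Hypothetical helper: count code points by counting every byte that is
// NOT a continuation byte (continuation bytes look like 10xxxxxx).
// Assumes valid UTF-8; it performs no validation.
function utf8_codepoint_count($s) {
    $count = 0;
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        if ((ord($s[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

$charCount = utf8_codepoint_count($text); // 4
```

The same four-character string is five bytes long, which is exactly the difference that plain PHP string functions cannot see.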

This means that if you want to make a PHP application which supports any language and runs on the average PHP host out there, there's only one option: use UTF-8 internally, and write your own functions for string truncation, email header encoding, validation, etc. Using UTF-8 ensures that you only have one encoding to worry about, and because it's Unicode, it can represent any language. Of course, you will no longer be able to do something as simple as upper/lowercasing a string, as PHP's built-in functions for this don't understand UTF-8 at all.
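As an illustration of the kind of function you end up writing yourself, here is a sketch of byte-limited truncation that never cuts a character in half. The name utf8_truncate_bytes is invented for this example (it is not Drupal's implementation), and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: truncate $s to at most $maxBytes bytes without
// splitting a multi-byte UTF-8 sequence. Assumes valid UTF-8 input.
function utf8_truncate_bytes($s, $maxBytes) {
    if ($maxBytes <= 0) {
        return '';
    }
    if (strlen($s) <= $maxBytes) {
        return $s;
    }
    $cut = $maxBytes;
    // If the first dropped byte is a continuation byte (10xxxxxx), the cut
    // falls inside a character: back up to that character's lead byte.
    while ($cut > 0 && (ord($s[$cut]) & 0xC0) === 0x80) {
        $cut--;
    }
    return substr($s, 0, $cut);
}

// "cafés" is 6 bytes; a plain substr($s, 0, 4) would leave a dangling 0xC3.
$short = utf8_truncate_bytes("caf\xC3\xA9s", 4); // "caf"
```

A naive substr() at the same offset would produce an invalid byte sequence, which is how broken characters end up in teasers and database columns.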

What PHP needs is Unicode string support in the core, along with a good library of useful functions for handling the very large Unicode character range efficiently. ASP, Perl, Python, Java all have it... for me, it's the only thing that would've made PHP5 worth upgrading to.

It's as if the entire PHP team has stuck their heads in the sand, hoping that all this Unicode stuff will somehow blow over. It won't.

06 Apr 02:35

I never needed to work with

by Anonymous

I never needed to work with strings as bytes so mbstring works for me (but the way PCRE works with Unicode chars is ridiculous).

Essentially, both PHP and Ruby lag severely behind in multibyte string support (Perl has it wonderfully done, Python has it, Java has it... good lord, who doesn't!)

However, I think that if you want your PHP app to run with UTF-8, you can write this in the readme, for instance: "If you need multibyte support, make sure you have the mbstring extension in your PHP". This is a fair deal (and I haven't seen many places lately where this extension wasn't included).

06 Apr 19:06

Bytes vs characters

by Steven

The problem is that with mbstring, all your strings are encoded. You can no longer access them as bytes, because characters and bytes are the same in PHP.

Text strings are not the same as byte arrays, and PHP needs to implement that distinction.

03 Aug 00:30

I agree

by Anonymous

As I speak, I am writing a homebrew UTF-8 to UTF-16 conversion function for PHP... how exciting

03 Aug 05:51

UFPDF

by Steven

There is code for that in UFPDF, found elsewhere on this site.
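The core of such a conversion is small. The following is only a sketch and is not the UFPDF code: the function name utf8_to_utf16be_sketch is made up, it assumes well-formed UTF-8 input, and it emits big-endian UTF-16 without a byte order mark.

```php
<?php
// Hypothetical helper, not taken from UFPDF. Assumes well-formed UTF-8;
// malformed input is not detected and produces garbage.
function utf8_to_utf16be_sketch($s) {
    $out = '';
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $b = ord($s[$i]);
        // The lead byte tells us how many continuation bytes follow.
        if ($b < 0x80) {
            $cp = $b;
            $extra = 0;
        } elseif (($b & 0xE0) === 0xC0) {
            $cp = $b & 0x1F;
            $extra = 1;
        } elseif (($b & 0xF0) === 0xE0) {
            $cp = $b & 0x0F;
            $extra = 2;
        } else {
            $cp = $b & 0x07;
            $extra = 3;
        }
        $i++;
        // Each continuation byte contributes its low 6 bits.
        for ($j = 0; $j < $extra && $i < $len; $j++, $i++) {
            $cp = ($cp << 6) | (ord($s[$i]) & 0x3F);
        }
        if ($cp >= 0x10000) {
            // Outside the Basic Multilingual Plane: emit a surrogate pair.
            $cp -= 0x10000;
            $out .= pack('nn', 0xD800 | ($cp >> 10), 0xDC00 | ($cp & 0x3FF));
        } else {
            $out .= pack('n', $cp);
        }
    }
    return $out;
}

// U+1D11E (musical G clef) is F0 9D 84 9E in UTF-8.
$hex = bin2hex(utf8_to_utf16be_sketch("\xF0\x9D\x84\x9E")); // "d834dd1e"
```

The pack() format 'n' writes an unsigned 16-bit big-endian value; switching it to 'v' would give little-endian output instead.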

17 Sep 17:45

I wholeheartedly agree

by Wiseman

I agree. Any language's one and only character set should be Unicode 4.1, encoded in either UTF-16 or UTF-32 for faster string operations (UTF-16 is OK as long as you account for surrogate pairs). Then if you want to use any legacy character set (you shouldn't, but still), you should be able to configure HTTP output conversion, CLI output conversion, and use iconv.

Nowadays, you can have a reasonable degree of Unicode support in PHP, but it's far from perfect. You have to use mbstring, enable all overrides, add the /u flag to PCRE patterns if you use PCRE, and make sure you don't use { } to address bytes of a string unless you know what you're doing.
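A small sketch of what the /u flag changes, assuming a PHP build whose PCRE library has working UTF-8 support (the variable names are just for illustration):

```php
<?php
$text = "na\xC3\xAFve"; // "naïve": 5 characters, 6 bytes

// Without /u the dot matches one byte at a time.
$byteMatches = preg_match_all('/./s', $text, $m1); // 6

// With /u the subject is treated as UTF-8 and the dot matches one character.
$charMatches = preg_match_all('/./su', $text, $m2); // 5

// Splitting on the empty pattern with /u yields one array entry per character.
$chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

// With /u, a malformed subject makes the match fail outright (returns false),
// which doubles as a cheap validity check.
$isValid = preg_match('/^./su', "\xC3") === 1; // false: truncated sequence
```

Note that the failure on malformed input is silent unless you compare the return value strictly, so a forgotten check can look like "no match".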

09 Mar 01:26

Unicode URL parameters in PHP

One requirement I have had to handle correctly in PHP is Unicode URL query string parameters.

This is covered here.

Related material can be found at Escape from Atamyrat. Although PHP does now support the "mb...." functions for multi-byte strings, these are not included in PHP by default and therefore may not be available on a third-party hosted system where you do not have full control over the build/version/config, etc. So all the Unicode/UTF-8 handling code which has been or will be posted will not require them.
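As a minimal illustration that needs no mbstring (and not necessarily the approach taken in the linked material): if the application uses UTF-8 throughout, a query string value is just the percent-encoded UTF-8 bytes. The parameter name q below is made up for the example.

```php
<?php
$value = "caf\xC3\xA9"; // "café" as UTF-8

// rawurlencode() works on bytes, so each byte of "é" is escaped separately.
$encoded = rawurlencode($value); // "caf%C3%A9"
$url = '/search?q=' . $encoded;

// Decoding gives back the original UTF-8 bytes.
$decoded = rawurldecode($encoded);
```

This only round-trips cleanly when both ends agree on UTF-8; a browser that submits the form in another encoding will send different bytes.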
