Hackery, Math & Design

Steven Wittens

Drupal XML weirdness

So I was looking at Drupal's import page and noticed that non-ASCII characters looked quite botched. Viewing the page source revealed that the input had apparently been UTF-8 encoded twice (that is, encoded as UTF-8, then assumed to be ISO-8859-1 and encoded as UTF-8 again).
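To make the double encoding concrete at the byte level, here is a minimal sketch. The helper latin1_to_utf8() is a made-up name for illustration, not part of PHP or Drupal; it only mimics a converter that wrongly assumes every input byte is one ISO-8859-1 character.

```php
<?php
// Hypothetical helper (illustration only): encode a byte string as UTF-8
// while assuming each input byte is a single ISO-8859-1 character.
function latin1_to_utf8($bytes) {
  $out = '';
  for ($i = 0; $i < strlen($bytes); $i++) {
    $b = ord($bytes[$i]);
    if ($b < 0x80) {
      // ASCII maps to itself.
      $out .= chr($b);
    }
    else {
      // U+0080..U+00FF become two bytes: 110000xx 10xxxxxx.
      $out .= chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
    }
  }
  return $out;
}

$once  = "v\xC3\xA3lue";         // "vãlue", already valid UTF-8
$twice = latin1_to_utf8($once);  // encoded a second time by mistake

print bin2hex($once) . "\n";     // 76c3a36c7565
print bin2hex($twice) . "\n";    // 76c383c2a36c7565
```

In the first dump the ã takes two bytes (c3 a3); in the second each of those bytes has been expanded again, giving four (c3 83 c2 a3). That is the pattern that showed up in the page source.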

The source was an XML feed in UTF-8, which looked perfectly fine. I went over import.module and couldn't find any explicit UTF-8 encoding step. Some testing revealed that PHP's XML parser was the culprit:

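<?php
// A tiny UTF-8 document; \xC3\xA3 is the 2-byte UTF-8 sequence for "ã".
$xmlfile = "<?xml version=\"1.0\" encoding=\"utf-8\" ?><tag>UTF8 v\xC3\xA3lue</tag>";

// Print each chunk of character data exactly as the parser hands it over.
function handler_data($parser, $data) {
  print "Data: $data\r\n";
}

// No source encoding is passed here; only the target encoding is set below.
$xml_parser = xml_parser_create();
xml_set_character_data_handler($xml_parser, "handler_data");
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, "utf-8");
// The third argument marks this as the final (and only) chunk of input.
xml_parse($xml_parser, $xmlfile, true);
xml_parser_free($xml_parser);
?>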

The input XML contains the word vãlue and is encoded as UTF-8. The output encoding is set to UTF-8 as well, so you would expect PHP to print the value unchanged (i.e. with 2 bytes for the ã character). Not so... PHP incorrectly treats the input as ISO-8859-1 and encodes it as UTF-8 again, resulting in 4 bytes for the ã character.

This is strange because PHP claims to support UTF-8 source encoding.

Dev  Drupal  Unicode