Drupal XML weirdness
So I was looking at Drupal's import page and noticed that non-ASCII characters looked quite botched. Viewing the page source revealed that the input had apparently been UTF-8 encoded twice (that is, encoded as UTF-8, then assumed to be ISO-8859-1 and encoded as UTF-8 again).
The source was an XML feed in UTF-8 which looked perfectly fine. I went over import.module and couldn't see any explicit UTF-8 encoding step. Some testing revealed that PHP's XML parser was the culprit:
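<?php
// A UTF-8 document containing "vãlue"; the ã is the two bytes C3 A3.
$xmlfile = "<?xml version=\"1.0\" encoding=\"utf-8\" ?><tag>UTF8 v\xC3\xA3lue</tag>";
// Print every chunk of character data the parser hands back.
function handler_data($parser, $data) {
    print "Data: $data\r\n";
}
// No source encoding is passed here; only the target encoding is set below.
$xml_parser = xml_parser_create();
xml_set_character_data_handler($xml_parser, "handler_data");
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, "utf-8");
xml_parse($xml_parser, $xmlfile, true); // true: this is the final (and only) chunk
xml_parser_free($xml_parser);
?>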
The input XML contains the word vãlue and is encoded in UTF-8. The output encoding is specified as UTF-8 as well, so you would expect PHP to print the value unchanged (i.e. with 2 bytes, C3 A3, for the ã character). Not so... PHP incorrectly treats the input as ISO-8859-1 and encodes it as UTF-8 a second time, resulting in 4 bytes (C3 83 C2 A3) for the ã character.
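To make the 2-versus-4-byte claim concrete, here is a small sketch of the double encoding on its own, without the XML parser. The helper latin1_to_utf8() is a made-up name for illustration (it is not part of PHP or Drupal); it simply assumes every input byte is one ISO-8859-1 character and writes it out as UTF-8, which is the misreading described above.

```php
<?php
// Hypothetical helper, for illustration only: encode a string as UTF-8
// on the assumption that each input byte is one ISO-8859-1 character.
function latin1_to_utf8($bytes) {
    $out = '';
    for ($i = 0; $i < strlen($bytes); $i++) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            $out .= chr($b);                  // ASCII passes through unchanged
        } else {
            $out .= chr(0xC0 | ($b >> 6))     // lead byte of a 2-byte sequence
                  . chr(0x80 | ($b & 0x3F));  // continuation byte
        }
    }
    return $out;
}

$once  = "\xC3\xA3";             // "ã" correctly encoded as UTF-8
$twice = latin1_to_utf8($once);  // the same bytes misread as ISO-8859-1

print bin2hex($once) . "\n";     // c3a3
print bin2hex($twice) . "\n";    // c383c2a3
?>
```

Each of the two original bytes is above 0x7F, so each one turns into a 2-byte sequence of its own, which is where the 4 bytes come from.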
This is strange because PHP claims to support UTF-8 source encoding.
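One thing that might be worth trying is naming the source encoding explicitly when creating the parser. This is only a sketch under an assumption: that xml_parser_create() falls back to ISO-8859-1 as the source encoding when called without an argument, and that the optional encoding argument overrides this. I have not confirmed that this is what happens inside Drupal's import.module. The handler name collect_data and the $collected variable are made up for this example.

```php
<?php
// Same test document as above: "vãlue" with ã as the UTF-8 bytes C3 A3.
$xmlfile = "<?xml version=\"1.0\" encoding=\"utf-8\" ?><tag>UTF8 v\xC3\xA3lue</tag>";

$collected = '';

// The parser may deliver text in several chunks, so append instead of assigning.
function collect_data($parser, $data) {
    global $collected;
    $collected .= $data;
}

// Assumption: passing "UTF-8" here tells the parser how to read the input.
$xml_parser = xml_parser_create("UTF-8");
xml_set_character_data_handler($xml_parser, "collect_data");
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_parse($xml_parser, $xmlfile, true);
xml_parser_free($xml_parser);

// Hex dump of what came out, so the byte count is visible regardless of terminal encoding.
print bin2hex($collected) . "\n";
?>
```

If the assumption holds, the hex dump should contain c3a3 once rather than c383c2a3.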