Drupal XML weirdness

Dec 14, 2003

So I was looking at Drupal's import page and noticed that non-ASCII characters looked quite botched. Viewing the page source revealed that the input had apparently been UTF-8 encoded twice (that is, encoded as UTF-8, then assumed to be ISO-8859-1 and encoded as UTF-8 again).

The source was an XML feed in UTF-8, which looked perfectly fine. I went over import.module and couldn't find any explicit UTF-8 encoding step. Some testing revealed that PHP's XML parser was the culprit:

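<?php
// A UTF-8 document; the ã in "vãlue" is the two bytes C3 A3.
$xmlfile = "<?xml version=\"1.0\" encoding=\"utf-8\" ?><tag>UTF8 v\xC3\xA3lue</tag>";

// Prints each chunk of character data the parser hands over.
function handler_data($parser, $data) {
    print "Data: $data\r\n";
}

// Note: no encoding argument is passed to xml_parser_create() here.
$xml_parser = xml_parser_create();
xml_set_character_data_handler($xml_parser, "handler_data");
// Ask for UTF-8 output from the parser.
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, "utf-8");
xml_parse($xml_parser, $xmlfile, true); // true: this is the final chunk
xml_parser_free($xml_parser);
?>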

The input XML contains the word vãlue and is encoded in UTF-8. The output encoding is specified as UTF-8 as well, so you would expect PHP to print the value unchanged (i.e. with 2 bytes for the ã character). Not so... PHP incorrectly treats the input as ISO-8859-1 and re-encodes it as UTF-8, resulting in 4 bytes for the ã character.

This is strange because PHP claims to support UTF-8 source encoding.
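
The 2-byte versus 4-byte difference can be reproduced without the XML parser at all. Here is a minimal sketch of the double encoding; latin1_to_utf8 is a made-up helper for illustration (it is not part of PHP or Drupal), and it only mimics what the parser appears to be doing:

```php
<?php
// Hypothetical helper (not part of PHP or Drupal): treats every byte of
// $bytes as one ISO-8859-1 character and encodes that character as UTF-8.
function latin1_to_utf8($bytes) {
    $out = '';
    for ($i = 0; $i < strlen($bytes); $i++) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            // ASCII bytes are identical in both encodings.
            $out .= chr($b);
        } else {
            // Code points 0x80-0xFF become a two-byte UTF-8 sequence.
            $out .= chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
        }
    }
    return $out;
}

$once  = "v\xC3\xA3lue";         // correct UTF-8: ã is C3 A3
$twice = latin1_to_utf8($once);  // the same bytes misread as ISO-8859-1

print bin2hex($once) . " (" . strlen($once) . " bytes)\n";
print bin2hex($twice) . " (" . strlen($twice) . " bytes)\n";
```

The first line shows c3 a3 for the ã, the second shows c3 83 c2 a3: each of the two original bytes has been expanded into its own two-byte sequence, which is the botched output seen on the import page.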

-

Dec 27, 2003 Anonymous

What happens when you leave out the set_option call and pass the encoding to xml_parser_create instead? Like:

$xml_parser = xml_parser_create("UTF-8");

Also (far-fetched, I know), what about the caps? utf/UTF? It shouldn't matter, but you never know :p

Input vs Output

Dec 29, 2003 Steven

Actually, that set_option call is there to specify the output encoding, not the input... but you've brought the solution to my attention: the parameter to xml_parser_create is indeed the input encoding that PHP 4 uses.
PHP 4 does not detect the input encoding automatically; it has to be specified explicitly.

PHP 5 does detect it automatically (and ignores any encoding given to xml_parser_create as far as the input is concerned).
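
For reference, here is a sketch of the test script with the input encoding passed to xml_parser_create. The collect_data handler and the $collected variable are names made up for this example; the handler appends rather than prints because the parser may deliver character data in several chunks. On PHP 5 and later the explicit argument should make no difference to the input handling.

```php
<?php
$xmlfile = "<?xml version=\"1.0\" encoding=\"utf-8\" ?><tag>UTF8 v\xC3\xA3lue</tag>";

// Illustrative names: gather all character data into one string.
$collected = '';
function collect_data($parser, $data) {
    global $collected;
    $collected .= $data;
}

// Pass the input encoding explicitly, as PHP 4 requires.
$xml_parser = xml_parser_create("UTF-8");
xml_set_character_data_handler($xml_parser, "collect_data");
// Output encoding, unchanged from the original test.
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_parse($xml_parser, $xmlfile, true);

// Dump the bytes in hex so the encoding of the ã is visible.
print bin2hex($collected) . "\n";
```

If the ã shows up as c3 a3 in the hex dump, the value passed through unchanged; c3 83 c2 a3 would mean it was encoded twice.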
