PHP xml_parse_into_struct breaks because of &acute

I am trying to convert XML into a PHP array. The problem is that xml_parse_into_struct only converts the string up to the point where it encounters &acute;. I have the following code:

$xmlStr = file_get_contents($url);
$parser = xml_parser_create('UTF-8'); // the parser must be created before options are set
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
xml_parse_into_struct($parser, trim($xmlStr), $xml_values);

When viewed as HTML, the text renders as there´s. Any help will be much appreciated.

Have you tried SimpleXML?

For example:

libxml_use_internal_errors(true);
$xml_string = file_get_contents($url);
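// Note: this also decodes &amp; and &lt;, so it is only safe when the text contains no escaped markup characters.
$xml_string = html_entity_decode($xml_string, ENT_QUOTES, "UTF-8");
$xml_data = new SimpleXMLElement($xml_string);
var_dump($xml_data); // dumps the SimpleXMLElement object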

&acute; is not a predefined named entity in XML, so you cannot load a document containing it as plain XML, only as (X)HTML.
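To confirm that the undefined entity is what stops the parser, the errors libxml records can be inspected. This is a minimal sketch using a small sample fragment; the variable names are illustrative and not taken from the question.

```php
<?php
// Sample fragment containing a named entity that XML does not predefine.
$xml = '<p>there&acute;s an acute</p>';

// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadXML($xml);

// Gather the messages libxml recorded while parsing.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}
libxml_clear_errors();

var_dump($loaded);   // false when the document is not well-formed
print_r($messages);
```

The same error collection should also work with SimpleXML, since both extensions sit on top of libxml.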

You can use DOMDocument to load HTML, but by default this will "repair" the fragment into a complete HTML document:

$html = '<p>there&acute;s an acute</p>';

$dom = new DOMDocument();
$dom->loadHtml($html);

echo $dom->saveXml();

Output:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>there&#xB4;s an acute</p></body></html>

You can see that &acute; was converted to its numeric character reference. That form is valid in XML; the named entity is not, unless a DTD declares it.
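If you control the document and can prepend a DOCTYPE, declaring the entity in an internal DTD subset is one way to make the named reference acceptable to an XML parser. The sketch below assumes that; the DOCTYPE block is an addition for illustration and is not part of the original input.

```php
<?php
// Hypothetical input: the DOCTYPE with the entity declaration is added for this sketch.
$xml = <<<'XML'
<!DOCTYPE p [
  <!ENTITY acute "&#180;">
]>
<p>there&acute;s an acute</p>
XML;

$dom = new DOMDocument();
// LIBXML_NOENT asks libxml to substitute entity references with their values.
$dom->loadXML($xml, LIBXML_NOENT);

$text = $dom->documentElement->textContent;
echo $text, "\n";
```

LIBXML_NOENT also allows external entities to be substituted, so it should be used with care on untrusted input.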

Here is another approach: decode all the named entities to UTF-8 characters using string functions:

$html = '<p>there&acute;s an acute</p>';

$namedEntities = array_flip(
  array_diff(
    get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES, 'UTF-8'),
    get_html_translation_table(HTML_SPECIALCHARS, ENT_NOQUOTES, 'UTF-8')
  )
);
$xml = strtr($html, $namedEntities);

$dom = new DOMDocument();
$dom->loadXml($xml);

echo $dom->saveXml();

Output:

<?xml version="1.0"?>
<p>there&#xB4;s an acute</p>
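The array_diff call above is what keeps &amp;, &lt; and &gt; escaped, and that matters as soon as the text contains an ampersand. The following sketch compares it with decoding everything via html_entity_decode; the sample string is made up for illustration.

```php
<?php
// Made-up sample containing both an XML-safe escape and an HTML-only entity.
$html = '<p>Tom &amp; Jerry: there&acute;s an acute</p>';

// Entity => character map without &amp;, &lt; and &gt; (same construction as above).
$namedEntities = array_flip(
    array_diff(
        get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES, 'UTF-8'),
        get_html_translation_table(HTML_SPECIALCHARS, ENT_NOQUOTES, 'UTF-8')
    )
);

$selective  = strtr($html, $namedEntities);                   // keeps &amp; escaped
$everything = html_entity_decode($html, ENT_QUOTES, 'UTF-8'); // turns &amp; into a bare &

libxml_use_internal_errors(true); // suppress warnings for the failing case

$domA = new DOMDocument();
$okSelective = $domA->loadXML($selective);

$domB = new DOMDocument();
$okEverything = $domB->loadXML($everything);
libxml_clear_errors();

var_dump($okSelective);  // true: still well-formed
var_dump($okEverything); // false: the bare & is not well-formed XML
```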

This will work even with the old extension you're using at the moment:

$html = '<p>there&acute;s an acute</p>';

$namedEntities = array_flip(
  array_diff(
    get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES, 'UTF-8'),
    get_html_translation_table(HTML_SPECIALCHARS, ENT_NOQUOTES, 'UTF-8')
  )
);

$xml = strtr($html, $namedEntities);

$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
xml_parse_into_struct($parser, $xml, $xml_values);

var_dump($xml_values);

Output:

array(1) {
  [0]=>
  array(4) {
    ["tag"]=>
    string(1) "p"
    ["type"]=>
    string(8) "complete"
    ["level"]=>
    int(1)
    ["value"]=>
    string(17) "there´s an acute"
  }
}


 