
PHP xml_parse_into_struct breaks because of &acute;

I am trying to convert XML into a PHP array. The problem is that xml_parse_into_struct only converts the string up to the point where it encounters &acute;. I have the following code.

// Read the document and create the parser before configuring it.
$contents = file_get_contents($url);
$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
xml_parse_into_struct($parser, trim($contents), $xml_values);

When viewed as HTML, the text looks like there´s. Any help will be much appreciated.

Have you tried SimpleXML?

For example:

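libxml_use_internal_errors(true); // collect libxml errors instead of emitting warnings
$xml_string = file_get_contents($url);
// Decode HTML named entities such as &acute; into UTF-8 characters.
// Caveat: this also decodes &amp;, &lt; and &gt;, which can break well-formedness.
$xml_string = html_entity_decode($xml_string, ENT_QUOTES, "UTF-8");
$xml_data = new SimpleXMLElement($xml_string);
var_dump($xml_data); // dumps the SimpleXMLElement object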
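
Since the goal in the question is a PHP array, here is a minimal sketch of one common way to get there from SimpleXML: a round trip through JSON. The sample document and the variable names are assumptions for illustration, and the input uses the numeric reference &#xB4; so that it is already valid XML.

```php
<?php
// Assumed sample input; the numeric reference &#xB4; is valid in plain XML.
$xmlString = '<root><item>there&#xB4;s an acute</item><count>2</count></root>';

$xml = new SimpleXMLElement($xmlString);

// Encode the object tree as JSON, then decode it into nested arrays.
// All leaf values come back as strings.
$array = json_decode(json_encode($xml), true);

var_dump($array);
```

This shortcut is lossy for attributes and mixed content, so it is only a reasonable fit for simple, element-only documents.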

&acute; is an unknown named entity in XML; you cannot load it as plain XML, only as (X)HTML.
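
To see what the parser itself reports, the following sketch parses the same text twice, once with the named entity and once with the numeric reference. The helper name tryParse is hypothetical, and the exact error text for the named-entity case depends on the underlying XML library, so it is printed rather than assumed.

```php
<?php
// Hypothetical helper: parse a string and report whether the parser succeeded.
function tryParse(string $xml): array
{
    $parser = xml_parser_create('UTF-8');
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    $values = [];
    $ok = xml_parse_into_struct($parser, $xml, $values);

    return [
        'ok'     => $ok === 1,
        // Only look up the error message when parsing did not succeed.
        'error'  => $ok === 1 ? null : xml_error_string(xml_get_error_code($parser)),
        'line'   => xml_get_current_line_number($parser),
        'values' => $values,
    ];
}

$named   = tryParse('<p>there&acute;s an acute</p>'); // named HTML entity
$numeric = tryParse('<p>there&#xB4;s an acute</p>');  // numeric character reference

var_dump($named['ok'], $named['error']);
var_dump($numeric['ok'], $numeric['values'][0]['value']);
```

Checking the return value of xml_parse_into_struct this way makes a silent truncation visible instead of leaving a partially filled array.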

You can use DOMDocument to load HTML, but by default this will "repair" the document into a complete HTML file:

$html = '<p>there&acute;s an acute</p>';

$dom = new DOMDocument();
$dom->loadHtml($html);

echo $dom->saveXml();

Output:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>there&#xB4;s an acute</p></body></html>

You can see that &acute; was converted to its numeric character reference. That form is valid in XML; the named entity is not, unless a DTD declares it.
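
The DTD remark can be illustrated with a short sketch. It assumes you control the document and can add an internal DTD subset that declares the entity; the sample markup is made up for illustration.

```php
<?php
// Assumed sample: an internal DTD subset declares the "acute" entity.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE p [
  <!ENTITY acute "&#180;">
]>
<p>there&acute;s an acute</p>
XML;

$dom = new DOMDocument();
// LIBXML_NOENT asks libxml to substitute entity references while parsing.
$dom->loadXML($xml, LIBXML_NOENT);

echo $dom->documentElement->textContent, "\n";
```

Entity substitution is best reserved for trusted input, since entity declarations in untrusted documents are a known attack vector.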

Here is another approach: you can decode all the named entities to UTF-8 using string functions:

$html = '<p>there&acute;s an acute</p>';

$namedEntities = array_flip(
  array_diff(
    get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES, 'UTF-8'),
    get_html_translation_table(HTML_SPECIALCHARS, ENT_NOQUOTES, 'UTF-8')
  )
);
$xml = strtr($html, $namedEntities);

$dom = new DOMDocument();
$dom->loadXml($xml);

echo $dom->saveXml();

Output:

<?xml version="1.0"?>
<p>there&#xB4;s an acute</p>

This will work even with the old extension you're using at the moment:

$html = '<p>there&acute;s an acute</p>';

$namedEntities = array_flip(
  array_diff(
    get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES, 'UTF-8'),
    get_html_translation_table(HTML_SPECIALCHARS, ENT_NOQUOTES, 'UTF-8')
  )
);

$xml = strtr($html, $namedEntities);

$parser = xml_parser_create ('utf-8');
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
xml_parse_into_struct($parser, $xml, $xml_values);

var_dump($xml_values);

Output:

array(1) {
  [0]=>
  array(4) {
    ["tag"]=>
    string(1) "p"
    ["type"]=>
    string(8) "complete"
    ["level"]=>
    int(1)
    ["value"]=>
    string(17) "there´s an acute"
  }
}
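
If you need this in more than one place, the same steps can be wrapped in a small function. This is only a sketch: the names namedEntityMap and parseIntoStruct are hypothetical, and the error handling is one possible choice. It also shows that XML's own entities such as &amp; are left for the parser to handle.

```php
<?php
// Hypothetical helper: map of HTML named entities to UTF-8 characters,
// excluding &amp;, &lt; and &gt;, which XML understands itself.
function namedEntityMap(): array
{
    return array_flip(array_diff(
        get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES, 'UTF-8'),
        get_html_translation_table(HTML_SPECIALCHARS, ENT_NOQUOTES, 'UTF-8')
    ));
}

// Hypothetical helper: replace named entities, parse, and fail loudly on errors.
function parseIntoStruct(string $markup): array
{
    $xml = strtr($markup, namedEntityMap());

    $parser = xml_parser_create('UTF-8');
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);

    $values = [];
    if (xml_parse_into_struct($parser, $xml, $values) !== 1) {
        throw new RuntimeException(sprintf(
            'XML error "%s" at line %d',
            xml_error_string(xml_get_error_code($parser)),
            xml_get_current_line_number($parser)
        ));
    }

    return $values;
}

$values = parseIntoStruct('<p>there&acute;s &amp; more</p>');
var_dump($values[0]['value']);
```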
