PHP: xml_parser "Mismatched tag" error when parsing HTML (auto-closing tags such as <img>)?

I want to parse HTML using PHP. I used xml_parser for it, but it can't cope with auto-closing tags such as <img>.

For example, the following HTML snippet produces a 'Mismatched tag' error when the parser reaches the closing tag </a>:

<a>
  <img src="URL"><br>
</a>

Obviously, the reason is that the XML parser doesn't know that the tags <img> and <br> do not need to be closed (in HTML they are self-closing automatically).

I know that I could rewrite the HTML to <img src="URL"/><br/> to make the parser happy. However, I want the parser to process the original markup correctly, since it is already valid HTML.
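To make the difference concrete, here is a minimal sketch that feeds both variants to the XML parser and reports whether each one parses. The helper name parseErrorFor() is made up for this illustration and is not part of the XML extension; the exact error text can vary with the underlying library, so the sketch only distinguishes "error" from "no error".

```php
<?php
// Hypothetical helper: returns null if $markup parses as XML,
// otherwise the parser's error message.
function parseErrorFor(string $markup): ?string {
    $parser = xml_parser_create();
    // Third argument true: this is the final (and only) chunk of data.
    $ok = xml_parse($parser, $markup, true);
    return $ok ? null : xml_error_string(xml_get_error_code($parser));
}

$htmlStyle = '<a><img src="URL"><br></a>';   // valid HTML, not well-formed XML
$xmlStyle  = '<a><img src="URL"/><br/></a>'; // well-formed XML

foreach (['HTML style' => $htmlStyle, 'XML style' => $xmlStyle] as $label => $markup) {
    $error = parseErrorFor($markup);
    echo $label, ': ', $error === null ? 'parsed without error' : "error ($error)", "\n";
}
```

The first variant should be rejected and the second accepted, which is exactly the strictness gap described above.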

So I need to tell the parser, from within onOpeningTag, whether the current tag is auto-closing. Is this possible somehow? An alternative would be to give the parser a list of self-closing tag names. However, I didn't find any function for that, so it may simply be that this parser doesn't support HTML.

An acceptable solution might be to disable the tag mismatch check altogether (or to implement an HTML-compatible version myself).

However, there may be an HTML-specific parser in PHP that I overlooked. Any suggestions for other simple parser implementations I could use?

Here's what I have so far:

<?php

// Command Line Parsing...
$file = $argv[1];


// Tag Handler functions
function onOpeningTag($parser, $name, $attrs) {
  echo "OPEN: $name\n";
}

function onClosingTag($parser, $name) {
  echo "CLOSE: $name\n";
}

function onContent($parser, $text) {
  echo "TEXT (LEN:".strlen($text).")\n";
}

// Parser...
$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "onOpeningTag", "onClosingTag");
xml_set_character_data_handler($xml_parser, "onContent");

if (!($fp = fopen($file, "r"))) die("Could not open file '$file'.\n");
while ($data = fread($fp, 4096)) {
  if (!xml_parse($xml_parser, $data, feof($fp))) {
    die(sprintf("XML error: %s at line %d\n",
      xml_error_string(xml_get_error_code($xml_parser)),
      xml_get_current_line_number($xml_parser)));
  }
}
fclose($fp);

xml_parser_free($xml_parser);


?>

You are trying to parse HTML with an XML parser, and that is bound to cause headaches. XML is far stricter than HTML, so you'll keep running into problems like this. If your HTML is a normal web page rather than something huge (tens of MBs), you can just use DOM: http://php.net/manual/en/book.dom.php

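$html = file_get_contents($file);   // $file: path to the HTML document
libxml_use_internal_errors(true);   // collect warnings about sloppy markup instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$lists = $dom->getElementsByTagName('ul');
foreach ($lists as $list) {
    echo $list->nodeName, "\n";     // work with each <ul> element here
}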
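If you still want the OPEN/CLOSE/TEXT style output of your handlers, one possible approach is to walk the DOM tree after loading. The following is only a sketch: walk() is a hypothetical helper written for this example, and note that loadHTML() wraps fragments in <html> and <body> elements, so those show up in the output as well.

```php
<?php
// Hypothetical helper: records OPEN/CLOSE/TEXT events for every node
// below $node, in document order, mimicking the handlers from the question.
function walk(DOMNode $node, array &$events): void {
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $events[] = "OPEN: " . $child->nodeName;
            walk($child, $events);               // descend into children
            $events[] = "CLOSE: " . $child->nodeName;
        } elseif ($child instanceof DOMText) {
            $events[] = "TEXT (LEN:" . strlen($child->nodeValue) . ")";
        }
        // Other node types (doctype, comments, ...) are ignored here.
    }
}

libxml_use_internal_errors(true);                // keep markup warnings quiet
$dom = new DOMDocument();
$dom->loadHTML('<a><img src="URL"><br></a>');    // the snippet from the question

$events = [];
walk($dom, $events);
echo implode("\n", $events), "\n";
```

Because the HTML parser knows that <img> and <br> have no content, each of them is reported as an OPEN immediately followed by its CLOSE, and </a> no longer causes a mismatch.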

My suggestion is to try a specialised library for HTML parsing. Here are some suggestions:

May the force be with you!
