PHP: xml_parser “Mismatched tag” error when parsing HTML (auto-closing tags such as <img>)?

I want to parse HTML using PHP. I used xml_parser for it, but it can't cope with auto-closing tags such as <img> .

For example, the following HTML snippet produces a 'Mismatched tag' error when it reaches the closing tag </a> :

<a>
  <img src="URL"><br>
</a>

Obviously, the reason is that xml_parser doesn't know that the tags <img> and <br> do not need to be closed (they close themselves automatically).

I know that I could rewrite the HTML as <img src="URL"/><br/> to make the parser happy. However, I want the parser to process the HTML above correctly as it is, since that variant is valid HTML.

So I need a way to tell the parser, within onOpeningTag, that the current tag is auto-closing. Is this possible somehow? An alternative would be to give the parser a list of the self-closing tag names. However, I didn't find any function for that, so it may be that HTML simply isn't supported by this parser.

An acceptable solution might be to disable the tag mismatch check altogether (or to implement an HTML-compatible version myself).

However, there could be an HTML-specific parser in PHP that I overlooked. Any suggestions for other simple parser implementations I could use?

Here's what I have so far:

<?php

// Command Line Parsing...
$file = $argv[1];


// Tag Handler functions
function onOpeningTag($parser, $name, $attrs) {
  echo "OPEN: $name\n";
}

function onClosingTag($parser, $name) {
  echo "CLOSE: $name\n";
}

function onContent($parser, $text) {
  echo "TEXT (LEN:".strlen($text).")\n";
}

// Parser...
$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "onOpeningTag", "onClosingTag");
xml_set_character_data_handler($xml_parser, "onContent");

if (!($fp = fopen($file, "r"))) die("Could not open file '$file'.\n");
while ($data = fread($fp, 4096)) {
  if (!xml_parse($xml_parser, $data, feof($fp))) {
    die(sprintf("XML error: %s at line %d\n",
      xml_error_string(xml_get_error_code($xml_parser)),
      xml_get_current_line_number($xml_parser)));
  }
}
fclose($fp);

xml_parser_free($xml_parser);


?>

You want to parse HTML with an XML parser, and that is bound to cause headaches. XML is far stricter than HTML, so you will keep running into problems like this. Unless your HTML is huge (tens of MBs rather than a normal web page), you can just use DOM: http://php.net/manual/en/book.dom.php .

$dom = new DOMDocument();
$dom->loadHTML($html); // $html holds the markup as a string
$lists = $dom->getElementsByTagName('ul');
// iterate over $lists and process each <ul> element as needed
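
To get something close to the OPEN/CLOSE/TEXT output of the handlers in the question, you can walk the DOM tree yourself. This is only a rough sketch: walkElement() and htmlEvents() are names I made up for illustration, and it assumes you are fine with DOMDocument repairing the markup (it wraps fragments in <html> and <body>, which is why the walk starts at the first matching tag instead of the document root). Tag names are upper-cased only to mimic xml_parser's default case folding, and whitespace-only text nodes are skipped.

```php
<?php
// Sketch only: walkElement() and htmlEvents() are hypothetical helper names.

// Recursively visit a node and record OPEN/CLOSE/TEXT events for it.
function walkElement(DOMNode $node, array &$events): void {
    if ($node instanceof DOMElement) {
        $name = strtoupper($node->nodeName);
        $events[] = "OPEN: $name";
        foreach ($node->childNodes as $child) {
            walkElement($child, $events);
        }
        $events[] = "CLOSE: $name";
    } elseif ($node instanceof DOMText && trim($node->data) !== '') {
        $events[] = "TEXT (LEN:" . strlen($node->data) . ")";
    }
}

// Parse an HTML string and return the events below the first <$tag> element.
function htmlEvents(string $html, string $tag): array {
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $events = [];
    $root = $dom->getElementsByTagName($tag)->item(0);
    if ($root !== null) {
        walkElement($root, $events);
    }
    return $events;
}

// The snippet from the question, with unclosed <img> and <br>.
$events = htmlEvents("<a>\n  <img src=\"URL\"><br>\n</a>", 'a');
echo implode("\n", $events), "\n";
```

Unlike xml_parser this builds the whole tree in memory before you see any event, so it is not a streaming replacement, but <img> and <br> no longer trigger a mismatch.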

My suggestion is to try a specialised library for HTML parsing. Here are some suggestions:

May the force be with you!
