Scraping malformed HTML with PHP DomDocument

I'm using PHP DomDocument + XPath to scrape various web pages. I found that in some cases DomDocument is unable to load the HTML at all and just returns an empty result, for example when the page contains two body tags or has a wrong DOCTYPE declaration. I've tried preprocessing the malformed HTML with PHP Tidy, and it really helps, but PHP Tidy is very slow!

I don't want to use any third-party libraries like Simple Html Dom Parser.

Please advise how to deal with malformed HTML using PHP DomDocument. Should I write a custom regexp to fix broken HTML before sending to DomDocument? Maybe I missed some settings for PHP DomDocument?

UPD

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL, 'http://example.com');
$result = curl_exec($ch);
curl_close($ch);

$dom = new DomDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($result);
libxml_clear_errors();
var_dump($dom);

$xpath = new DomXPath($dom);
$nodes = $xpath->query(".//*[@id='content']/ul/li/div[2]/h3/a");

var_dump($nodes); // Nothing

Result of var_dump($dom);

object(DOMDocument)#25 (34) {
  ["doctype"]=>
  string(22) "(object value omitted)"
  ["implementation"]=>
  string(22) "(object value omitted)"
  ["documentElement"]=>
  NULL
  ["actualEncoding"]=>
  string(5) "UTF-8"
  ["encoding"]=>
  string(5) "UTF-8"
  ["xmlEncoding"]=>
  string(5) "UTF-8"
  ["standalone"]=>
  bool(true)
  ["xmlStandalone"]=>
  bool(true)
  ["version"]=>
  NULL
  ["xmlVersion"]=>
  NULL
  ["strictErrorChecking"]=>
  bool(true)
  ["documentURI"]=>
  NULL
  ["config"]=>
  NULL
  ["formatOutput"]=>
  bool(false)
  ["validateOnParse"]=>
  bool(false)
  ["resolveExternals"]=>
  bool(false)
  ["preserveWhiteSpace"]=>
  bool(true)
  ["recover"]=>
  bool(false)
  ["substituteEntities"]=>
  bool(false)
  ["nodeName"]=>
  string(9) "#document"
  ["nodeValue"]=>
  NULL
  ["nodeType"]=>
  int(13)
  ["parentNode"]=>
  NULL
  ["childNodes"]=>
  string(22) "(object value omitted)"
  ["firstChild"]=>
  string(22) "(object value omitted)"
  ["lastChild"]=>
  string(22) "(object value omitted)"
  ["previousSibling"]=>
  NULL
  ["attributes"]=>
  NULL
  ["ownerDocument"]=>
  NULL
  ["namespaceURI"]=>
  NULL
  ["prefix"]=>
  string(0) ""
  ["localName"]=>
  NULL
  ["baseURI"]=>
  NULL
  ["textContent"]=>
  string(0) ""
}

UPD2. A repeated <body> is OK for DomDocument. The real problem was leading whitespace in the HTML; it was solved by adding trim(): $dom->loadHTML(trim($result));
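The fix from UPD2 can be wrapped in a small helper. This is only a sketch: the function name loadTrimmedHtml and the sample markup are made up for illustration, and whether leading whitespace actually breaks parsing may depend on the libxml version in use. The check on documentElement mirrors the NULL seen in the var_dump above.

```php
<?php
// Hypothetical helper (name is an assumption): trim the markup, parse it,
// and fail loudly instead of silently returning an empty document.
function loadTrimmedHtml(string $html): DOMDocument
{
    $html = trim($html);
    if ($html === '') {
        throw new InvalidArgumentException('No HTML to parse');
    }

    // Keep libxml warnings out of the default error handler while parsing.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // An empty document shows up as a NULL documentElement.
    if ($dom->documentElement === null) {
        throw new RuntimeException('HTML could not be parsed: document is empty');
    }

    return $dom;
}

// Sample input (made up): leading whitespace and a repeated <body> tag.
$html = "\n\n   <html><body><body><div id=\"content\"><ul><li><a href=\"/x\">Link</a></li></ul></div></body></html>";

$dom = loadTrimmedHtml($html);
$xpath = new DOMXPath($dom);
$links = $xpath->query("//*[@id='content']/ul/li/a");

foreach ($links as $link) {
    echo $link->textContent, PHP_EOL;
}
```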

DOMDocument's loadHTML() method copes fairly well with malformed HTML; however, it is going to generate a lot of errors. You will want to keep these errors from bubbling up into your default error handler, like this:

<?php
// some process of fetching the HTML page
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($scrappedPage);
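Suppressing the errors does not discard them: libxml keeps them in an internal buffer that can be read back with libxml_get_errors(). A minimal sketch, assuming a made-up helper name (parseHtmlCollectingErrors) and made-up sample markup; which messages you get, if any, depends on the libxml version.

```php
<?php
// Hypothetical helper (name is an assumption): parse HTML quietly and return
// the document together with the libxml messages raised while parsing.
function parseHtmlCollectingErrors(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors(); // start from an empty error buffer

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError with level, code, line, column and message.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the previous setting

    return [$doc, $messages];
}

// Sample input (made up): unclosed tags and a stray end tag.
[$doc, $messages] = parseHtmlCollectingErrors(
    '<html><body><p>Unclosed <b>bold<div>text</p></foo></body></html>'
);

foreach ($messages as $message) {
    echo $message, PHP_EOL;
}
```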

It might be worthwhile to fetch the page with cURL, if you are not doing that already, before passing it to DOMDocument, so you can be sure that you are not suffering from timeout issues while dealing with very bad HTML. This would also let you save a copy of the file locally and inspect the errors that are being encountered. It would also mean that you would have a malformed HTML example to show for your next question.
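One way to do this is sketched here, assuming a made-up helper name (fetchHtmlWithCache), a cache path of your choosing, and arbitrary timeout values. It returns the local copy when one exists, so repeated test runs do not hit the network.

```php
<?php
// Hypothetical helper (name, cache path and timeouts are assumptions):
// return the locally saved copy if present, otherwise fetch with cURL and save it.
function fetchHtmlWithCache(string $url, string $cacheFile, int $timeoutSeconds = 15): string
{
    if (is_file($cacheFile)) {
        return (string) file_get_contents($cacheFile);
    }

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow redirects
        CURLOPT_CONNECTTIMEOUT => 5,               // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds, // seconds allowed for the whole transfer
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }

    // Keep a local copy so the exact markup can be inspected later.
    file_put_contents($cacheFile, $body);

    return $body;
}
```

A call such as fetchHtmlWithCache('http://example.com', sys_get_temp_dir() . '/example.html') would then give you both the string to pass to loadHTML() and a file to attach to a bug report or question.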

Since PHP 5.4.0 and Libxml 2.6.0, you can also use the optional options parameter to give additional Libxml parameters. Some of these might be of use:

  • LIBXML_HTML_NODEFDTD : prevents a default doctype being added when one is not found
  • LIBXML_PARSEHUGE : relaxes any hardcoded limit from the parser. This affects limits like maximum depth of a document or the entity recursion, as well as limits of the size of text nodes.
  • Read more: http://php.net/manual/en/libxml.constants.php
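The constants are combined with a bitwise OR and passed as the second argument to loadHTML(). A minimal sketch with made-up sample markup, not a recommendation of which flags to use:

```php
<?php
// Sample input (made up): a fragment with no doctype of its own.
$html = '<div id="content"><ul><li><a href="/a">First</a></li></ul></div>';

libxml_use_internal_errors(true);

$doc = new DOMDocument();
// LIBXML_HTML_NODEFDTD: do not add a default doctype when none is found.
// LIBXML_PARSEHUGE: relax the parser's hardcoded limits.
$doc->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_PARSEHUGE);

libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = $xpath->query("//*[@id='content']//a");

foreach ($links as $link) {
    echo $link->textContent, ' => ', $link->getAttribute('href'), PHP_EOL;
}
```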

Should I write a custom regexp to fix broken HTML before sending to DomDocument?

Not until you have used Tidy, have understood why it didn't work for you, and have a clear understanding of how a regex could help in that specific case (and in a safe and stable manner).

Maybe I missed some settings for PHP DomDocument?

Perhaps the error handling (see libxml_use_internal_errors()) and the DOMDocument::$recover field.

But for sure you've missed the numerous existing Q&A material we already have on site about this topic. It contains many more suggestions, and I think we have 10+ questions that address the error-reporting part of your question alone.
