
Scraping with DOMDocument PHP

This is the current code that I have for scraping.

$item holds the HTML of a single div, taken inside the loop.

$doc = DOMDocument::loadHTML($item);
$xpath = new DOMXPath($doc);
$link = "//a[@class='s-item__link']";
$entries = $xpath->query($link);
foreach ($entries as $entry) {
    // do work here
}

I am changing the first two lines to be...

$doc = new DOMDocument();
$xpath = $doc->load($item);

With that, I am getting the following error...

Fatal error: Uncaught Error: Call to a member function query() on bool in

The error comes from $entries = $xpath->query($link); and I cannot figure out what to change this line to.

Any help would be appreciated.

UPDATE: same error

$doc = new DOMDocument();
$xpath = $doc->loadHTML($item);
$link = "//a[@class='s-item__link']";
$entries = $xpath->query($link);
foreach ($entries as $entry) {
    // do work here
}

Look at the return value from DOMDocument::load() ...

Returns true on success or false on failure. If called statically, returns a DOMDocument or false on failure.

Emphasis: Mine. Notice that you're not calling it statically anymore with your change.

So, with code like $xpath = $doc->load($item);, $xpath will of course be a bool (true or false), and your error makes total sense: Fatal error: Uncaught Error: Call to a member function query() on bool.
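To make that concrete, here is a minimal sketch of the instance-call pattern. The sample markup in $item and the $hrefs array are made up for illustration; the point is only that the bool returned by loadHTML() is kept separate from the DOMXPath object, which is built from the document itself.

```php
<?php
// Hypothetical sample markup standing in for $item from the question.
$item = '<div><a class="s-item__link" href="https://example.com/1">First</a></div>';

$doc = new DOMDocument();
// Called on an instance, loadHTML() returns a bool, not a document.
$loaded = $doc->loadHTML($item);
var_dump(is_bool($loaded));

// The XPath object is created from the document, not from the return value.
$xpath = new DOMXPath($doc);
$entries = $xpath->query("//a[@class='s-item__link']");

// Collect the href of every matching link.
$hrefs = [];
foreach ($entries as $entry) {
    $hrefs[] = $entry->getAttribute('href');
}
print_r($hrefs);
```

Checking $loaded before querying is probably worthwhile in a real scraper, since a failed parse leaves the document empty.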

I just pulled out the XPath code I'm using right now in my own PHP scraper. This should work...

$dom = new DOMDocument;
@$dom->loadHTML(mb_convert_encoding($htmltext, 'HTML-ENTITIES', 'UTF-8'));
$xpath = new DOMXPath($dom);

Explanation:

  • new DOMDocument : creates a new DOMDocument instance.
  • @$dom->loadHTML : the @ operator suppresses warnings. The parser is very verbose about malformed markup, and you rarely want to see every message.
  • mb_convert_encoding($htmltext, 'HTML-ENTITIES', 'UTF-8') : loadHTML() assumes ISO-8859-1 unless the markup declares a charset, so converting UTF-8 text to HTML entities first keeps non-ASCII characters intact. Note that the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2.
  • new DOMXPath($dom); : creates a new DOMXPath instance bound to the document.
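If you would rather not silence everything with @, one alternative is to let libxml collect the parse problems so you can inspect them later. This is only a sketch: the markup in $htmltext is invented, and the $links array is just there to show that the query still works on sloppy input.

```php
<?php
// Hypothetical input: sloppy markup of the kind scraped pages often contain.
$htmltext = '<div><a class="s-item__link" href="/item/1">One</a><p>unclosed</div>';

// Ask libxml to collect parse problems internally rather than emit warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$ok = $dom->loadHTML($htmltext);

// Each entry is a LibXMLError object; inspect or log them as needed.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the earlier setting

// Query only if the parse succeeded.
$links = [];
if ($ok) {
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query("//a[@class='s-item__link']") as $a) {
        $links[] = $a->getAttribute('href');
    }
}
print_r($links);
```

The trade-off is a few extra lines in exchange for being able to log what the parser complained about, which can help when a selector suddenly stops matching.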

->load expects a filename as its first parameter, as shown in the documentation.

In your first code block, you used loadHTML.

Use ->loadHTML instead of ->load on an empty DOMDocument:

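$doc = new DOMDocument();
$doc->loadHTML($item);        // returns a bool when called on an instance
$xpath = new DOMXPath($doc);  // build the XPath object from the document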

public load ( string $filename , int $options = 0 )         : DOMDocument|bool
public loadHTML ( string $source , int $options = 0 )       : DOMDocument|bool
public loadHTMLFile ( string $filename , int $options = 0 ) : DOMDocument|bool
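As a rough illustration of the difference between the string-based and file-based loaders, the sketch below parses the same markup both ways. The markup, the temporary file, and the variable names are assumptions made for the example only.

```php
<?php
// Hypothetical markup standing in for the scraped HTML.
$item = '<html><body><a class="s-item__link" href="/item/2">Two</a></body></html>';

// loadHTML() parses markup held in a string.
$fromString = new DOMDocument();
$fromString->loadHTML($item);

// loadHTMLFile() parses markup stored at a path, so write a temp file first.
$path = tempnam(sys_get_temp_dir(), 'dom');
file_put_contents($path, $item);
$fromFile = new DOMDocument();
$fromFile->loadHTMLFile($path);
unlink($path);

// Both documents should yield the same number of matching links.
$query = "//a[@class='s-item__link']";
$countString = (new DOMXPath($fromString))->query($query)->length;
$countFile = (new DOMXPath($fromFile))->query($query)->length;
echo $countString, ' ', $countFile, PHP_EOL;
```

Passing an HTML string to load() fails because the string is treated as a path to an XML file, which is why the original change produced a bool of false.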
