
Symfony DomCrawler empty object

I'm trying to scrape the rating scores of review sites using Laravel 4 and the Symfony DomCrawler. For example, on http://estorereview.com.au/s/5951/A-Supplements I want to get the 4.8 from "4.8 of 5 Stars".

This is part of my attempt:

<?php

use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\CssSelector\CssSelector;

function getRatingEstoreReview($url){
  $html = getHtmlCurl($url);
  $crawler = new Crawler($html);
  $crawler = $crawler->filter('span[itemprop="ratingValue"]'); 
  var_dump($crawler);
  die("test");
  return normalize($crawler,5);
}

The var_dump outputs the following:

object(Symfony\Component\DomCrawler\Crawler)[280]
  protected 'uri' => null
  private 'defaultNamespacePrefix' => string 'default' (length=7)
  private 'namespaces' => 
    array (size=0)
      empty

I tried this with other sites as well, but I always get an empty object. Accessing the value with $crawler->first() doesn't work either.

What am I doing wrong? Thank you.

Edit: Even if I filter for "div", the Crawler remains empty. The PHP Simple HTML DOM Parser works fine.

The full CSS path for that element is body > div:nth-child(3) > div > div > div.left-container.floatl > div.top > div.top-inner > div.store-rating-container.floatl > div.star-col.floatl.overall-rating-stars > div.rating-text.floatl > div > strong > span. Have you tried using that as your filter term instead?

You can also use filterXPath() instead, in which case you're looking for /html/body/div[3]/div/div/div[4]/div[1]/div[2]/div[2]/div[1]/div[2]/div/strong/span.
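As a rough sketch of the filterXPath() route, an attribute-based expression may be less brittle than the absolute path, since it does not depend on how many div elements wrap the span. The markup below is a made-up stand-in for the real page, and the snippet assumes symfony/dom-crawler is installed through Composer:

```php
<?php
// Assumption: symfony/dom-crawler was installed with Composer,
// so vendor/autoload.php exists next to this script.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup standing in for the real review page.
$html = '<html><body><div class="rating-text"><div><strong>'
      . '<span itemprop="ratingValue">4.8</span></strong> of 5 Stars'
      . '</div></div></body></html>';

$crawler = new Crawler($html);

// Select by the itemprop attribute rather than by position in the tree.
$nodes = $crawler->filterXPath('//span[@itemprop="ratingValue"]');

// Only read the text when something matched; otherwise fall back to null.
$rating = $nodes->count() > 0 ? (float) trim($nodes->text()) : null;

var_dump($rating);
```

If the site changes its layout but keeps the itemprop attribute, this expression should still match, whereas the absolute path would need to be rewritten.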

Edit: it doesn't look like it applies to this specific page, but just wanted to mention a "gotcha" for web crawling. Remember that for some web pages, the contents will have been manipulated (post-load) by JavaScript. In that case, the elements you're looking for may not be seen by DomCrawler at all.

Update:

Here are the results I see. I'm using Goutte rather than getHtmlCurl().

Code:

use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client();
$crawler = $client->request('GET', 'http://estorereview.com.au/s/5951/A-Supplements');
var_dump($crawler->filter('span[itemprop="ratingValue"]')); 
echo $crawler->filter('span[itemprop="ratingValue"]')->text();
die("<br />test completed");

Output:

object(Symfony\Component\DomCrawler\Crawler)[177]
  protected 'uri' => string 'http://estorereview.com.au/s/5951/A-Supplements' (length=47)
  private 'defaultNamespacePrefix' => string 'default' (length=7)
  private 'namespaces' => 
    array (size=0)
      empty
4.8
test completed

So, that works.
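Note that the dump above looks just like the one in the question, yet text() still returns 4.8, so this kind of var_dump() output does not by itself tell you whether anything matched. A minimal sketch of a more direct check, using count() on made-up markup (it assumes symfony/dom-crawler and symfony/css-selector are installed through Composer; the second selector is deliberately one that matches nothing):

```php
<?php
// Assumption: symfony/dom-crawler and symfony/css-selector were installed
// with Composer, so vendor/autoload.php exists next to this script.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup standing in for the fetched page.
$html = '<html><body><strong>'
      . '<span itemprop="ratingValue">4.8</span></strong> of 5 Stars'
      . '</body></html>';

$crawler = new Crawler($html);

$matches = $crawler->filter('span[itemprop="ratingValue"]');
$missing = $crawler->filter('span[itemprop="noSuchProperty"]');

// count() reports how many nodes each filtered Crawler holds.
printf("matches: %d, missing: %d\n", $matches->count(), $missing->count());

// Guard text() with a count check so it is never called on an empty node list.
$text = $matches->count() > 0 ? trim($matches->text()) : null;
echo $text, "\n";
```

In the question's code, replacing var_dump($crawler) with a count() check like this should show whether the filter found the span or whether getHtmlCurl() handed back something other than the expected HTML.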


 