
PHP XPath code optimization

I'm writing a page scraper for a site that is a little slow but has a lot of information I'd like to use for widget purposes (with their permission). Currently it takes roughly 4-5 minutes to execute and parse the ~150 pages I scrape so far. It will be a crontab'd event, and a temporary table is used while the data is being generated, then copied to a "live" table upon completion, so the transition is seamless from a client standpoint. Can you see a way to speed up my code?

//mysql connection stuff here
function dnl2array($domnodelist) {
    $return = array();
    $nb = $domnodelist->length;
    for ($i = 0; $i < $nb; ++$i) {
        $return['pt'][] = utf8_decode(trim($domnodelist->item($i)->nodeValue));
        $return['html'][] = utf8_decode(trim(get_inner_html($domnodelist->item($i))));
    }
    return $return;
}

function get_inner_html( $node ) { 
    $innerHTML= ''; 
    $children = $node->childNodes; 
    foreach ($children as $child) { 
        $innerHTML .= $child->ownerDocument->saveXML( $child ); 
    } 

    return $innerHTML; 
}

// NEW curl instead of file_get_contents()
$c = curl_init($url);
curl_setopt($c, CURLOPT_HEADER, false);
curl_setopt($c, CURLOPT_USERAGENT, getUserAgent());
curl_setopt($c, CURLOPT_FAILONERROR, true);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($c, CURLOPT_AUTOREFERER, true);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_TIMEOUT, 20);

// Grab the data.
$html = curl_exec($c);

// Check if the HTML didn't load right, if it didn't - report an error
if (!$html) {
    echo "<p>cURL error number: " . curl_errno($c) . " on URL: " . $url . "</p>" .
         "<p>cURL error: " . curl_error($c) . "</p>";
}

// $html = file_get_contents($url);
$doc = new DOMDocument;

// Load the html into our object
$doc->loadHTML($html);

$xPath = new DOMXPath( $doc );

// scrape initial page that contains list of everything I want to scrape
$results = $xPath->query('//div[@id="food-plan-contents"]//td[@class="product-name"]');
$test['itams'] = dnl2array($results);

foreach($test['itams']['html'] as $get_url){
    $prepared_url[] = ""; // The url being scraped, modified slightly to gain access to more information -- not SO applicable data to see
}
$i = 0;
foreach ($prepared_url as $url) {

    $c = curl_init($url);
    curl_setopt($c, CURLOPT_HEADER, false);
    curl_setopt($c, CURLOPT_USERAGENT, getUserAgent());
    curl_setopt($c, CURLOPT_FAILONERROR, true);
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($c, CURLOPT_AUTOREFERER, true);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_TIMEOUT, 20);

    // Grab the data.
    $html = curl_exec($c);

    // Check if the HTML didn't load right, if it didn't - report an error
    if (!$html) {
        echo "<p>cURL error number: " . curl_errno($c) . " on URL: " . $url . "</p>" .
             "<p>cURL error: " . curl_error($c) . "</p>";
    }

    // $html = file_get_contents($url);
    $doc = new DOMDocument;
    $doc->loadHTML($html);

    $xPath = new DOMXPath($doc);

    $results = $xPath->query('//h3[@class="product-name"]');
    $arr[$i]['name'] = dnl2array($results);

    $results = $xPath->query('//div[@class="product-specs"]');
    $arr[$i]['desc'] = dnl2array($results);

    $results = $xPath->query('//p[@class="product-image-zoom"]');
    $arr[$i]['img'] = dnl2array($results);

    $results = $xPath->query('//div[@class="groupedTable"]/table/tbody/tr//span[@class="price"]');
    $arr[$i]['price'] = dnl2array($results);
    $arr[$i]['url'] = $url;

    if ($i % 5 == 1) {
        lazy_loader($arr); // lazy loader adds data to the SQL database
        unset($arr);       // keep memory footprint light (server is wimpy -- but free!)
    }

    $i++;
    usleep(50000); // Don't be a bandwidth pig
}

// Get any stragglers
if (count($arr) > 0) {
    lazy_loader($arr);
    $time = time() + (23 * 60 * 60); // Time + 23 hours for "tomorrow's date"
    $tab_name = "sr_data_items_" . date("m_d_y", $time);
    // and copy the table now that the script is finished
    mysql_query("CREATE TABLE IF NOT EXISTS `{$tab_name}` LIKE `sr_data_items_skel`");
    mysql_query("INSERT INTO `{$tab_name}` SELECT * FROM `sr_data_items_skel`");
    mysql_query("TRUNCATE TABLE `sr_data_items_skel`");
}

It sounds like you're mostly dealing with slow server response speeds. At even 2 seconds for each of those 150 pages, you're looking at 300 seconds = 5 minutes. The best way you could speed this up is by using curl_multi_* to run multiple connections at the same time.

So replace the start of the foreach loop (up through the if !html check) with this:

reset($prepared_url); // set internal pointer to first element
$running = array(); // map from curl reference to url
$finished = false;

$mh = curl_multi_init();


$i = 0;
while(!$finished || !empty($running)){
    // add urls to $mh up to a maximum
    while (count($running) < 15 && !$finished)
    {
        $url = next($prepared_url);
        if ($url === FALSE)
        {
            $finished = true;
            break;
        }

        $c = setupcurl($url);

        curl_multi_add_handle($mh, $c);

        $running[(int) $c] = $url; // key by resource id so the handle can be looked up later
    }

    curl_multi_exec($mh, $active);
    $info = curl_multi_info_read($mh);
    if (false === $info) {
        curl_multi_select($mh, 1); // wait for activity instead of busy-looping
        continue; // nothing to report right now
    }

    $c = $info['handle'];
    $url = $running[(int) $c];
    unset($running[(int) $c]);

    $result = $info['result'];
    if ($result != CURLE_OK)
    {
        echo "Curl Error: " . $result . "\n";
        continue;
    }

    $html = curl_multi_getcontent($c);

    $download_time = curl_getinfo($c, CURLINFO_TOTAL_TIME);

    curl_multi_remove_handle($mh, $c);



    // Check if the HTML didn't load right, if it didn't - report an error
    if (!$html) {
        echo "<p>cURL error number: " .curl_errno($c) . " on URL: " . $url ."</p>\n" .
             "<p>cURL error: " . curl_error($c) . "</p>\n";
    }

    curl_close($c);

    <<rest of foreach loop here>>

That will keep 15 downloads going at the same time, and process them as they finish.
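The snippet calls a setupcurl() helper that isn't defined anywhere in the answer; presumably it just wraps the same curl_setopt() calls from the question so a handle can be prepared without being executed. A minimal sketch under that assumption (the options and getUserAgent() come from the original code; the function body itself is only a guess):

function setupcurl($url) {
    // Build a handle with the same options the question already uses,
    // but do not execute it; it will be added to the multi handle instead.
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_HEADER, false);
    curl_setopt($c, CURLOPT_USERAGENT, getUserAgent());
    curl_setopt($c, CURLOPT_FAILONERROR, true);
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($c, CURLOPT_AUTOREFERER, true);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_TIMEOUT, 20);
    return $c;
}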

Anyway – so for the history: please see my comments up top.

As for caching: I'm using dnsmasq to cache DNS lookups.

My setup uses a Chef recipe, which I run through chef-solo. The templates contain my configuration and the attributes contain my settings. It's pretty straightforward.

The beauty of this is that it lets me put the server on DHCP (we use Amazon EC2, which hands out all IPs to the virtual instances via DHCP), and I don't have to make any changes to my application to use it.

I have another recipe to edit /etc/dhclient.conf.

Does this help? Let me know where to elaborate more.

EDIT

Just for clarification: this is not a Ruby solution; I'm just using Chef for configuration management (this part makes sure that services are always set up the same way, etc.). Dnsmasq itself acts as a local DNS server and caches the requests, which speeds things up.

The manual way is as follows:

On Ubuntu:

apt-get install dnsmasq

Then edit /etc/dnsmasq.conf:

listen-address=127.0.0.1
cache-size=5000
domain-needed
bogus-priv
log-queries

Restart the service and verify it's running (ps aux | grep dnsmasq).

Then put it into your /etc/resolv.conf:

nameserver 127.0.0.1

Test:

dig @127.0.0.1 stackoverflow.com

Execute it twice and check the time it took to resolve; the second run should be faster.
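If you'd rather check from PHP instead of dig, the same before/after comparison can be done with gethostbyname(); this is just an illustrative sketch, and it assumes /etc/resolv.conf already points at the local dnsmasq, so actual timings will depend on your resolver setup:

// Time two consecutive lookups of the same host; with a local DNS cache
// the second one should come back noticeably faster.
$host = 'stackoverflow.com';

$t = microtime(true);
gethostbyname($host);
printf("first lookup:  %.1f ms\n", (microtime(true) - $t) * 1000);

$t = microtime(true);
gethostbyname($host);
printf("second lookup: %.1f ms\n", (microtime(true) - $t) * 1000);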

Enjoy! ;)

The first thing to do is to measure how much time is spent downloading the file from the server. Use microtime(true) to get a timestamp both before and after the call

file_get_contents($url);

and subtract the values. Only after you find out that the real bottleneck is inside your code, and not in the network or the remote server, should you start thinking about optimizations.
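A minimal sketch of that measurement (the $url variable is assumed to be one of the scraped URLs from the question):

// Time just the download step, to see whether the network or the
// parsing is the real bottleneck.
$start = microtime(true);
$html = file_get_contents($url);
$elapsed = microtime(true) - $start;
echo "Fetched " . strlen($html) . " bytes from {$url} in " . round($elapsed, 3) . " s\n";

If most of the 2 seconds per page shows up here, no amount of XPath tuning will help much.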

When you say that 150 pages take 5 minutes to load and parse, that's 2 seconds per page, and my wild guess is that most of that time is spent downloading the page from the server.

You should consider using cURL instead of both file_get_contents() and DOMDocument::loadHTMLFile, because it's much faster. See this question: https://stackoverflow.com/questions/555523/file-get-contents-vs-curl-what-has-better-performance

You need to benchmark. DNS is not an issue: if you're scraping 150 pages, DNS will certainly be cached on your resolver for the 4 minutes you need to parse the remaining 149 pages.

Try timing all page transfers with wget/curl; you may be surprised that it's not as fast as you think.

Try requesting in parallel; hitting the site with 4 parallel requests will get your time down to about 1 minute.

If you actually find that XPath is the problem, use preg_split() or even an awk script with popen() to get your values.
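If profiling really does point at the parsing step, here is a sketch of the regex route using preg_match_all() (a close relative of the suggested preg_split()), assuming the markup matches the product-name headings queried earlier; real-world HTML may need a more tolerant pattern:

// Pull product names straight out of the raw HTML with a regex instead of
// building a DOMDocument and running XPath queries over it.
$names = array();
if (preg_match_all('#<h3[^>]*class="product-name"[^>]*>(.*?)</h3>#si', $html, $matches)) {
    foreach ($matches[1] as $raw) {
        $names[] = trim(strip_tags($raw));
    }
}

Regexes over HTML are brittle, so this is only worth it if the markup is stable and the DOM/XPath pass is measurably the slow part.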
