PHP DOM scraping a large amount of data

I have to gather data from over 8000 pages x 25 records per page, which is over 200,000 records in total. The problem is that the server rejects my requests after a period of time. I used simple_html_dom as the scraping library, though I've heard it is rather slow. This is the sample data:

<table>
<tr>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data1</td>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data2</td>
</tr>
<tr>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data3</td>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data4</td>
</tr>
</table>

And the PHP scraping script is:

<?php

$fileName = 'output.csv';

header("Cache-Control: must-revalidate, post-check=0, pre-check=0");
header('Content-Description: File Transfer');
header("Content-type: text/csv");
header("Content-Disposition: attachment; filename={$fileName}");
header("Expires: 0");
header("Pragma: public");

$fh = @fopen('php://output', 'w');

ini_set('max_execution_time', 300000000000);

include("simple_html_dom.php");

for ($i = 1; $i <= 8846; $i++) {

    scrapeThePage('url_to_scrape/?page=' . $i);
    if ($i % 2 == 0)
        sleep(10);

}

function scrapeThePage($page)
{

    global $theData;

    $html = new simple_html_dom();
    $html->load_file($page);

    foreach ($html->find('table tr') as $row) {
        $rowData = array();
        foreach ($row->find('td[style="font-size:12px;border-bottom:1px dashed #a2a2a2;"]') as $cell) {
            $rowData[] = $cell->innertext;
        }

        $theData[] = $rowData;
    }
}

foreach (array_filter($theData) as $fields) {
    fputcsv($fh, $fields);
}
fclose($fh);
exit();

?>

As you can see, I added a 10-second sleep every two pages in the for loop so I don't stress the server with requests. When the browser prompts me for the CSV download, the file contains these lines:

Warning: file_get_contents(url_to_scrape/?page=8846): failed to open stream: HTTP request failed! HTTP/1.0 500 Internal Server Error
Fatal error: Call to a member function find() on a non-object in on line

Page 8846 does exist and is the last page to scrape. The page number in the error above varies; sometimes I receive the error at page 800, for example. Can someone please give me an idea of what I am doing wrong here? Any advice would be helpful.

The fatal error is probably thrown because $html or $row is not an object; it becomes null. You should always check that an object was created properly. The method $html->load_file($page) may also return false if loading a page fails.
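
For example, a minimal guard (assuming, as suggested above, that load_file() returns false when the underlying request fails):

$html = new simple_html_dom();

// Skip this page instead of calling find() on a broken object.
if ($html->load_file($page) === false) {
    return;
}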

Also get familiar with instanceof - it becomes very helpful sometimes.
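
For instance, the file_get_html() helper from simple_html_dom returns false when loading fails, so an instanceof check separates a usable DOM object from a failed load (a sketch, not taken from the question's code):

$html = file_get_html($page);

// Anything that is not a DOM object means the load failed.
if (!($html instanceof simple_html_dom)) {
    return;
}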

Another edit: Your code has no data validation AT ALL. There is nowhere that you check for uninitialized variables, objects that failed to load, or methods that returned an error. You should always include such checks in your code.
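
Putting those checks together, a minimal sketch of a validated scrapeThePage() might look like this (the error_log() calls and the clear() cleanup are illustrative additions, not part of the question's code):

function scrapeThePage($page)
{
    global $theData;

    $html = new simple_html_dom();

    // Check that the page actually loaded before touching the DOM.
    if ($html->load_file($page) === false) {
        error_log("Failed to load {$page}, skipping.");
        return;
    }

    $rows = $html->find('table tr');

    // find() returns an empty array when the markup is not what we expect.
    if (empty($rows)) {
        error_log("No table rows found on {$page}.");
        return;
    }

    foreach ($rows as $row) {
        $rowData = array();
        foreach ($row->find('td[style="font-size:12px;border-bottom:1px dashed #a2a2a2;"]') as $cell) {
            $rowData[] = $cell->innertext;
        }
        $theData[] = $rowData;
    }

    // Free the DOM; simple_html_dom can leak memory across thousands of pages.
    $html->clear();
    unset($html);
}

With the early returns in place, a page that fails is simply skipped and the loop moves on, instead of one bad response killing the whole 8846-page run.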
