
PHP Dom Scraping large amount of data

I have to gather some data from over 8,000 pages with 25 records per page, so over 200,000 records in total. The problem is that the server rejects my requests after a period of time. I used simple_html_dom as the scraping library, though I've heard it is rather slow. This is the sample data:

<table>
<tr>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data1</td>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data2</td>
</tr>
<tr>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data3</td>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data4</td>
</tr>
</table>

And the PHP scraping script is:

<?php

$fileName = 'output.csv';

header("Cache-Control: must-revalidate, post-check=0, pre-check=0");
header('Content-Description: File Transfer');
header("Content-type: text/csv");
header("Content-Disposition: attachment; filename={$fileName}");
header("Expires: 0");
header("Pragma: public");

$fh = @fopen('php://output', 'w');


ini_set('max_execution_time', 300000000000);

include("simple_html_dom.php");

for ($i = 1; $i <= 8846; $i++) {

    scrapeThePage('url_to_scrape/?page=' . $i);
    if ($i % 2 == 0)
        sleep(10);

}

function scrapeThePage($page)
{

    global $theData;


    $html = new simple_html_dom();
    $html->load_file($page);

    foreach ($html->find('table tr') as $row) {
        $rowData = array();
        foreach ($row->find('td[style="font-size:12px;border-bottom:1px dashed #a2a2a2;"]') as $cell) {
            $rowData[] = $cell->innertext;

        }

        $theData[] = $rowData;
    }
}

foreach (array_filter($theData) as $fields) {
    fputcsv($fh, $fields);
}
fclose($fh);
exit();

?>

As you can see, I added a 10-second sleep interval in the for loop so I won't stress the server with requests. When the script prompts me for the CSV download, I find these lines inside the file:

Warning: file_get_contents(url_to_scrape/?page=8846): failed to open stream: HTTP request failed! HTTP/1.0 500 Internal Server Error
Fatal error: Call to a member function find() on a non-object in D:\www\htdocs\ucmr\simple_html_dom.php on line 1113

Page 8846 does exist and it is the last page of the site. The page number in the error varies; sometimes I get the error at page 800, for example. Can someone give me an idea of what I am doing wrong here? Any advice would be helpful.

The fatal error is probably thrown because $html or $row is not an object; it becomes null. You should always check that an object was created properly. The method $html->load_file($page) may also return false if loading the page fails.
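A minimal sketch of that kind of guard (assuming only that PHP's file_get_contents() returns false on failure, and using simple_html_dom's load() method to parse the already-fetched string instead of load_file()):

// Fetch first: file_get_contents() returns false on any HTTP failure.
$contents = file_get_contents($page);
if ($contents === false || trim($contents) === '') {
    error_log("Failed to fetch {$page}, skipping"); // log it and move on
    return; // never call find() on a page that did not load
}

$html = new simple_html_dom();
$html->load($contents); // parse the in-memory string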

Also get familiar with instanceof; it can be very helpful at times.
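For example (assuming the node class is called simple_html_dom_node, as in the stock library):

foreach ($html->find('table tr') as $row) {
    if (!($row instanceof simple_html_dom_node)) {
        continue; // skip anything that is not a proper DOM node
    }
    // safe to call $row->find() here
}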

Another edit: your code has no data validation at all. There is no place where you check for uninitialized variables, objects that failed to load, or methods that returned errors. You should always do that in your code. A validated version of scrapeThePage() is sketched below.
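As an illustration only (the retry loop, the empty-row check, and the clear() call are additions, not part of the original code), scrapeThePage() could validate each step like this:

function scrapeThePage($page, $maxRetries = 3)
{
    global $theData;

    // Retry the fetch a few times before giving up on the page.
    $contents = false;
    for ($attempt = 1; $attempt <= $maxRetries; $attempt++) {
        $contents = file_get_contents($page);
        if ($contents !== false && $contents !== '') {
            break; // got a usable response
        }
        sleep(5); // back off before retrying a failed request
    }
    if ($contents === false || $contents === '') {
        error_log("Giving up on {$page} after {$maxRetries} attempts");
        return;
    }

    $html = new simple_html_dom();
    $html->load($contents);

    foreach ($html->find('table tr') as $row) {
        $rowData = array();
        foreach ($row->find('td[style="font-size:12px;border-bottom:1px dashed #a2a2a2;"]') as $cell) {
            $rowData[] = $cell->innertext;
        }
        if (!empty($rowData)) { // keep only rows that actually matched
            $theData[] = $rowData;
        }
    }

    $html->clear(); // free simple_html_dom's internal references between pages
    unset($html);
}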
