
PHP Dom Scraping large amount of data

I have to gather some data from over 8,000 pages with 25 records per page, so over 200,000 records in total. The problem is that the server rejects my requests after a period of time. I used simple_html_dom as the scraping library, though I've heard it is rather slow. This is the sample data:

<table>
<tr>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data1</td>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data2</td>
</tr>
<tr>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data3</td>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data4</td>
</tr>
</table>

And the PHP scraping script is:

<?php

$fileName = 'output.csv';

header("Cache-Control: must-revalidate, post-check=0, pre-check=0");
header('Content-Description: File Transfer');
header("Content-type: text/csv");
header("Content-Disposition: attachment; filename={$fileName}");
header("Expires: 0");
header("Pragma: public");

$fh = @fopen('php://output', 'w');


ini_set('max_execution_time', 300000000000);

include("simple_html_dom.php");

for ($i = 1; $i <= 8846; $i++) {

    scrapeThePage('url_to_scrape/?page=' . $i);
    if ($i % 2 == 0)
        sleep(10);

}

function scrapeThePage($page)
{

    global $theData;


    $html = new simple_html_dom();
    $html->load_file($page);

    foreach ($html->find('table tr') as $row) {
        $rowData = array();
        foreach ($row->find('td[style="font-size:12px;border-bottom:1px dashed #a2a2a2;"]') as $cell) {
            $rowData[] = $cell->innertext;

        }

        $theData[] = $rowData;
    }
}

foreach (array_filter($theData) as $fields) {
    fputcsv($fh, $fields);
}
fclose($fh);
exit();

?>

As you can see, I added a 10-second sleep interval in the for loop so I won't stress the server with requests. When the script prompts me for the CSV download, I find these lines inside the file:

Warning: file_get_contents(url_to_scrape/?page=8846): failed to open stream: HTTP request failed! HTTP/1.0 500 Internal Server Error
Fatal error: Call to a member function find() on a non-object in D:\www\htdocs\ucmr\simple_html_dom.php on line 1113

Page 8846 does exist and it is the last page of the site. The page number in the error varies; sometimes I get the error at page 800, for example. Can someone give me an idea of what I am doing wrong here? Any advice would be helpful.

The fatal error is probably thrown because $html or $row is not an object; it becomes null. You should always check that an object was created properly. The method $html->load_file($page) may also return false if loading the page fails.
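A minimal sketch of that kind of guard (assuming only that PHP's file_get_contents() returns false on failure, and using simple_html_dom's load() method to parse the already-fetched string instead of load_file()):

// Fetch first: file_get_contents() returns false on any HTTP failure.
$contents = file_get_contents($page);
if ($contents === false || trim($contents) === '') {
    error_log("Failed to fetch {$page}, skipping"); // log it and move on
    return; // never call find() on a page that did not load
}

$html = new simple_html_dom();
$html->load($contents); // parse the in-memory string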

Also get familiar with instanceof; it can be very helpful at times.
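For example (assuming the node class is called simple_html_dom_node, as in the stock library):

foreach ($html->find('table tr') as $row) {
    if (!($row instanceof simple_html_dom_node)) {
        continue; // skip anything that is not a proper DOM node
    }
    // safe to call $row->find() here
}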

Another edit: your code has no data validation at all. There is no place where you check for uninitialized variables, objects that failed to load, or methods that returned errors. You should always do that in your code. A validated version of scrapeThePage() is sketched below.
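As an illustration only (the retry loop, the empty-row check, and the clear() call are additions, not part of the original code), scrapeThePage() could validate each step like this:

function scrapeThePage($page, $maxRetries = 3)
{
    global $theData;

    // Retry the fetch a few times before giving up on the page.
    $contents = false;
    for ($attempt = 1; $attempt <= $maxRetries; $attempt++) {
        $contents = file_get_contents($page);
        if ($contents !== false && $contents !== '') {
            break; // got a usable response
        }
        sleep(5); // back off before retrying a failed request
    }
    if ($contents === false || $contents === '') {
        error_log("Giving up on {$page} after {$maxRetries} attempts");
        return;
    }

    $html = new simple_html_dom();
    $html->load($contents);

    foreach ($html->find('table tr') as $row) {
        $rowData = array();
        foreach ($row->find('td[style="font-size:12px;border-bottom:1px dashed #a2a2a2;"]') as $cell) {
            $rowData[] = $cell->innertext;
        }
        if (!empty($rowData)) { // keep only rows that actually matched
            $theData[] = $rowData;
        }
    }

    $html->clear(); // free simple_html_dom's internal references between pages
    unset($html);
}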
