
Web crawling using PHP

I am writing a simple crawler that fetches links to articles from engadget.com, and for each article I save the entire HTML document:

    include 'simple_html_dom.php';  // Simple HTML DOM parser
    $file_num = 0;                  // counter used to name the saved files

    $target_url = "http://www.engadget.com/all/page/1/";
    $html = new simple_html_dom();
    $html->load_file($target_url);
    // The article list is embedded as JSON-LD inside a <script> tag.
    foreach($html->find('script') as $script){
        if($script->type == "application/ld+json"){
            $json_data = strip_tags($script);
            if($content = json_decode($json_data)){
                $listElements = $content->itemListElement;
                foreach($listElements as $element){
                    echo "Running..";
                    $article_url = $element->url;
                    $article_page = new simple_html_dom();
                    try{
                        $article_page->load_file($article_url);
                    } catch (Exception $e) {
                        // On failure, wait 20 seconds and retry the same URL once.
                        sleep(20);
                        $article_page->load_file($article_url);
                    } finally {
                        // Save the raw HTML of the article page to disk.
                        $filename = "raw_file".$file_num.".txt";
                        $file = fopen("C:\\xampp\\htdocs\\files\\".$filename,"w");
                        fwrite($file, $article_page);
                        fclose($file);
                        $file_num++;
                    }
                }
            }
        }
    }

Most of the time this works fine, but sometimes a page fails to load and I get a 503 error. To work around this, I currently suspend execution for 20 seconds before retrying the same URL. This has significantly reduced the failures, but sometimes the second try fails as well. Is there a better way to make sure I get the data from the page? Is there a way to keep trying until the page responds?

The website might have set up a request interval limitation to avoid data harvesting. For a reason... So, don't just copy someone else's site contents :)

Or, if there is an API, use that to load/get the contents.

(Technically, you could let your script loop requests until it gets a correct response, using intervals and resetting the time limit to keep PHP from stopping.)
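
A minimal sketch of that loop-and-reset idea (the helper name, the retry limit, and the 60-second limit are just illustrative assumptions, not part of the original answer):

    // Hypothetical helper: keep retrying a fetch, pausing between attempts and
    // resetting PHP's execution time limit so the script is not stopped mid-loop.
    function fetch_with_retries($url, $max_tries = 5, $interval = 20) {
        $html = false;
        for ($try = 1; $try <= $max_tries && $html === false; $try++) {
            set_time_limit(60);                        // reset the max_execution_time counter for this attempt
            $html = @file_get_contents($url);          // returns false on failure
            if ($html === false && $try < $max_tries) {
                sleep($interval);                      // wait before the next attempt
            }
        }
        return $html;                                  // false if every attempt failed
    }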

It might be a good idea to increase the interval dynamically every time an exception occurs and then try again, something like:

    foreach ($listElements as $element) {
        echo "Running..";
        $article_url = $element->url;
        $article_page = new simple_html_dom();
        $interval = 0;
        $tries = 0;
        $success = false;

        // Retry with a growing pause until the page loads or we give up.
        while (!$success && $tries < 5) {
            try {
                sleep($interval);
                $article_page->load_file($article_url);
                $success = true;
            } catch (Exception $e) {
                $interval += 20;   // wait 20 seconds longer on each failed attempt
                $tries++;
            }
        }

        // Only write the file once the page has actually been fetched.
        if ($success) {
            $filename = "raw_file".$file_num.".txt";
            $file = fopen("C:\\xampp\\htdocs\\files\\".$filename, "w");
            fwrite($file, $article_page);
            fclose($file);
            $file_num++;
        }
    }
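
If you want to detect the 503 status directly instead of waiting for an exception, one possible variation (a sketch only; the function name and the limit of 5 attempts are assumptions, and cURL is swapped in for the simple_html_dom loader) is to fetch the page with cURL, check the HTTP code, and grow the delay between attempts:

    // Sketch: fetch a URL with cURL, retrying with a growing delay while the
    // server returns an error status such as 503.
    function fetch_article($url, $max_tries = 5) {
        $delay = 0;
        for ($try = 0; $try < $max_tries; $try++) {
            sleep($delay);
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            $body = curl_exec($ch);
            $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
            curl_close($ch);
            if ($body !== false && $status == 200) {
                return $body;        // got the page, return the raw HTML
            }
            $delay += 20;            // back off a little longer each time
        }
        return false;                // give up after $max_tries attempts
    }

The returned HTML string could then be written to the raw_file<N>.txt files as in the question, or passed to str_get_html() if the DOM is still needed.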
