
Web crawling using PHP

I am writing a simple crawler that fetches links to articles from engadget.com, and for each article I save the entire HTML document:

    include 'simple_html_dom.php';

    $target_url = "http://www.engadget.com/all/page/1/";
    $file_num = 1;
    $html = new simple_html_dom();
    $html->load_file($target_url);
    // The article list is embedded in a JSON-LD <script> block on the listing page
    foreach($html->find('script') as $script){
        if($script->type == "application/ld+json"){
            $json_data = strip_tags($script);
            if($content = json_decode($json_data)){
                $listElements = $content->itemListElement;
                foreach($listElements as $element){
                    echo "Running..";
                    $article_url = $element->url;
                    $article_page = new simple_html_dom();
                    try{
                        $article_page->load_file($article_url);
                    } catch (Exception $e) {
                        // On a failed load (e.g. a 503), wait 20 seconds and retry once
                        sleep(20);
                        $article_page->load_file($article_url);
                    } finally {
                        // Dump the raw HTML of the article to a numbered file
                        $filename = "raw_file".$file_num.".txt";
                        $file = fopen("C:\\xampp\\htdocs\\files\\".$filename,"w");
                        fwrite($file, $article_page);
                        fclose($file);
                        $file_num++;
                    }
                }
            }
        }
    }

Most of the time this works fine, but sometimes a page fails to load and I get a 503 error. To work around this, I currently suspend execution for 20 seconds before retrying the same URL. This has significantly reduced the failures, but sometimes the second try fails as well. Is there a better way to make sure I get the data from the page? Is there a way to keep trying until the page responds?

The website has probably set up a request-rate limit to prevent data harvesting. For a reason... so don't just copy someone else's site content :)
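If you move the download off load_file() and onto cURL, you can inspect the HTTP status code and any Retry-After header the server sends with a 503, instead of sleeping a fixed 20 seconds. Here is only a sketch of that approach; the fetch_with_backoff() name and the retry/delay defaults are mine, not part of your code:

    // Hypothetical helper: fetch a URL with cURL, backing off when the server answers 503
    function fetch_with_backoff($url, $max_tries = 5, $default_delay = 20) {
        for ($try = 1; $try <= $max_tries; $try++) {
            $retry_after = $default_delay;
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            // Capture response headers so a Retry-After value can be honoured if one is sent
            curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $header) use (&$retry_after) {
                if (stripos($header, 'Retry-After:') === 0) {
                    $retry_after = (int) trim(substr($header, strlen('Retry-After:')));
                }
                return strlen($header);
            });
            $body = curl_exec($ch);
            $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
            curl_close($ch);

            if ($status == 200 && $body !== false) {
                return $body;                      // success: return the raw HTML
            }
            sleep(max($retry_after, 1));           // wait as long as the server asks, then retry
        }
        return false;                              // give up after $max_tries attempts
    }

The returned string can then be handed to simple_html_dom with str_get_html($body) instead of load_file().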

Or, if there is an API, use that to fetch the content.
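For instance, if the site publishes a feed you can read the article links from it instead of scraping the listing page. This assumes an RSS feed URL such as https://www.engadget.com/rss.xml, which you should verify against the live site:

    // Sketch only: the feed URL is an assumption and may differ from what the site serves today
    $feed = simplexml_load_file("https://www.engadget.com/rss.xml");
    if ($feed !== false) {
        foreach ($feed->channel->item as $item) {
            // Each <item> carries the article's title and link
            echo $item->title . " -> " . $item->link . "\n";
        }
    }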

(Technically, you could let your script keep looping requests until it gets a good response, waiting between attempts and resetting the time limit so PHP does not stop the script.)
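A minimal sketch of that idea, reusing $article_url from your code above: set_time_limit() resets PHP's execution-time budget on every attempt, so a long run of sleeps does not hit max_execution_time (the 5-attempt cap and 20-second step are just illustrative):

    $tries = 0;
    $raw_html = false;
    while ($raw_html === false && $tries < 5) {
        set_time_limit(60);                 // reset the execution-time budget for this attempt
        $raw_html = @file_get_contents($article_url);
        if ($raw_html === false) {
            $tries++;
            sleep(20 * $tries);             // wait longer after every failed attempt
        }
    }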

It may be a good idea to increase the interval dynamically every time an exception occurs and then try again, something like:

    foreach ($listElements as $element) {
        echo "Running..";
        $article_url = $element->url;
        $article_page = new simple_html_dom();
        $interval = 0;
        $tries = 0;
        $success = false;

        // Keep retrying with a growing delay, up to 5 attempts
        while (!$success && $tries < 5) {
            try {
                sleep($interval);
                $article_page->load_file($article_url);
                $success = true;
            } catch (Exception $e) {
                // Back off 20 more seconds on each failed attempt
                $interval += 20;
                $tries++;
            }
        }

        // Write the file only once the page has actually been fetched
        if ($success) {
            $filename = "raw_file".$file_num.".txt";
            $file = fopen("C:\\xampp\\htdocs\\files\\".$filename, "w");
            fwrite($file, $article_page);
            fclose($file);
            $file_num++;
        }
    }
