PHP cURL and Simple HTML Dom

I'm sorry, but I only speak a little English.

I use this:

<?php

function file_get_contents_curl ( $url ) {

    $ch = curl_init ();

    curl_setopt ( $ch, CURLOPT_AUTOREFERER, TRUE );
    curl_setopt ( $ch, CURLOPT_HEADER, 0 );
    curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, 1 );
    curl_setopt ( $ch, CURLOPT_URL, $url );
    curl_setopt ( $ch, CURLOPT_FOLLOWLOCATION, TRUE );
    curl_setopt ( $ch, CURLOPT_SSL_VERIFYPEER, 0 ); //
    curl_setopt ( $ch, CURLOPT_SSL_VERIFYHOST, 0 ); //
    curl_setopt ( $ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; rv:71.0) Gecko/20100101 Firefox/71.0' ); // spoof

    $data = curl_exec ( $ch );

    curl_close ( $ch );

    return $data;

}

include ( __DIR__ . '/simplehtmldom_1_9_1/simple_html_dom.php' );

// 1. OK:     $url = 'https://www.p***hub.com/model/ashley-porner';
// 2. OK:     $url = 'https://www.p***hub.com/model/ashley-diamond-and-diamond-king';
// 3. NOT OK: $url = 'https://www.p***hub.com/model/ambercashh';
// 4. NOT OK: $url = 'https://www.p***hub.com/model/autumn-raine';

$html = file_get_contents_curl ( $url );
$html = str_get_html ( $html );

var_dump ( $html ); // boolean(false) if NOT OK

?>

URLs 1 and 2 work, but URLs 3 and 4 do not: nothing is displayed, and str_get_html() returns false.

I tried raising MAX_FILE_SIZE from 600000 to 6000000 in ~/simplehtmldom_1_9_1/simple_html_dom.php, but with the new value the page took much longer to load and then crashed my website:

// OLD: defined('MAX_FILE_SIZE') || define('MAX_FILE_SIZE', 600000);
defined('MAX_FILE_SIZE') || define('MAX_FILE_SIZE', 6000000); // NEW

What is the problem?

Thanks.

As a test you can run the following. The URLs will obviously need editing, but it shows reasonable performance, so the reason you were running out of memory must lie in code not included in the question.

<?php


    function file_get_contents_curl ( $url ) {
        $ch = curl_init ();
        curl_setopt ( $ch, CURLOPT_AUTOREFERER, TRUE );
        curl_setopt ( $ch, CURLOPT_HEADER, 0 );
        curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, 1 );
        curl_setopt ( $ch, CURLOPT_URL, $url );
        curl_setopt ( $ch, CURLOPT_FOLLOWLOCATION, TRUE );
        curl_setopt ( $ch, CURLOPT_SSL_VERIFYPEER, 0 );
        curl_setopt ( $ch, CURLOPT_SSL_VERIFYHOST, 0 );
        curl_setopt ( $ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; rv:71.0) Gecko/20100101 Firefox/71.0' ); // spoof
        $data = curl_exec ( $ch );
        curl_close ( $ch );
        return $data;
    }


    $start=time();
    $memstart=memory_get_usage();


    $baseurl='https://www.*******.com/model/';
    $models=['ashley-porner','ashley-diamond-and-diamond-king','ambercashh','autumn-raine'];


    libxml_use_internal_errors( true );
    $dom=new DOMDocument;
    $dom->validateOnParse=false;
    $dom->recover=true;
    $dom->strictErrorChecking=false;


    /* do some expensive DOM operations to test performance */
    $query='//section[ @class="topProfileHeader" ]/div/div/div[ @class="content-columns" ]/div[ @class="infoPiece" ]';


    foreach( $models as $model ){
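        $url = $baseurl . $model;
        $res = file_get_contents_curl( $url );

        /* curl_exec returns false on failure; skip rather than parse nothing */
        if( $res === false || $res === '' ) continue;

        $dom->loadHTML( $res );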
        $xp=new DOMXPath( $dom );
        libxml_clear_errors();

        $col=$xp->query( $query );
        if( $col->length > 0 ){
            foreach( $col as $node ) {
                echo str_repeat( '.', strlen( $node->nodeValue ) ) . '<br />';
            }
        }
    }

    $memory=memory_get_usage() - $memstart;
    printf(
        '<div style="padding:1rem; border:1px solid red;">Script took approx: %ss - consumed: %sMb, Peak memory consumption: %sMb</div>', 
        ( time() - $start ), 
        round( $memory / pow(1024,2), 2 ), 
        round( memory_get_peak_usage() / pow(1024,2), 2 )
    );

?>  

Result...
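To see why str_get_html() gives false for some pages only, it may help to check the downloaded string before parsing it. As far as I can tell from the 1.9.1 source, str_get_html() returns false when the string is empty or longer than MAX_FILE_SIZE bytes, so a profile page bigger than 600000 bytes would fail while a smaller one parses. The sketch below is only a diagnostic under that assumption; diagnose_html_for_parser() is a hypothetical helper name, not part of Simple HTML DOM, and the 600000 default simply mirrors the value quoted in the question.

```php
<?php

/**
 * Hypothetical helper (not part of Simple HTML DOM): report why a fetched
 * page would probably be rejected before it is handed to str_get_html().
 *
 * $html        the return value of curl_exec(): a string, or false on failure
 * $maxFileSize assumed to match the library's MAX_FILE_SIZE constant
 */
function diagnose_html_for_parser( $html, int $maxFileSize = 600000 ): string {

    // curl_exec() returns false when the transfer itself failed.
    if ( $html === false ) {
        return 'fetch failed';
    }

    // An empty body cannot be parsed either.
    if ( $html === '' ) {
        return 'empty response';
    }

    // strlen() counts bytes, which is what the size limit is compared against.
    $bytes = strlen( $html );
    if ( $bytes > $maxFileSize ) {
        return sprintf( 'too large: %d bytes exceeds the %d byte limit', $bytes, $maxFileSize );
    }

    return 'ok';
}

// Usage with a dummy 700000-byte string standing in for a large page.
echo diagnose_html_for_parser( str_repeat( 'a', 700000 ) ), PHP_EOL;
// prints "too large: 700000 bytes exceeds the 600000 byte limit"
```

If the report is "too large", note that the library line quoted in the question only defines MAX_FILE_SIZE when it is not already defined, so calling define('MAX_FILE_SIZE', ...) in your own script before the include should raise the limit without editing the library file. Parsing a larger page with Simple HTML DOM will still cost more time and memory, which is one reason the DOMDocument approach above may be the better fit.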
