
PHP cURL and Simple HTML Dom

Sorry, I only speak a little English.

I use this:

<?php

function file_get_contents_curl ( $url ) {

    $ch = curl_init ();

    curl_setopt ( $ch, CURLOPT_AUTOREFERER, TRUE );
    curl_setopt ( $ch, CURLOPT_HEADER, 0 );
    curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, 1 );
    curl_setopt ( $ch, CURLOPT_URL, $url );
    curl_setopt ( $ch, CURLOPT_FOLLOWLOCATION, TRUE );
    curl_setopt ( $ch, CURLOPT_SSL_VERIFYPEER, 0 ); // skip certificate verification (insecure)
    curl_setopt ( $ch, CURLOPT_SSL_VERIFYHOST, 0 ); // skip host name verification (insecure)
    curl_setopt ( $ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; rv:71.0) Gecko/20100101 Firefox/71.0' ); // spoof

    $data = curl_exec ( $ch );

    curl_close ( $ch );

    return $data;

}

include ( __DIR__ . '/simplehtmldom_1_9_1/simple_html_dom.php' );

// 1. OK:     $url = 'https://www.p***hub.com/model/ashley-porner';
// 2. OK:     $url = 'https://www.p***hub.com/model/ashley-diamond-and-diamond-king';
// 3. NOT OK: $url = 'https://www.p***hub.com/model/ambercashh';
// 4. NOT OK: $url = 'https://www.p***hub.com/model/autumn-raine';

$html = file_get_contents_curl ( $url );
$html = str_get_html ( $html );

var_dump ( $html ); // boolean(false) if NOT OK

?>

URLs 1 and 2 work, but URLs 3 and 4 do not: nothing is displayed, and str_get_html returns false.

I tried changing MAX_FILE_SIZE from 600000 to 6000000 (in ~/simplehtmldom_1_9_1/simple_html_dom.php), but with the new value the page takes much longer to load and then my website crashes:

// OLD: defined('MAX_FILE_SIZE') || define('MAX_FILE_SIZE', 600000);
defined('MAX_FILE_SIZE') || define('MAX_FILE_SIZE', 6000000); // NEW
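
A minimal sketch of what this constant appears to control, assuming that str_get_html() in Simple HTML DOM 1.9.1 returns false when its input is empty or longer than MAX_FILE_SIZE bytes. The helper name explain_parse_failure is hypothetical and not part of the library; it only reports which of those conditions a fetched response would hit.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): reports why
// str_get_html() would be expected to return false for a response.
// $data is whatever curl_exec() returned: a string, or false on failure.
function explain_parse_failure($data, int $maxFileSize): ?string
{
    if ($data === false) {
        return 'curl_exec() failed; inspect curl_error() before curl_close()';
    }
    if (empty($data)) {
        return 'empty response body';
    }
    $size = strlen($data);
    if ($size > $maxFileSize) {
        return sprintf('document is %d bytes, above the %d-byte limit', $size, $maxFileSize);
    }
    return null; // neither assumed condition applies
}

// The library defines MAX_FILE_SIZE only when it is not defined yet, so
// defining it before include() avoids editing the library file itself.
defined('MAX_FILE_SIZE') || define('MAX_FILE_SIZE', 6000000);

// A 700000-byte stand-in response checked against the default limit.
echo explain_parse_failure(str_repeat('x', 700000), 600000), PHP_EOL;
```

Calling the helper on the result of file_get_contents_curl ( $url ) with the limit in use would show whether the failing pages are simply larger than the limit or whether the request itself failed.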

What is the problem?

Thanks.

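As a test, you can run the following. The URL obviously needs to be edited, but it shows reasonable performance, so the reason you are running out of memory must lie in code that was not included.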

<?php


    function file_get_contents_curl ( $url ) {
        $ch = curl_init ();
        curl_setopt ( $ch, CURLOPT_AUTOREFERER, TRUE );
        curl_setopt ( $ch, CURLOPT_HEADER, 0 );
        curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, 1 );
        curl_setopt ( $ch, CURLOPT_URL, $url );
        curl_setopt ( $ch, CURLOPT_FOLLOWLOCATION, TRUE );
        curl_setopt ( $ch, CURLOPT_SSL_VERIFYPEER, 0 );
        curl_setopt ( $ch, CURLOPT_SSL_VERIFYHOST, 0 );
        curl_setopt ( $ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; rv:71.0) Gecko/20100101 Firefox/71.0' ); // spoof
        $data = curl_exec ( $ch );
        curl_close ( $ch );
        return $data;
    }


    $start=time();
    $memstart=memory_get_usage();


    $baseurl='https://www.*******.com/model/';
    $models=['ashley-porner','ashley-diamond-and-diamond-king','ambercashh','autumn-raine'];


    libxml_use_internal_errors( true );
    $dom=new DOMDocument;
    $dom->validateOnParse=false;
    $dom->recover=true;
    $dom->strictErrorChecking=false;


    /* do some expensive DOM operations to test performance */
    $query='//section[ @class="topProfileHeader" ]/div/div/div[ @class="content-columns" ]/div[ @class="infoPiece" ]';


    foreach( $models as $model ){
        $url = $baseurl . $model;
        $res = file_get_contents_curl( $url );

        $dom->loadHTML( $res );
        $xp=new DOMXPath( $dom );
        libxml_clear_errors();

        $col=$xp->query( $query );
        if( $col->length > 0 ){
            foreach( $col as $node ) {
                echo str_repeat( '.', strlen( $node->nodeValue ) ) . '<br />';
            }
        }
    }

    $memory=memory_get_usage() - $memstart;
    printf(
        '<div style="padding:1rem; border:1px solid red;">Script took approx: %ss - consumed: %sMb, Peak memory consumption: %sMb</div>', 
        ( time() - $start ), 
        round( $memory / pow(1024,2), 2 ), 
        round( memory_get_peak_usage() / pow(1024,2), 2 )
    );

?>  

Result...
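
The XPath query from the test script can also be tried without any network access. The following is a sketch only: the sample markup is invented for illustration (only the class names are taken from the query above), and query_info_pieces is a hypothetical helper name.

```php
<?php
// Invented stand-in markup; only the class names come from the query above.
$sample = <<<HTML
<html><body>
<section class="topProfileHeader"><div><div>
  <div class="content-columns">
    <div class="infoPiece">Alpha</div>
    <div class="infoPiece">Beta</div>
  </div>
</div></div></section>
</body></html>
HTML;

// Hypothetical helper: returns the trimmed text of every matching node.
function query_info_pieces(string $html): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects empty input
    }

    // Collect libxml warnings (e.g. about HTML5 tags) instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($dom);
    $query = '//section[@class="topProfileHeader"]/div/div'
           . '/div[@class="content-columns"]/div[@class="infoPiece"]';

    $values = [];
    foreach ($xp->query($query) as $node) {
        $values[] = trim($node->nodeValue);
    }
    return $values;
}

print_r(query_info_pieces($sample));
```

Creating a fresh DOMDocument per call, rather than reusing one across the loop, keeps each parse independent; whether that matters for memory on the real pages is not something this sketch measures.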


Note: technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site's URL or the address of the original post.
