
How to get page content using cURL?

I would like to scrape the content of this Google search result page using cURL. I've tried setting different user agents and other options, but I just can't seem to get the content of that page, as I often get redirected or get a "page moved" error.

I believe it has something to do with the fact that the query string gets encoded somewhere, but I'm really not sure how to get around that.

    // $url is the same as the link above
    $ch = curl_init();
    $user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_HEADER, 0);              // don't include response headers
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);      // follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);      // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
    curl_setopt($ch, CURLOPT_TIMEOUT, 120);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
    echo curl_exec($ch);

What do I need to do to get my PHP code to show the exact content of the page as I would see it in my browser? What am I missing? Can anyone point me in the right direction?

I've seen similar questions on SO, but none with an answer that could help me.

EDIT:

I tried just opening the link using the Selenium WebDriver, and that gives the same results as cURL. I still think this has to do with special characters in the query string getting mangled somewhere in the process.
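For reference, here is a minimal sketch of how I would expect the query parameters to be encoded before they reach cURL, using http_build_query() so each value is percent-encoded exactly once (the search term below is just a made-up example):

    // Hypothetical search term containing special characters.
    $strSearch = 'C++ "smart pointers" & templates';

    // http_build_query() percent-encodes every parameter exactly once,
    // so nothing in the query string gets double-encoded or left raw.
    $query = http_build_query(array(
        'q'     => $strSearch,
        'hl'    => 'en',
        'start' => 0,
        'sa'    => 'N',
    ));
    $url = 'https://www.google.com/search?' . $query;

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);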

This is how:

    /**
     * Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
     * array containing the HTTP server response header fields and content.
     */
    function get_web_page( $url )
    {
        $user_agent='Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';

        $options = array(
            CURLOPT_CUSTOMREQUEST  => "GET",        // set request type: POST or GET
            CURLOPT_POST           => false,        // set to GET
            CURLOPT_USERAGENT      => $user_agent,  // set user agent
            CURLOPT_COOKIEFILE     => "cookie.txt", // set cookie file
            CURLOPT_COOKIEJAR      => "cookie.txt", // set cookie jar
            CURLOPT_RETURNTRANSFER => true,         // return web page
            CURLOPT_HEADER         => false,        // don't return headers
            CURLOPT_FOLLOWLOCATION => true,         // follow redirects
            CURLOPT_ENCODING       => "",           // handle all encodings
            CURLOPT_AUTOREFERER    => true,         // set referer on redirect
            CURLOPT_CONNECTTIMEOUT => 120,          // timeout on connect
            CURLOPT_TIMEOUT        => 120,          // timeout on response
            CURLOPT_MAXREDIRS      => 10,           // stop after 10 redirects
        );

        $ch      = curl_init( $url );
        curl_setopt_array( $ch, $options );
        $content = curl_exec( $ch );
        $err     = curl_errno( $ch );
        $errmsg  = curl_error( $ch );
        $header  = curl_getinfo( $ch );
        curl_close( $ch );

        $header['errno']   = $err;
        $header['errmsg']  = $errmsg;
        $header['content'] = $content;
        return $header;
    }

Example

// Read a web page and check for errors:

$result = get_web_page( $url );

if ( $result['errno'] != 0 ) {
    // error: bad url, timeout, redirect loop, ...
}

if ( $result['http_code'] != 200 ) {
    // error: no page, no permissions, no service, ...
}

$page = $result['content'];

For a realistic approach that emulates human behavior as closely as possible, you may want to add a referer to your cURL options. You may also want to add follow_location to your cURL options. Trust me, whoever said that cURLing Google results is impossible is a complete dolt and should throw his/her computer against the wall in hopes of never returning to the internetz again. Everything that you can do "IRL" with your own browser can be emulated using PHP cURL or libcurl in Python. You just need to do more cURLs to get buff. Then you will see what I mean. :)

  $url = "http://www.google.com/search?q=".$strSearch."&hl=en&start=0&sa=N";
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_REFERER, 'http://www.example.com/1');
  curl_setopt($ch, CURLOPT_HEADER, 0);
  curl_setopt($ch, CURLOPT_VERBOSE, 0);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible;)");
  curl_setopt($ch, CURLOPT_URL, urlencode($url));
  $response = curl_exec($ch);
  curl_close($ch);

Try this:

$url = "http://www.google.com/search?q=".$strSearch."&hl=en&start=0&sa=N";
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_HEADER, 0);
  curl_setopt($ch, CURLOPT_VERBOSE, 0);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible;)");
  curl_setopt($ch, CURLOPT_URL, urlencode($url));
  $response = curl_exec($ch);
  curl_close($ch);

I suppose you have noticed that your link is actually an HTTPS link... Your cURL options do not include any kind of SSL handling, so maybe that is your problem. Why don't you try a non-HTTPS link (e.g. a Google Custom Search Engine) to see what happens?
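If SSL really is the issue, a minimal sketch of the relevant options would look something like this (the cacert.pem path is just a placeholder, and $strSearch is assumed to hold your search term):

    $ch = curl_init('https://www.google.com/search?q=' . urlencode($strSearch));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    // Verify the server certificate. If PHP cannot find the system CA bundle,
    // point CURLOPT_CAINFO at a local cacert.pem (path below is an example).
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    // curl_setopt($ch, CURLOPT_CAINFO, '/path/to/cacert.pem');

    $html = curl_exec($ch);
    if ($html === false) {
        // curl_error() will tell you whether it is actually a certificate problem.
        echo 'cURL error: ' . curl_error($ch);
    }
    curl_close($ch);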

Get content with cURL in PHP

This requires the server to support the cURL functions; enable the extension in httpd.conf in the Apache folder.
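A quick sanity check (just a sketch) to confirm the extension is actually loaded before calling the function below:

    // Bail out early if the cURL extension is not available.
    if (!extension_loaded('curl') || !function_exists('curl_init')) {
        die('cURL is not enabled; enable the extension in your PHP/Apache configuration.');
    }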


function UrlOpener($url)
{
    global $output;
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the body instead of printing it directly
    $output = curl_exec($ch);
    curl_close($ch);
    echo $output;
}

If you want to get the content from the Google cache using cURL, you can use this URL: http://webcache.googleusercontent.com/search?q=cache: followed by your URL. Sample: http://urlopener.mixaz.net/
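For example (the target URL here is hypothetical), the cache URL can be built and fetched with the UrlOpener() function above:

    // Hypothetical page to look up in Google's cache.
    $target   = 'http://www.example.com/some-page';
    $cacheUrl = 'http://webcache.googleusercontent.com/search?q=cache:' . urlencode($target);

    // Fetch and print the cached copy using the helper defined above.
    UrlOpener($cacheUrl);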
