
Grabbing HTML From a Page That Has Blocked CURL

I have been asked to grab a certain line from a page, but it appears the site has blocked cURL requests.

The site in question is http://www.habbo.com/home/Intricat

I tried changing the user agent to see if they were blocking on that, but it didn't do the trick.

The code I am using is as follows:

<?php

$curl_handle=curl_init();
// Send a browser-like User-Agent
curl_setopt($curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0");
// This is the URL you would like the content grabbed from
curl_setopt($curl_handle, CURLOPT_URL, 'http://www.habbo.com/home/Intricat');
// Seconds to wait for the connection before giving up. Useful if the server you are requesting data from is down, so you can offer a "sorry" page instead.
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);

// Return the response as a string instead of printing it directly
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
$buffer = curl_exec($curl_handle);
// Release the cURL handle
curl_close($curl_handle);

// Change the message below as you wish; keep it inside the double quotes.
if (empty($buffer))
{
    print "Sorry, It seems our weather resources are currently unavailable, please check back later.";
}
else
{
    print $buffer;
}
?>

Any ideas on another way I can grab a line from that page if they've blocked cURL requests?

EDIT: On running curl -i from my server, it appears that the site is setting a cookie first?

Go in with your browser and copy the exact headers that are being sent. The site won't be able to tell that you are using curl, because the request will look exactly the same. If cookies are used, attach them as headers.
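A minimal sketch of that idea, assuming you have copied the header and cookie values from your browser's developer tools. The function names `build_browser_headers` and `fetch_like_browser` are hypothetical, and the header values are placeholders, not something the site is known to require.

```php
<?php
// Hypothetical helper: turns browser-like headers plus a name => value
// cookie map into the list of lines that CURLOPT_HTTPHEADER expects.
function build_browser_headers(array $cookies, array $extra = []): array
{
    // Placeholder values; replace them with what your own browser sends.
    $headers = array_merge([
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language' => 'en-US,en;q=0.5',
    ], $extra);

    $lines = [];
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }

    if ($cookies) {
        $pairs = [];
        foreach ($cookies as $name => $value) {
            // Values are sent exactly as given (already encoded by the browser).
            $pairs[] = $name . '=' . $value;
        }
        $lines[] = 'Cookie: ' . implode('; ', $pairs);
    }

    return $lines;
}

// Hypothetical wrapper: performs the request with those headers.
// Returns the body as a string, or false on failure.
function fetch_like_browser(string $url, array $cookies, string $userAgent)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_HTTPHEADER, build_browser_headers($cookies));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    return curl_exec($ch);
}
```

Calling `fetch_like_browser($url, ['name' => 'value'], 'Mozilla/5.0 ...')` with the cookie names and values from your browser session sends them in a single `Cookie:` header. Whether that is enough depends on what the site actually checks.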

This is a cut-and-paste from a Curl class I wrote quite a few years back; I hope you can pick some gems out of it for yourself.

function get_url($url)
{ 
    curl_setopt ($this->ch, CURLOPT_URL, $url); 
    curl_setopt ($this->ch, CURLOPT_USERAGENT, $this->user_agent);
    curl_setopt ($this->ch, CURLOPT_COOKIEFILE, $this->cookie_name);
    curl_setopt ($this->ch, CURLOPT_COOKIEJAR, $this->cookie_name);
    if(!is_null($this->referer))
    {
        curl_setopt ($this->ch, CURLOPT_REFERER, $this->referer);  
    }
    curl_setopt ($this->ch, CURLOPT_SSL_VERIFYHOST, 2);
    curl_setopt ($this->ch, CURLOPT_HEADER, 0); 
    if($this->follow)
    {
        curl_setopt ($this->ch, CURLOPT_FOLLOWLOCATION, 1);
    }
    else
    {
        curl_setopt ($this->ch, CURLOPT_FOLLOWLOCATION, 0);
    }
    curl_setopt ($this->ch, CURLOPT_RETURNTRANSFER, 1); 
    curl_setopt ($this->ch, CURLOPT_HTTPHEADER, array("Accept: text/html,text/vnd.wap.wml,*.*"));
    curl_setopt ($this->ch, CURLOPT_SSL_VERIFYPEER, FALSE);  // skips certificate verification so https works without a CA bundle (insecure)

    $try=0;
    $result="";
    while( ($try<=$this->retry_attempts) && (empty($result)) )  // keep trying while the result is empty: one attempt plus up to retry_attempts retries
    {
        $try++;
        $result = curl_exec($this->ch);
        $this->response=curl_getinfo($this->ch);
        // $this->response['http_code'] of 4xx means a client error
    }
    // set the referring URL to the current URL for the next page.
    if($this->referer_to_last) $this->set_referer($url);

    return $result; 
}

You are not very specific about the kind of block you're talking about. The website in question, http://www.habbo.com/home/Intricat, first checks whether the browser has JavaScript enabled:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    <meta http-equiv="Content-Script-Type" content="text/javascript">
    <script type="text/javascript">function setCookie(c_name, value, expiredays) {
        var exdate = new Date();
        exdate.setDate(exdate.getDate() + expiredays);
        document.cookie = c_name + "=" + escape(value) + ((expiredays == null) ? "" : ";expires=" + exdate.toGMTString()) + ";path=/";
    }
    function getHostUri() {
        var loc = document.location;
        return loc.toString();
    }
    setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '179.222.19.192', 10);
    setCookie('DOAReferrer', document.referrer, 10);
    location.href = getHostUri();</script>
</head>
<body>
<noscript>This site requires JavaScript and Cookies to be enabled. Please change your browser settings or upgrade your
    browser.
</noscript>
</body>
</html>

As curl has no JavaScript support, you either need to use an HTTP client that does, or you need to mimic that script and create the cookie and the new request yourself.
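Mimicking the script could look roughly like the sketch below. It assumes the challenge page keeps the shape shown above: `setCookie('name', 'value', days)` calls with quoted literal values, followed by a reload of the same URL. The function names `extract_challenge_cookies` and `fetch_with_challenge` are hypothetical, and it is not confirmed that the site accepts the result.

```php
<?php
// Hypothetical helper: collects setCookie('name', 'literal', days) calls from
// the challenge page. Calls whose value is not a quoted literal (such as
// document.referrer) do not match the pattern and are skipped.
function extract_challenge_cookies(string $html): array
{
    $cookies = [];
    $pattern = "/setCookie\(\s*'([^']+)'\s*,\s*'([^']*)'\s*,/";
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $cookies[$m[1]] = $m[2];
        }
    }
    return $cookies;
}

// Hypothetical wrapper: requests the page, and if it looks like the challenge,
// repeats the request with the cookies the script would have set.
// Returns the body as a string, or false on failure.
function fetch_with_challenge(string $url, string $userAgent = 'Mozilla/5.0')
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $first = curl_exec($ch);
    if ($first === false) {
        return false;
    }

    $cookies = extract_challenge_cookies($first);
    if (!$cookies) {
        return $first; // no challenge found, this is already the page
    }

    // The script also stores document.referrer, which is empty on a direct
    // visit. Assumption: values contain only characters that the script's
    // escape() call would leave unchanged, so they are sent as-is.
    $cookies += ['DOAReferrer' => ''];

    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    curl_setopt($ch, CURLOPT_COOKIE, implode('; ', $pairs));

    // The script reloads the same URL, so the second request repeats it.
    return curl_exec($ch);
}
```

Parsing the cookie out of the first response, rather than hard-coding it, matters here because the value in the sample page looks like the client's IP address and will differ per server.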

I know this is a very old post, but since I had to answer the same question for myself today, I'm sharing it here in case it is of use to people who come later. I'm also fully aware the OP asked for curl specifically, but, just like me, there could be people interested in a solution whether it uses curl or not.

The page I wanted to get blocked curl. If the block is not because of JavaScript but because of the user agent (that was my case, and setting the agent in curl didn't help), then wget could be a solution:

wget -O output.txt --no-check-certificate --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" "http://example.com/page"

You can use wget to access this content through the shell.

function wget($url){

    // get content with wget, since some sites block curl or file_get_contents requests

    // escapeshellarg() keeps a crafted URL from injecting shell commands
    $arg = escapeshellarg($url);
    $content=`wget -O - $arg`;

    return $content;
}
