
php curl returns 400 Bad Request if done in a loop

I'm trying to do a screen scrape using the cURL library.

I managed to scrape a few URLs (5-10) successfully.

However, whenever I run it in a for loop over a larger batch (10-20) of URLs, it reaches a point where the last few URLs return "HTTP/1.1 400 Bad Request":

    Your browser sent a request that this server could not understand.
    The number of request header fields exceeds this server's limit.

I'm pretty sure the URLs are correct and properly trimmed, and that each request's headers should be the same length. If I move these last few URLs to the top of the list, they go through, but the URLs now at the end of the list get the 400 Bad Request error instead. What could be the problem? What could be the cause?

Any advice?

The code looks something like this:

for ($i = 0; $i < sizeof($url); $i++)
    $data[$i] = $this->get($url[$i]);



function get($url) {

        $this->headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
        $this->headers[] = 'Connection: Keep-Alive';
        $this->headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';
        $this->user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12 (.NET CLR 3.5.30729)';

set_time_limit(EXECUTION_TIME_LIMIT);
        $default_exec_time = ini_get('max_execution_time');

        $this->redirectcount = 0;
        $process = curl_init($url);
        curl_setopt($process, CURLOPT_HTTPHEADER, $this->headers);
        curl_setopt($process, CURLOPT_HEADER, 1);
        curl_setopt($process, CURLOPT_USERAGENT, $this->user_agent);
        if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEFILE, $this->cookie_file);
        if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEJAR, $this->cookie_file);

        //off compression for debugging's sake
        //curl_setopt($process,CURLOPT_ENCODING , $this->compression);

        curl_setopt($process, CURLOPT_TIMEOUT, 180);
        if ($this->proxy) curl_setopt($process, CURLOPT_PROXY, $this->proxy);
        if ($this->proxyauth){ 
            curl_setopt($process, CURLOPT_HTTPPROXYTUNNEL, 1); 
            curl_setopt($process, CURLOPT_PROXYUSERPWD, $this->proxyauth);  
         }
        curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($process, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($process,CURLOPT_MAXREDIRS,10); 

        //added
        //curl_setopt($process, CURLOPT_AUTOREFERER, 1);
        curl_setopt($process,CURLOPT_VERBOSE,TRUE);
        if ($this->referrer) curl_setopt($process,CURLOPT_REFERER,$this->referrer);

        if($this->cookies){
            foreach($this->cookies as $cookie){
                curl_setopt ($process, CURLOPT_COOKIE, $cookie);
                //echo $cookie; 
            }
        }

        $return = $this->redirect_exec($process);//curl_exec($process) or curl_error($process);
        curl_close($process);
        set_time_limit($default_exec_time);//setback to default

        return $return;
    }

    function redirect_exec($ch, $curlopt_header = false) {

        //curl_setopt($ch, CURLOPT_HEADER, true);
        //curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $data = curl_exec($ch);
        $file = fopen(DP_SCRAPE_DATA_CURL_DIR.$this->redirectcount.".html", "w");
        fwrite($file, $data);
        fclose($file);

        $info = curl_getinfo($ch);
        print_r($info);
        echo "<br>";

        $http_code = $info['http_code'];
        if ($http_code == 301 || $http_code == 302 || $http_code == 303) {
            //list($header) = explode("\r\n\r\n", $data);
            //print_r($header);
            $matches = array();
            //print_r($data);
            //Check if the response has a Location to redirect to
            preg_match('/(Location:|URI:)(.*?)\n/', $data, $matches);
            $url = trim(array_pop($matches));
            //print_r($url);
            $url_parsed = parse_url($url);
            //print_r($url_parsed);
            if (isset($url_parsed['path']) && isset($url) && !empty($url)) {
                //echo "<br>".$url;
                curl_setopt($ch, CURLOPT_URL, MY_HOST.$url);
                //echo "<br>".$url;
                $this->redirectcount++;
                return $this->redirect_exec($ch);
                //return $this->get(MY_HOST.$url);
            }
        } elseif ($http_code == 200) {
            $matches = array();
            // (pattern truncated in the original post; it matched a redirect
            // URL in the body of the 200 response)
            preg_match('/(/i', $data, $matches);
            //print_r($matches);
            $url = trim(array_pop($matches));
            //print_r($url);
            $url_parsed = parse_url($url);
            //print_r($url_parsed);
            if (isset($url_parsed['path']) && isset($url) && !empty($url)) {
                curl_setopt($ch, CURLOPT_URL, $url);
                //echo "<br>".$url;
                $this->redirectcount++;
                sleep(SLEEP_INTERVAL);
                return $this->redirect_exec($ch);
                //return $this->get($url);
            }
        }
        //echo "data ".$data;
        $this->redirectcount++;
        return $data; // $info['url'];
    }

where $url is an array of URLs, each carrying the query string for a GET request.

I realised from curl_getinfo that [request_size] keeps growing, which it shouldn't; it should stay about the same size. How can I print/echo my HTTP request information to debug this?

Your problem with the multiplying headers is at the top of the get method:

$this->headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
$this->headers[] = 'Connection: Keep-Alive';
$this->headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';

On each iteration you are adding the same headers to the headers array of the object instance. (Writing $array[] appends to the array.) You need to either reset the array on each iteration or move the header setup into another method.

If headers is always and only set in the get method, you can change it to this to fix the problem:

$this->headers = array(
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg',
    'Connection: Keep-Alive',
    'Content-type: application/x-www-form-urlencoded;charset=UTF-8'
);

...but if the headers are always the same and never change between iterations, you might as well set the headers value in the object constructor and only read it in the get method, since resetting the array to the same value every time is redundant.
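For example, a minimal sketch of the constructor approach (the class name Scraper and the headerCount helper are assumptions for illustration; only the relevant pieces of the original class are shown):

```php
<?php

class Scraper {
    private $headers;

    public function __construct() {
        // Build the request headers once; get() only reads them,
        // so repeated calls can no longer grow the array.
        $this->headers = array(
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Connection: Keep-Alive',
            'Content-type: application/x-www-form-urlencoded;charset=UTF-8',
        );
    }

    public function get($url) {
        $ch = curl_init($url);
        // Read-only use of the array built in the constructor.
        curl_setopt($ch, CURLOPT_HTTPHEADER, $this->headers);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $data = curl_exec($ch);
        curl_close($ch);
        return $data;
    }

    // Helper for checking the header list does not grow (illustration only).
    public function headerCount() {
        return count($this->headers);
    }
}
```

With this layout the header list stays at three entries no matter how many times get() runs in the loop.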

Setting CURLINFO_HEADER_OUT to true, I was able to retrieve the request headers that were sent.
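A minimal sketch of that debugging technique (the function name and URL are placeholders; the option must be set before curl_exec):

```php
<?php

// Sketch: capture the exact request headers cURL sent, to spot
// duplicated Accept/Connection/Content-type lines.
function fetch_with_request_debug($url) {
    $ch = curl_init($url);
    // Must be enabled BEFORE curl_exec(): tells cURL to keep a copy
    // of the outgoing request headers.
    curl_setopt($ch, CURLINFO_HEADER_OUT, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);

    // The raw request headers, exactly as sent on the wire.
    $sent = curl_getinfo($ch, CURLINFO_HEADER_OUT);
    curl_close($ch);

    echo $sent;
    return $body;
}
```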

Indeed, the request headers accumulate more and more entries.

In particular, these headers keep repeating:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg
Connection: Keep-Alive
Content-type: application/x-www-form-urlencoded;charset=UTF-8
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg
Connection: Keep-Alive
Content-type: application/x-www-form-urlencoded;charset=UTF-8
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg
Connection: Keep-Alive
Content-type: application/x-www-form-urlencoded;charset=UTF-8
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg
Connection: Keep-Alive
Content-type: application/x-www-form-urlencoded;charset=UTF-8
