
Tracking page headers and redirects with php-libcurl

I am writing a script to track headers, especially redirects and cookies, for a URL. Often when I open a URL it redirects to another URL, sometimes through more than one, and stores some cookies along the way. But when I ran the script with the URL

http://en.wikipedia.org/

my script showed only one redirect and did not store any cookies. When I browsed the URL in Firefox, however, it saved cookies, and when I inspected it with Live HTTP Headers it showed multiple GET requests. Live HTTP Headers also shows that there are Set-Cookie headers.

<?php

$url="http://en.wikipedia.org/";
$userAgent="Mozilla/5.0 (Windows NT 5.1; rv:2.0)Gecko/20100101 Firefox/4.0";
$encoding="gzip, deflate";
//CURLOPT_HTTPHEADER expects a list of complete "Name: value" strings
$header=array(
    "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language: en-us,en;q=0.5",
    "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
    "Connection: keep-alive",
    "Keep-Alive: 115"
);
$i=1;
$flag=1;        //0 when there is no redirect, i.e. no Location header to follow; controls the while loop below

while($flag!=0) {
    $ch=curl_init();
    curl_setopt($ch,CURLOPT_URL,$url);
    curl_setopt($ch,CURLOPT_USERAGENT,$userAgent);
    curl_setopt($ch,CURLOPT_ENCODING,$encoding);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    curl_setopt($ch,CURLOPT_FOLLOWLOCATION,0);
    curl_setopt($ch,CURLOPT_HEADER,1);
    curl_setopt($ch,CURLOPT_NOBODY,1);
    curl_setopt($ch,CURLOPT_AUTOREFERER,true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, dirname(__FILE__) . "/cookie.txt");
    curl_setopt($ch, CURLOPT_COOKIEFILE, dirname(__FILE__) . "/cookie.txt");
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
    $pageHeader[$i]=curl_exec($ch);
    curl_close($ch);
    $flag=preg_match('/^Location:[ \t]*(\S+)/mi',$pageHeader[$i],$location[$i]);   //\S+ keeps the trailing "\r" out of the captured URL
    if($flag==1) {      //if there is a location header    
        if(preg_match('@^(https?://|www\.)@i',$location[$i][1],$tempurl)==1) {      //if it is an absolute url
            $url=$location[$i][1];
        } else {
            if(preg_match('@^/(.*)@',$location[$i][1],$tempurl)==1) {   //if the url corresponds to url relative to server's root
                preg_match('@^((https?://)|(www\.))[^/]+@i',$url,$domain);
                $url=$domain[0].$tempurl[0];    //$domain is the match array; element 0 holds the scheme and host
            } else {        //if the url is relative to current directory
                $url=preg_replace('@(/[^/]+)$@',"/".$location[$i][1],$url);
            }
        }
        $location[$i]=$url;
        preg_match('/Set-Cookie: (.*)\s/',$pageHeader[$i],$cookie[$i]);
        $i++;
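    }
}       //end of the while loop

$loc="";
foreach($location as $l)
    if(is_string($l))       //the final request has no Location header and leaves an empty match array behind
        $loc=$loc.$l."\n";

$header=implode("\n\n\n",$pageHeader);
file_put_contents(dirname(__FILE__) . "/location.txt",$loc);
file_put_contents(dirname(__FILE__) . "/header.txt",$header);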
?>

Here the files location.txt and header.txt are created, but cookie.txt is not. If I change the URL to google.com, location.txt shows the redirect to google.co.in and a cookie is saved in cookie.txt. But when I open google.com in Firefox it saves three cookies. What could be wrong? I think some JavaScript on the page is setting the cookies, so curl cannot get them. Any suggestions for improving the above code are also welcome.
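One detail that may be worth checking when comparing cookie counts: preg_match() stops at the first match, so the script records at most one Set-Cookie header per response, and only for responses that also carry a Location header. The sketch below shows one way to collect every Set-Cookie value from a raw header block. The function name extractCookies and the sample headers are made up for illustration, and this would not by itself explain a missing cookie.txt or cookies that are set by JavaScript.

```php
<?php
// Hypothetical helper (not part of the original script): collect every
// Set-Cookie value from a raw header block such as the one curl_exec()
// returns when CURLOPT_HEADER is enabled.
function extractCookies(string $rawHeaders): array
{
    // m: ^ matches at the start of each line; i: header names are case-insensitive.
    // [^\r\n]* stops before the line ending, so no trailing "\r" is captured.
    preg_match_all('/^Set-Cookie:[ \t]*([^\r\n]*)/mi', $rawHeaders, $matches);
    return $matches[1];
}

// Made-up sample response headers, for illustration only.
$sample = "HTTP/1.1 302 Found\r\n"
        . "Location: /next\r\n"
        . "Set-Cookie: a=1; path=/\r\n"
        . "Set-Cookie: b=2; path=/\r\n\r\n";

print_r(extractCookies($sample));
```

In the loop above, a call like this could replace the single preg_match() on Set-Cookie, assuming all cookies of every response are of interest.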

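Your Location: following code is completely broken. As you should have seen, most HTTP redirects are relative, so you can't just use that string as the URL in the subsequent request.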
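As a rough illustration of that point, here is a minimal sketch of resolving a Location value against the URL that produced it. The function name resolveLocation is an assumption made for this example, not something from the question. It covers only the common cases (absolute, scheme-relative, root-relative and directory-relative) and does not normalize ".." segments or handle query-only references.

```php
<?php
// Hypothetical helper (not from the original post): turn a Location header
// value into an absolute URL, using the URL of the request that returned it.
function resolveLocation(string $base, string $location): string
{
    // Already absolute: use it unchanged.
    if (preg_match('@^https?://@i', $location)) {
        return $location;
    }

    $p = parse_url($base);
    $origin = $p['scheme'] . '://' . $p['host']
            . (isset($p['port']) ? ':' . $p['port'] : '');

    // Scheme-relative ("//host/path"): reuse the scheme of the base URL.
    if (strpos($location, '//') === 0) {
        return $p['scheme'] . ':' . $location;
    }

    // Root-relative ("/path"): keep scheme and host, replace the whole path.
    if (strpos($location, '/') === 0) {
        return $origin . $location;
    }

    // Relative to the current directory: keep the base path up to and
    // including its last slash, then append the new reference.
    $path = isset($p['path']) ? $p['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $location;
}

echo resolveLocation('http://en.wikipedia.org/', '/wiki/Main_Page'), "\n";
```

The three-way if/else on $location in the question's loop could then collapse into a single call such as $url = resolveLocation($url, $location[$i][1]).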
