
How to get Content-Length or CRC in header of a web page?

I wrote a crawler for a specific dynamic website. A full crawl takes over 3 hours. I want to check whether a page has already been crawled or has changed since the last crawl. If I can do that, the script will finish in a much shorter time.

For example:

    foreach ($urls as $url) {
        if(thereAreChanges($url)){
            crawl($url);
        }
    }

Note: the web page doesn't send Content-Length or any CRC header. These are the response headers:

Array ( [0] => HTTP/1.1 200 OK 
        [Date] => Tue, 08 Jan 2013 07:47:03 GMT 
        [Server] => Apache 
        [Set-Cookie] => Array ( 
                [0] => PHPSESSID=eisb6qjme9b0ouoga9su9fgok4; path=/  
                [1] => j12011=a%3A3%3A%7Bs%3A3%3A%22sid%22%3Bs%3A26%3A%22eisb6qjme9b0ouoga9su9fgok4%22%3Bs%3A2%3A%22ip%22%3Bs%3A12%3A%2294.103.47.65%22%3Bs%3A4%3A%22time%22%3Bi%3A1357631223%3B%7D; expires=Sat, 09-Mar-2013 07:47:03 GMT; path=/  
        ) 
        [Expires] => Thu, 19 Nov 1981 08:52:00 GMT 
        [Cache-Control] => no-store, no-cache, must-revalidate, post-check=0, pre-check=0 
        [Pragma] => no-cache 
        [Vary] => Accept-Encoding 
        [Connection] => close 
        [Content-Type] => text/html 
)

The site sends Content-Type but doesn't send Content-Length. How can I ask Apache for the Content-Length?

Update: http://urivalet.com/ can get the Content-Length for this page. That is what I need.

If I could get a CRC of the page in a header, that would be perfect, but I guess that is a long shot.

In the function thereAreChanges($url) you can do the following:

  1. When you send a request, include an If-Modified-Since header set to the time of your last visit to the page. A server that honors conditional requests responds with status 304 (Not Modified) if the page has not changed since then.
  2. Check the Last-Modified header of the response and compare it with the last-modified date of your stored copy. If the content is newer, fetch it.
  3. If possible, perform a HEAD request with those headers. GET returns the full content, while HEAD returns only the headers, and only the headers are needed for this check.
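
The steps above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the second parameter `$lastCrawl` (a Unix timestamp of the previous crawl) and the helpers `conditionalHeadOptions()` and `statusCode()` are hypothetical names introduced for illustration, and it assumes PHP 7.1 or later, where `get_headers()` accepts a stream context. It also assumes the server actually honors If-Modified-Since or sends Last-Modified; when neither is available, the function falls back to reporting a change.

```php
<?php
// Hypothetical helper: stream-context options for a conditional HEAD request.
function conditionalHeadOptions(int $lastCrawl): array
{
    return [
        'http' => [
            'method'        => 'HEAD', // headers only, no body
            'header'        => 'If-Modified-Since: '
                . gmdate('D, d M Y H:i:s', $lastCrawl) . ' GMT',
            'ignore_errors' => true,   // still return headers on 3xx/4xx/5xx
            'timeout'       => 10,
        ],
    ];
}

// Hypothetical helper: "HTTP/1.1 304 Not Modified" -> 304, or 0 if unparsable.
function statusCode(string $statusLine): int
{
    return preg_match('#^HTTP/\S+\s+(\d{3})#', $statusLine, $m) ? (int) $m[1] : 0;
}

// $lastCrawl is an assumption: the time of the previous crawl of $url.
function thereAreChanges(string $url, int $lastCrawl): bool
{
    $context = stream_context_create(conditionalHeadOptions($lastCrawl));
    $headers = @get_headers($url, true, $context);
    if ($headers === false) {
        return true; // could not check, so crawl to be safe
    }

    // Step 1: the server says nothing changed.
    // (If the URL redirects, index 0 is the redirect's status line.)
    if (statusCode($headers[0]) === 304) {
        return false;
    }

    // Step 2: compare Last-Modified with the stored crawl time, if present.
    $headers = array_change_key_case($headers, CASE_LOWER);
    if (isset($headers['last-modified']) && is_string($headers['last-modified'])) {
        $modified = strtotime($headers['last-modified']);
        if ($modified !== false) {
            return $modified > $lastCrawl;
        }
    }

    return true; // no usable validator in the response
}
```

With this sketch, the loop from the question would call `thereAreChanges($url, $lastCrawl)` with a timestamp stored per URL after each successful crawl.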

It's better to use an existing crawler and search engine framework than to write your own.

Use Apache Nutch to crawl web pages and Solr to search the indexed pages. Solr provides an HTTP interface that you can query from PHP. For more flexibility you can use Lucene.

Here is a tutorial on how to set up Nutch and Solr.

The solution is to add 'header'=>"Accept-Encoding: gzip" to the request options.

Without this request header the response doesn't include Content-Length; with it, the page returns Content-Length.
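
A minimal sketch of how that option might be used, assuming the headers are fetched with `get_headers()` and a stream context (PHP 7.1 or later). The function names `contentLengthOf()` and `fetchContentLength()` are hypothetical, introduced only for this example.

```php
<?php
// Hypothetical helper: pulls Content-Length out of a get_headers() result.
// Returns null when the header is missing or not numeric.
function contentLengthOf(array $headers): ?int
{
    $headers = array_change_key_case($headers, CASE_LOWER);
    $value = $headers['content-length'] ?? null;
    if (is_array($value)) {
        $value = end($value); // after redirects there is one value per response
    }
    return is_numeric($value) ? (int) $value : null;
}

// Hypothetical wrapper: requests the page's headers while advertising gzip support.
function fetchContentLength(string $url): ?int
{
    $context = stream_context_create([
        'http' => [
            'header' => "Accept-Encoding: gzip",
        ],
    ]);
    $headers = @get_headers($url, true, $context);
    return $headers === false ? null : contentLengthOf($headers);
}
```

Note that the value reported this way is the size of the gzip-compressed body, so it is best treated as a hint that the page changed rather than as the exact page size.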
