How to get Content-Length or CRC in the headers of a web page?

I wrote a crawler for a specific dynamic website. A full crawl takes over 3 hours. I want to check whether a page has already been crawled and whether it has changed since then. If I can do this, the script will finish in a very short time.

For example:

    foreach ($urls as $url) {
        if(thereAreChanges($url)){
            crawl($url);
        }
    }

Information: The web page doesn't provide Content-Length or a CRC.

Array ( [0] => HTTP/1.1 200 OK 
        [Date] => Tue, 08 Jan 2013 07:47:03 GMT 
        [Server] => Apache 
        [Set-Cookie] => Array ( 
                [0] => PHPSESSID=eisb6qjme9b0ouoga9su9fgok4; path=/  
                [1] => j12011=a%3A3%3A%7Bs%3A3%3A%22sid%22%3Bs%3A26%3A%22eisb6qjme9b0ouoga9su9fgok4%22%3Bs%3A2%3A%22ip%22%3Bs%3A12%3A%2294.103.47.65%22%3Bs%3A4%3A%22time%22%3Bi%3A1357631223%3B%7D; expires=Sat, 09-Mar-2013 07:47:03 GMT; path=/  
        ) 
        [Expires] => Thu, 19 Nov 1981 08:52:00 GMT 
        [Cache-Control] => no-store, no-cache, must-revalidate, post-check=0, pre-check=0 
        [Pragma] => no-cache 
        [Vary] => Accept-Encoding 
        [Connection] => close 
        [Content-Type] => text/html 
)

The site provides Content-Type but doesn't provide Content-Length. How can I ask Apache for the Content-Length?

Update: http://urivalet.com/ can get the Content-Length. I need this.

If I could get a CRC of the page in the headers, it would be perfect. But I guess this is a long shot.

In the function thereAreChanges($url) you can do the following:

  1. When you send a request, include an If-Modified-Since header with the time of your last visit to the page as its value. The server will return a 304 status code if the page has not been modified since then.
  2. Check the Last-Modified header of the response and compare it with the last-modified date of your stored copy. If the content is newer, fetch it.
  3. If possible, perform a HEAD request with those headers. GET will give you all the content, but HEAD will just return the headers. For such a check, only the headers are needed.
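
The three steps above could be combined roughly as follows. This is only a sketch under some assumptions: the extra `$lastCrawl` parameter (a Unix timestamp of the previous crawl) and the helpers `finalStatusCode()` and `hasChanged()` are hypothetical names that do not appear in the question, and passing a context to `get_headers()` requires PHP 7.1 or later. For a page that sends neither a 304 nor a Last-Modified header, like the response shown in the question, the function falls back to "changed".

```php
<?php
// Return the status code of the last status line in a get_headers() result.
// Redirects produce several status lines, all stored under integer keys.
function finalStatusCode(array $headers): int
{
    $code = 0;
    foreach ($headers as $key => $value) {
        if (is_int($key) && is_string($value)
            && preg_match('#^HTTP/\S+\s+(\d{3})#', $value, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Decide from the response headers whether the page changed since $lastCrawl.
function hasChanged(array $headers, int $lastCrawl): bool
{
    if (finalStatusCode($headers) === 304) {
        return false; // the server says: not modified
    }
    // Header names are case-insensitive, so normalise the keys first.
    $lower = array_change_key_case($headers, CASE_LOWER);
    if (isset($lower['last-modified'])) {
        $value = $lower['last-modified'];
        if (is_array($value)) {
            $value = end($value); // after redirects, use the last response
        }
        $modified = strtotime($value);
        if ($modified !== false) {
            return $modified > $lastCrawl;
        }
    }
    return true; // no usable validator: assume the page changed
}

// Hypothetical two-argument variant of thereAreChanges() from the question.
function thereAreChanges(string $url, int $lastCrawl): bool
{
    $context = stream_context_create(['http' => [
        'method'  => 'HEAD', // headers only, no body
        'header'  => 'If-Modified-Since: '
                     . gmdate('D, d M Y H:i:s', $lastCrawl) . " GMT\r\n",
        'timeout' => 10,
    ]]);
    $headers = @get_headers($url, true, $context);
    // If the request fails, crawl the page anyway.
    return $headers === false ? true : hasChanged($headers, $lastCrawl);
}
```

The header parsing is kept separate from the network call so that the decision logic can be checked against stored header arrays without contacting the site.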

It's better to use an existing crawler and search engine framework than to write one.

Use Apache Nutch to crawl web pages and Solr to search the indexed pages. Solr provides an HTTP interface that you can query from PHP. For more flexibility you can use Lucene.

Here is a tutorial on how to set up Nutch and Solr.
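
As a rough illustration of querying Solr over HTTP from PHP, the sketch below builds a request URL for Solr's `select` handler. The base URL, the core name `pages`, and the function name `solrSelectUrl()` are assumptions made for the example; the actual path depends on how Solr is installed.

```php
<?php
// Build a URL for Solr's select handler.
// $baseUrl is assumed to point at a core, e.g. http://localhost:8983/solr/pages
function solrSelectUrl(string $baseUrl, string $query, int $rows = 10): string
{
    $params = [
        'q'    => $query, // Solr query string, e.g. "title:crawler"
        'rows' => $rows,  // maximum number of results to return
        'wt'   => 'json', // ask for a JSON response
    ];
    return rtrim($baseUrl, '/') . '/select?' . http_build_query($params, '', '&');
}

// Usage (performs a network request, so it is left commented out):
// $json   = file_get_contents(solrSelectUrl('http://localhost:8983/solr/pages', 'title:crawler', 5));
// $result = json_decode($json, true);
```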

The solution is 'header'=>"Accept-Encoding: gzip"

Without this request header the response did not include Content-Length; with it, the page returns Content-Length.
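
The fragment above looks like an option for a PHP stream context, so a fuller version might look like the sketch below. The function names `gzipContext()` and `contentLength()` are hypothetical, and passing the context directly to `get_headers()` assumes PHP 7.1 or later. Note that the reported length would be that of the compressed body, so it can serve as a cheap change signal but not as proof that a page is unchanged.

```php
<?php
// Stream context that advertises gzip support to the server.
function gzipContext()
{
    return stream_context_create(['http' => [
        'header' => "Accept-Encoding: gzip\r\n",
    ]]);
}

// Extract Content-Length from a get_headers() result, or null if it is absent.
function contentLength(array $headers): ?int
{
    // Header names are case-insensitive, so normalise the keys first.
    $lower = array_change_key_case($headers, CASE_LOWER);
    if (!isset($lower['content-length'])) {
        return null;
    }
    $value = $lower['content-length'];
    if (is_array($value)) {
        $value = end($value); // after redirects, use the last response
    }
    $value = trim($value);
    return ctype_digit($value) ? (int) $value : null;
}

// Usage (performs a network request, so it is left commented out):
// $headers = get_headers($url, true, gzipContext());
// $length  = $headers === false ? null : contentLength($headers);
```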
