How to get Content-Length or CRC in the header of a web page?
I wrote a crawler for a specific dynamic website. A full crawl takes over 3 hours. I want to check whether a page has already been crawled or whether it has changed since then. If I can do this, the script will finish in a very short time.
For example:
foreach ($urls as $url) {
    if (thereAreChanges($url)) {
        crawl($url);
    }
}
Information: The web page doesn't provide Content-Length or a CRC.
Array ( [0] => HTTP/1.1 200 OK
[Date] => Tue, 08 Jan 2013 07:47:03 GMT
[Server] => Apache
[Set-Cookie] => Array (
[0] => PHPSESSID=eisb6qjme9b0ouoga9su9fgok4; path=/
[1] => j12011=a%3A3%3A%7Bs%3A3%3A%22sid%22%3Bs%3A26%3A%22eisb6qjme9b0ouoga9su9fgok4%22%3Bs%3A2%3A%22ip%22%3Bs%3A12%3A%2294.103.47.65%22%3Bs%3A4%3A%22time%22%3Bi%3A1357631223%3B%7D; expires=Sat, 09-Mar-2013 07:47:03 GMT; path=/
)
[Expires] => Thu, 19 Nov 1981 08:52:00 GMT
[Cache-Control] => no-store, no-cache, must-revalidate, post-check=0, pre-check=0
[Pragma] => no-cache
[Vary] => Accept-Encoding
[Connection] => close
[Content-Type] => text/html
)
The site provides Content-Type but doesn't provide Content-Length. How can I ask Apache for the Content-Length?
Update: http://urivalet.com/ can get the Content-Length. I need this.
If I could get a CRC of the page in the header, it would be perfect. But I guess that is a long shot.
In the function thereAreChanges($url) you can do the following:
Send the If-Modified-Since header with the time of your last visit to the page as its value. The server returns a 304 status code if the page has not been modified.
Check the Last-Modified header of the response and compare it with the last-modified date of your stored copy of the page. If the content is newer, fetch it.
Do a HEAD request with those headers. GET will give you all the content, but HEAD will just return the headers. For such a query only the headers are needed.
It's better to use an existing crawler and search engine framework than to write one.
Use Apache Nutch to crawl the web pages and Solr to search the indexed pages. Solr provides an HTTP interface that you can query from PHP. For more flexibility you can use Lucene.
Here is a tutorial on how to set up Nutch and Solr.
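The conditional-request check suggested above for thereAreChanges($url) could be sketched roughly as follows. This is only a sketch under assumptions: the extra $lastVisit parameter (a Unix timestamp of the previous crawl) and the helpers buildConditionalContext() and statusCode() are hypothetical names that are not in the original question, and PHP 7.1 or later is assumed so that get_headers() accepts a context. A server that ignores If-Modified-Since and sends no Last-Modified header, as the one in the question appears to do, will always be reported as changed.

```php
<?php
// Hypothetical helper: stream context for a HEAD request carrying
// If-Modified-Since set to the time of the previous crawl.
function buildConditionalContext(int $lastVisit)
{
    return stream_context_create([
        'http' => [
            'method' => 'HEAD',
            'header' => 'If-Modified-Since: ' . gmdate('D, d M Y H:i:s', $lastVisit) . ' GMT',
            'ignore_errors' => true, // do not treat 304 as a failure
        ],
    ]);
}

// Hypothetical helper: "HTTP/1.1 304 Not Modified" -> 304, or 0 if unparsable.
function statusCode(string $statusLine): int
{
    if (preg_match('#^HTTP/\S+\s+(\d{3})#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return 0;
}

// Assumption: $lastVisit is stored by the crawler after each successful crawl.
function thereAreChanges(string $url, int $lastVisit): bool
{
    $headers = @get_headers($url, true, buildConditionalContext($lastVisit));
    if ($headers === false) {
        return true; // request failed, so crawl again to be safe
    }
    if (statusCode((string) $headers[0]) === 304) {
        return false; // server says the page is unchanged
    }
    // Server ignored the conditional header: compare Last-Modified if present.
    $headers = array_change_key_case($headers, CASE_LOWER);
    if (isset($headers['last-modified'])) {
        $value = $headers['last-modified'];
        if (is_array($value)) {
            $value = end($value); // after redirects, use the final response
        }
        $modified = strtotime($value);
        return $modified === false || $modified > $lastVisit;
    }
    return true; // no usable information, assume the page changed
}
```

In the loop from the question this would be called as thereAreChanges($url, $lastVisit), with $lastVisit loaded from wherever the crawler keeps its per-URL state.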
The solution is to send this option with the request:
'header'=>"Accept-Encoding: gzip"
Without it the response has no Content-Length; with this header set, the page returns Content-Length.
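As a rough illustration of the Accept-Encoding option above, the following sketch passes it through a stream context to get_headers() and reads Content-Length from the result. The function names gzipContext(), contentLength() and fetchContentLength() are hypothetical, and PHP 7.1 or later is assumed. Note that with gzip the reported length is the size of the compressed body, so a different value suggests the page changed, while an identical value does not prove that it is unchanged.

```php
<?php
// Hypothetical helper: context that advertises gzip support to the server.
function gzipContext()
{
    return stream_context_create([
        'http' => [
            'header' => "Accept-Encoding: gzip",
        ],
    ]);
}

// Hypothetical helper: pull Content-Length out of an associative header
// array as returned by get_headers(); null when the header is absent.
function contentLength(array $headers): ?int
{
    $headers = array_change_key_case($headers, CASE_LOWER);
    if (!isset($headers['content-length'])) {
        return null;
    }
    $value = $headers['content-length'];
    if (is_array($value)) {
        $value = end($value); // after redirects, use the final response
    }
    return (int) $value;
}

// Hypothetical wrapper: returns the length, or null if the request fails
// or the server still sends no Content-Length.
function fetchContentLength(string $url): ?int
{
    $headers = @get_headers($url, true, gzipContext());
    return $headers === false ? null : contentLength($headers);
}
```

A crawler could store the value returned by fetchContentLength($url) next to each URL and compare it on the next run.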