How to specify own robots.txt rules for LWP::RobotUA
I wrote a script to check my own websites with LWP::RobotUA. I would like to avoid the frequent requests for my robots.txt.
The rules parameter for LWP::RobotUA should allow me to specify those rules, but I don't quite understand what should be passed for "allow all pages".
my $ua = LWP::RobotUA->new(agent=>'my-robot/0.1', from=>'me@foo.com', rules=> ??? );
LWP::RobotUA is LWP::UserAgent with robots.txt support added. If you do not want to request robots.txt, then just use LWP::UserAgent.
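For example, a plain user agent built with the same identification details from the question never touches robots.txt (the URL below is just a placeholder):

use LWP::UserAgent;

# LWP::UserAgent does not fetch robots.txt, so no extra requests are made.
my $ua = LWP::UserAgent->new(agent=>'my-robot/0.1', from=>'me@foo.com');
my $response = $ua->get('http://example.com/');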
Alternatively, subclass LWP::RobotUA, override the simple_request method, and strip out the robots.txt and rule handling, as in the sketch below.
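A minimal sketch of that approach, using a hypothetical subclass name My::NoRobotsUA, might look like this; it delegates straight to LWP::UserAgent's simple_request, so the robots.txt lookup and the crawl delay in LWP::RobotUA are skipped:

{
    package My::NoRobotsUA;
    use parent 'LWP::RobotUA';

    # Bypass LWP::RobotUA's rule check and wait logic by calling
    # LWP::UserAgent's simple_request directly.
    sub simple_request {
        my $self = shift;
        return $self->LWP::UserAgent::simple_request(@_);
    }
}

my $ua = My::NoRobotsUA->new(agent=>'my-robot/0.1', from=>'me@foo.com');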
After more research, I think the intended way to supply robots rules is by subclassing WWW::RobotRules:
{
    package WWW::NoRules;
    use vars qw(@ISA);
    use WWW::RobotRules;
    @ISA = qw(WWW::RobotRules::InCore);

    # Report every URL as allowed, so robots.txt is never consulted.
    sub allowed {
        return 1;
    }
}
my $ua = LWP::RobotUA->new(agent=>'my-robot/0.1', from=>'me@foo.com', rules=>WWW::NoRules->new);