How to specify own robots.txt rules for LWP::RobotUA
I wrote a script to check my own websites with LWP::RobotUA. I would like to avoid the frequent requests for my robots.txt.
The rules parameter for LWP::RobotUA should allow me to specify those rules, but I don't quite understand what should be passed for "allow all pages".
my $ua = LWP::RobotUA->new(agent=>'my-robot/0.1', from=>'me@foo.com', rules=> ??? );
LWP::RobotUA is LWP::UserAgent with robots.txt support added. If you do not want to request robots.txt, then just use LWP::UserAgent.
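For example, a plain user agent built with the same identification details from the question never touches robots.txt (the URL below is just a placeholder):

use LWP::UserAgent;

# LWP::UserAgent does not fetch robots.txt, so no extra requests are made.
my $ua = LWP::UserAgent->new(agent=>'my-robot/0.1', from=>'me@foo.com');
my $response = $ua->get('http://example.com/');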
Alternatively, subclass LWP::RobotUA, override the simple_request method, and strip out the robots.txt and rule handling, as in the sketch below.
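A minimal sketch of that approach, using a hypothetical subclass name My::NoRobotsUA, might look like this; it delegates straight to LWP::UserAgent's simple_request, so the robots.txt lookup and the crawl delay in LWP::RobotUA are skipped:

{
    package My::NoRobotsUA;
    use parent 'LWP::RobotUA';

    # Bypass LWP::RobotUA's rule check and wait logic by calling
    # LWP::UserAgent's simple_request directly.
    sub simple_request {
        my $self = shift;
        return $self->LWP::UserAgent::simple_request(@_);
    }
}

my $ua = My::NoRobotsUA->new(agent=>'my-robot/0.1', from=>'me@foo.com');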
After more research, I think the intended way to supply robots rules is by subclassing WWW::RobotRules:
{
    package WWW::NoRules;
    use vars qw(@ISA);
    use WWW::RobotRules;
    @ISA = qw(WWW::RobotRules::InCore);

    # Report every URL as allowed, so robots.txt is never consulted.
    sub allowed {
        return 1;
    }
}
my $ua = LWP::RobotUA->new(agent=>'my-robot/0.1', from=>'me@foo.com', rules=>WWW::NoRules->new);