
How to specify your own robots.txt rules for LWP::RobotUA

I wrote a script to check my own websites with LWP::RobotUA, and I would like to avoid the frequent requests it makes for my robots.txt.

The rules parameter for LWP::RobotUA should let me specify them, but I don't quite understand what should be passed to mean "allow all pages".

my $ua = LWP::RobotUA->new(agent=>'my-robot/0.1', from=>'me@foo.com', rules=> ??? );

LWP::RobotUA is LWP::UserAgent with robots.txt support added. If you do not want to request robots.txt at all, just use LWP::UserAgent instead.
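
For example, a minimal sketch assuming the agent and from values from the question carry over unchanged (the URL fetched is just a placeholder):

use strict;
use warnings;
use LWP::UserAgent;

# LWP::UserAgent accepts the same agent/from options but never
# requests or honours robots.txt.
my $ua = LWP::UserAgent->new(
    agent => 'my-robot/0.1',
    from  => 'me@foo.com',
);

my $res = $ua->get('http://example.com/');   # placeholder URL
print $res->status_line, "\n";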

Alternatively, subclass LWP::RobotUA and override the simple_request method to strip out the robots.txt and rule handling, as in the sketch below.
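
A minimal sketch of that approach, with a made-up package name My::UnrulyRobotUA, dispatching simple_request straight to LWP::UserAgent so the rules check is skipped:

{
    package My::UnrulyRobotUA;    # hypothetical name
    use parent 'LWP::RobotUA';

    # Bypass LWP::RobotUA's rules/robots.txt handling by calling
    # LWP::UserAgent's simple_request directly.
    sub simple_request {
        my $self = shift;
        return $self->LWP::UserAgent::simple_request(@_);
    }
}

my $ua = My::UnrulyRobotUA->new(agent => 'my-robot/0.1', from => 'me@foo.com');

Note that this skips not only the robots.txt fetch but also the polite per-host delay that LWP::RobotUA normally applies in simple_request, so the WWW::RobotRules subclass below may be preferable if you still want the delay.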

After more research, I think the intended way to supply robots rules is by subclassing WWW::RobotRules.

{
    package WWW::NoRules;
    use vars qw(@ISA);
    use WWW::RobotRules;
    @ISA = qw(WWW::RobotRules::InCore);

    # Claim every URL is allowed, so robots.txt is never consulted.
    sub allowed {
        return 1;
    }
}

my $ua = LWP::RobotUA->new(agent=>'my-robot/0.1', from=>'me@foo.com', rules=>WWW::NoRules->new);
