
Check if a URL is blocked by robots.txt using Perl

Can anybody show me sample code to check whether a URL has been blocked by robots.txt? We can specify a full URL or a directory in robots.txt. Is there any helper function for this in Perl?

Check out WWW::RobotRules:

   The following methods are provided:

   $rules = WWW::RobotRules->new($robot_name)
  This is the constructor for WWW::RobotRules objects.  The first
  argument given to new() is the name of the robot.

   $rules->parse($robot_txt_url, $content, $fresh_until)
  The parse() method takes as arguments the URL that was used to
  retrieve the /robots.txt file, and the contents of the file.

   $rules->allowed($uri)
  Returns TRUE if this robot is allowed to retrieve this URL.

WWW::RobotRules is the standard class for parsing robots.txt files and then checking URLs to see if they're blocked.
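For example, here is a minimal sketch of how those methods fit together (the robot name and the example.com URLs are placeholders, and the robots.txt is fetched with LWP::Simple here):

    use strict;
    use warnings;
    use WWW::RobotRules;
    use LWP::Simple qw(get);

    # Name your robot; this is matched against User-agent lines in robots.txt.
    my $rules = WWW::RobotRules->new('MyBot/1.0');

    # Fetch and parse the robots.txt of the site you want to crawl.
    my $robots_url = 'http://example.com/robots.txt';
    my $robots_txt = get($robots_url);
    $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

    # Now check any URL on that host.
    my $url = 'http://example.com/cgi-bin/somwhatelse.pl';
    print $rules->allowed($url)
        ? "allowed: $url\n"
        : "blocked by robots.txt: $url\n";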

You may also be interested in LWP::RobotUA, which integrates that into LWP::UserAgent, automatically fetching and checking robots.txt files as needed.

Load the robots.txt file and search for "Disallow:" lines. Then check whether the pattern following "Disallow:" appears in your URL. If it does, the URL is banned by robots.txt.

Example - you find the following line in the robots.txt:

Disallow: /cgi-bin/

Now remove the "Disallow: " prefix and check whether "/cgi-bin/" (the remaining part) appears directly after the host name, i.e. at the start of the URL's path.

If your URL looks like:

www.stackoverflow.com/cgi-bin/somwhatelse.pl

it is banned.

If your URL looks like:

www.stackoverflow.com/somwhatelse.pl

it is OK. You will find the complete set of rules at http://www.robotstxt.org/. This is the way to go if you cannot install additional modules for any reason.
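A rough sketch of that manual approach, assuming the robots.txt content is already in a string and ignoring User-agent grouping, Allow lines, and wildcards (the is_blocked helper is made up for this example):

    use strict;
    use warnings;

    # Return true if $path is blocked by any "Disallow:" prefix in $robots_txt.
    # Simplified: treats every Disallow line as a path prefix for all agents.
    sub is_blocked {
        my ($robots_txt, $path) = @_;
        for my $line (split /\n/, $robots_txt) {
            if ($line =~ /^\s*Disallow\s*:\s*(\S+)/i) {
                my $prefix = $1;
                return 1 if index($path, $prefix) == 0;  # path starts with the prefix
            }
        }
        return 0;
    }

    my $robots_txt = "User-agent: *\nDisallow: /cgi-bin/\n";
    print is_blocked($robots_txt, '/cgi-bin/somwhatelse.pl') ? "banned\n" : "ok\n";  # banned
    print is_blocked($robots_txt, '/somwhatelse.pl')         ? "banned\n" : "ok\n";  # ok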

It would be better to use a module from CPAN: there is a great module on CPAN that I use to deal with this: LWP::RobotUA. LWP (libwww) is, imho, the standard for web access in Perl; this module is part of it and ensures your robot behaves nicely.
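A minimal sketch of using it (the agent name and contact address are placeholders); when a URL is disallowed by robots.txt, LWP::RobotUA returns a 403 response instead of fetching the page:

    use strict;
    use warnings;
    use LWP::RobotUA;

    # Identify your robot and give a contact address.
    my $ua = LWP::RobotUA->new('MyBot/1.0', 'me@example.com');
    $ua->delay(1);   # wait at least 1 minute between requests to the same host

    my $response = $ua->get('http://example.com/cgi-bin/somwhatelse.pl');
    if ($response->code == 403 && $response->message =~ /robots\.txt/i) {
        print "Blocked by robots.txt\n";
    }
    else {
        print "Got: ", $response->status_line, "\n";
    }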

Note that WWW::RobotRules skips "substring" (wildcard) rules such as:

User-agent: *
Disallow: *anytext*

so the URL http://example.com/some_anytext.html is passed (not banned).
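A quick way to check that behaviour yourself (same example host and rule as above):

    use strict;
    use warnings;
    use WWW::RobotRules;

    my $rules = WWW::RobotRules->new('TestBot/1.0');
    $rules->parse('http://example.com/robots.txt',
                  "User-agent: *\nDisallow: *anytext*\n");

    # The wildcard rule is not applied as a substring match, so this passes.
    print $rules->allowed('http://example.com/some_anytext.html')
        ? "allowed\n" : "blocked\n";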

Hmm, you don't seem to have even looked! On the first page of search results, I see various download engines that handle robots.txt automatically for you, and at least one that does exactly what you asked.
