Check whether a URL is blocked by robots.txt using Perl

Question

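Can anybody show me sample code to check whether a URL has been blocked by robots.txt? A robots.txt file can specify either a full URL or a directory. Is there a helper function for this in Perl?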

Answer 1

Have a look at WWW::RobotRules:

   The following methods are provided:

   $rules = WWW::RobotRules->new($robot_name)
  This is the constructor for WWW::RobotRules objects.  The first
  argument given to new() is the name of the robot.

   $rules->parse($robot_txt_url, $content, $fresh_until)
  The parse() method takes as arguments the URL that was used to
  retrieve the /robots.txt file, and the contents of the file.

   $rules->allowed($uri)
  Returns TRUE if this robot is allowed to retrieve this URL.
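The three methods above can be combined roughly as follows. This is a minimal sketch, not taken from the original answer: the robot name `MyBot/1.0`, the host `www.example.com`, the inline robots.txt content, and the helper name `build_rules` are all placeholders chosen for illustration. In a real crawler the content would be downloaded first (for example with LWP) and then handed to parse().

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Placeholder robots.txt content; normally this is fetched from the site.
my $robots_txt = <<'END';
User-agent: *
Disallow: /cgi-bin/
END

# Hypothetical helper: build a rules object for one robots.txt file.
sub build_rules {
    my ($robots_url, $content) = @_;
    my $rules = WWW::RobotRules->new('MyBot/1.0');   # assumed robot name
    $rules->parse($robots_url, $content);
    return $rules;
}

my $rules = build_rules('http://www.example.com/robots.txt', $robots_txt);

for my $url ('http://www.example.com/index.html',
             'http://www.example.com/cgi-bin/test.pl') {
    printf "%s is %s\n", $url, $rules->allowed($url) ? 'allowed' : 'blocked';
}
```

Note that allowed() only gives a meaningful answer for hosts whose robots.txt has already been passed to parse(), so the URLs being checked should be on the same host as the parsed file.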

Answer 2

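WWW::RobotRules is the standard class for parsing robots.txt files and then checking URLs to see whether they are blocked.

You may also be interested in LWP::RobotUA, which integrates it into LWP::UserAgent, automatically fetching and checking robots.txt files as needed.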
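A rough sketch of the LWP::RobotUA route is shown below. The agent name, the contact address, and the helper names `make_robot_ua` and `blocked_by_robots` are placeholders invented for this example. It also assumes that a request refused because of robots.txt comes back as a 403 response whose message mentions robots.txt; that detail is worth verifying against the LWP::RobotUA version you have installed.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Hypothetical helper: create a robot-aware user agent.
# Both the agent name and the contact address are placeholders.
sub make_robot_ua {
    my $ua = LWP::RobotUA->new(agent => 'MyBot/1.0', from => 'me@example.com');
    $ua->delay(1 / 60);   # delay is given in minutes; this is about one second
    return $ua;
}

# Hypothetical helper: guess whether a response was refused locally
# because of robots.txt (assumption: 403 plus a message naming robots.txt).
sub blocked_by_robots {
    my ($response) = @_;
    return ($response->code == 403 && $response->message =~ /robots\.txt/i) ? 1 : 0;
}

# Only touch the network when URLs are given on the command line.
if (@ARGV) {
    my $ua = make_robot_ua();
    for my $url (@ARGV) {
        my $response = $ua->get($url);
        printf "%s: %s\n", $url,
            blocked_by_robots($response) ? 'blocked by robots.txt'
                                         : $response->status_line;
    }
}
```

With this approach there is no separate check step: the user agent fetches robots.txt for each host on first contact and declines disallowed requests on its own.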

Answer 3

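Load the robots.txt file and search it for "Disallow:". Then check whether the pattern that follows "Disallow:" occurs in your URL. If it does, the URL is forbidden by robots.txt.

Example: you find the following line in robots.txt:

Disallow: /cgi-bin/

Now remove "Disallow:" and check whether "/cgi-bin/" (the remaining part) comes directly after the host name.

If your URL looks like: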

www.stackoverflow.com/cgi-bin/somwhatelse.pl

it is forbidden.

If your URL looks like:

www.stackoverflow.com/somwhatelse.pl

it is OK. You can find the complete set of rules at http://www.robotstxt.org/. This is the way to go if, for whatever reason, you cannot install additional modules.
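The manual check described above could be sketched like this, using core Perl only. The helper names `disallowed_paths` and `is_blocked` are made up for this example, and the parser is deliberately simplified: it only honours groups introduced by a single "User-agent: *" line and does plain prefix matching, so it is not a complete implementation of the rules on robotstxt.org.

```perl
use strict;
use warnings;

# Hypothetical helper: collect Disallow paths that apply to all robots.
# Simplification: a group is recognised only by its last User-agent line.
sub disallowed_paths {
    my ($robots_txt) = @_;
    my (@paths, $applies);
    for my $line (split /\r?\n/, $robots_txt) {
        $line =~ s/#.*//;            # strip comments
        $line =~ s/^\s+|\s+$//g;     # trim whitespace
        next unless length $line;
        if ($line =~ /^User-agent\s*:\s*(.*)$/i) {
            $applies = ($1 eq '*');
        }
        elsif ($applies && $line =~ /^Disallow\s*:\s*(.*)$/i) {
            push @paths, $1 if length $1;   # an empty Disallow blocks nothing
        }
    }
    return @paths;
}

# Hypothetical helper: true if the path part of $url starts with a disallowed prefix.
sub is_blocked {
    my ($url, @paths) = @_;
    # Drop an optional scheme and the host name, keeping only the path.
    (my $path = $url) =~ s{^(?:[a-z][a-z0-9+.-]*://)?[^/]+}{}i;
    $path = '/' unless length $path;
    for my $prefix (@paths) {
        return 1 if index($path, $prefix) == 0;
    }
    return 0;
}

my @paths = disallowed_paths("User-agent: *\nDisallow: /cgi-bin/\n");
for my $url ('www.stackoverflow.com/cgi-bin/somwhatelse.pl',
             'www.stackoverflow.com/somwhatelse.pl') {
    printf "%s is %s\n", $url, is_blocked($url, @paths) ? 'forbidden' : 'OK';
}
```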

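It is better to use a module from CPAN, though: there is a great one for handling this, LWP::RobotUA. LWP (libwww) is, IMHO, the standard for web access in Perl. This module is part of it and makes sure you behave well.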

Answer 4

WWW::RobotRules skips "substring" rules:

User-agent: *
Disallow: *anytext*

The URL http://example.com/some_anytext.html is passed (not blocked).
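Wildcards such as `*` are an extension to the original robots.txt convention that some crawlers honour, while plain prefix matching does not. If you need such rules, one option is to test them yourself alongside the module. The sketch below is an assumption-laden illustration: the helper name `wildcard_rule_matches` is invented here, and it treats `*` as "any sequence of characters" and a trailing `$` as "end of path", which may differ from what a particular crawler does.

```perl
use strict;
use warnings;

# Hypothetical helper: match a path against a Disallow rule with wildcards.
# Assumed semantics: '*' matches any characters, a trailing '$' anchors the end.
sub wildcard_rule_matches {
    my ($rule, $path) = @_;
    return 0 unless length $rule;             # an empty rule blocks nothing
    my $anchored = ($rule =~ s/\$\z//);       # remember and remove a trailing '$'
    # Quote the literal pieces and join them with '.*' where a '*' stood.
    my $regex = join '.*', map { quotemeta($_) } split /\*/, $rule, -1;
    $regex .= '\z' if $anchored;
    return $path =~ /\A$regex/ ? 1 : 0;
}

for my $path ('/some_anytext.html', '/index.html') {
    printf "%s is %s\n", $path,
        wildcard_rule_matches('*anytext*', $path) ? 'blocked' : 'allowed';
}
```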

Answer 5

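Hmm, you don't seem to have even looked! On the first page of search results I see various download engines that handle robots.txt for you automatically, and at least one that does exactly what you are asking for.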
