
Robots.txt and Google Calendar

I'm looking for the best way to make sure I'm doing this correctly:

I have a calendar on my website from which users can take the iCal feed and import it into the external calendar of their preference (Outlook, iCal, Google Calendar, etc.).

To deter bad actors from crawling/searching my website for the *.ics files, I've set up robots.txt to disallow the folders in which the feeds are stored.

So, essentially, an iCal feed might look like: webcal://www.mysite.com/feeds/cal/a9d90309dafda390d09/feed.ics
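
For instance, given the /feeds/ directory in that URL, the blocking rule might look something like this (a sketch; adjust the path to wherever your feeds actually live):

User-agent: *
Disallow: /feeds/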

I understand the above is still a public URL. However, I have a function that lets users change the address of their feed if they want.
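
A minimal sketch of what that regeneration step could look like, assuming the random path segment is a token stored with the calendar record (the function, table, and column names here are hypothetical):

<?php
// Hypothetical helper: issue a new unguessable token for a calendar's feed URL.
function regenerate_feed_token(PDO $db, int $calendarId): string {
    $token = bin2hex(random_bytes(16)); // 32 hex chars, cryptographically random
    $stmt = $db->prepare('UPDATE calendars SET feed_token = ? WHERE id = ?');
    $stmt->execute([$token, $calendarId]);
    return $token; // the feed URL then becomes /feeds/cal/<token>/feed.ics
}
?>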

My question is: all external calendars have no problem importing/subscribing to the calendar feed, except for Google Calendar. It throws the message: "Google was unable to crawl the URL due to a robots.txt restriction" (see Google's answer to this).

Consequently, after searching around, I've found that the following works:

1) Set up a PHP file (which I am using) that essentially forces a download of the file. It basically looks like this:

<?php
// Serve a locally stored .ics feed through a script that robots.txt does not block.
$base = "/home/path/to/local/feed/";
$path = realpath($base . $_GET['url']);

// realpath() resolves any "../" tricks; reject paths outside the feed directory.
if ($path === false || strpos($path, $base) !== 0 || !is_readable($path)) {
    header("HTTP/1.1 404 Not Found");
    echo "Unable to open feed file.\n";
    exit;
}

// Send the proper MIME type so calendar clients recognize the feed.
header("Content-Type: text/calendar; charset=utf-8");
readfile($path);
?>
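
With that in place, the published feed URL points at the script rather than the raw file, along the lines of webcal://www.mysite.com/feed.php?url=a9d90309dafda390d09/feed.ics (feed.php is a placeholder name for the script above).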

I tried using this script, and it appeared to work with Google Calendar with no issues. (Although I'm not sure if it updates/refreshes yet; I'm still waiting to see if this works.)

My question is this: is there a better way to approach such an issue? I'd like to keep the current robots.txt in place to disallow crawling of my directories for *.ics files and keep the files hidden.

I recently had this problem and this robots.txt works for me.

User-agent: Googlebot
Allow: /*.ics$
Disallow: /

User-agent: *
Disallow: /

This allows access to any .ics file if the address is known, and prevents bots from crawling the site (it's a private server). You will want to change the Disallow rule for your own server.

I don't think the Allow directive is part of the original spec, but some bots seem to support it. Here is Google's Webmaster Tools help page on robots.txt: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449

It looks to me like you have two problems:

  1. Preventing badly behaved bots from accessing the website.
  2. After installing robots.txt, allowing Googlebot to access your site.

The first problem cannot be solved by robots.txt. As Marc B points out in the comments, robots.txt is a purely voluntary mechanism. To block bad bots once and for all, I suggest using some kind of behavior-analysis program/firewall to detect bad bots and deny access from their IPs.
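
As a minimal illustration of that approach (the addresses below are documentation-range placeholders, not real offenders), an Apache 2.4 .htaccess file can deny flagged IPs:

<RequireAll>
    Require all granted
    Require not ip 203.0.113.0/24
    Require not ip 198.51.100.42
</RequireAll>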

For the second problem, robots.txt does allow you to whitelist a particular bot; see http://facebook.com/robots.txt as an example. Note that Google identifies its bots under different names (for AdSense, search, image search, mobile search), and I am not sure whether the Google Calendar bot uses the generic Googlebot name or not.
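
A whitelist in that style might look like the following, where an empty Disallow value means "allow everything" for that user agent:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /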
