Apache2 behind Cloudflare reverse proxy - "URL not available to Google", fetch failed: crawl anomaly

Google is unable to crawl my WordPress site behind a Cloudflare reverse proxy, even with all firewall settings turned off. I need Google to be able to crawl it.

I'm hosting WordPress on a subdomain (blog.domain.com) and using a Cloudflare reverse proxy to deliver the WordPress content under a subfolder (domain.com/resources). The main domain is hosted on AWS Elastic Beanstalk and directs requests for the blog to the WordPress server via the reverse proxy, which works as intended. A browser loads the content through the proxy without any problem; the only agent that appears to have an issue is Googlebot. Google is not blocked when crawling blog.domain.com directly. It is only blocked when accessing the content through the reverse proxy (domain.com/resources). All .htaccess and robots.txt files allow all bot traffic, and the reverse proxy has all firewall settings turned off. What is preventing Google from accessing my blog through the reverse proxy?

Apache2 .htaccess:

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>

# END WordPress

Apache2 robots.txt:

User-agent: *
Allow: /

I'm using the stock Apache2 config settings.

The expected result is that Googlebot can reach my pages under the domain subfolder (domain.com/resources), which is served through the reverse proxy, so that they are ultimately indexed by Google Search.

Try whitelisting Google's AS numbers in your Cloudflare IP Access Rules. Here are some AS numbers that I found belong to Google; I'm not sure which of them are used for crawlers. Be mindful that if you whitelist a whole AS number and any IP address in it proves to be malicious (e.g. attackers using Google Cloud Compute instances to launch bot attacks), Cloudflare can no longer protect your site from that traffic, since it will assume you want to allow it to reach your site.

Google ASN
https://ipinfo.io/AS396982
https://ipinfo.io/AS395973
https://ipinfo.io/AS36385
https://ipinfo.io/AS19527
https://ipinfo.io/AS16591
https://ipinfo.io/AS394699
https://ipinfo.io/AS36492
https://ipinfo.io/AS41264
https://ipinfo.io/AS36040
https://ipinfo.io/AS22577
https://ipinfo.io/AS45566
https://ipinfo.io/AS36384
https://ipinfo.io/AS15169
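
If you would rather script the rules than click through the dashboard, the following is a minimal sketch of how one allow rule per ASN could be prepared for the Cloudflare API. The endpoint path (`/zones/{zone_id}/firewall/access_rules/rules`) and the payload fields (`mode`, `configuration.target`, `configuration.value`, `notes`) are my assumptions based on the Cloudflare v4 API for IP Access Rules, so check them against the current documentation, and confirm that ASN targets are available for your account, before sending anything. The zone ID and API token are placeholders, and the helper names are made up for illustration. The script only builds and prints the requests; it does not send them.

```python
import json
import urllib.request

# Assumed base URL of the Cloudflare v4 API.
API_BASE = "https://api.cloudflare.com/client/v4"

# AS numbers copied from the list above.
GOOGLE_ASNS = [
    "AS396982", "AS395973", "AS36385", "AS19527", "AS16591",
    "AS394699", "AS36492", "AS41264", "AS36040", "AS22577",
    "AS45566", "AS36384", "AS15169",
]


def build_allow_rule(asn: str) -> dict:
    """Build the JSON body for one allow rule (field names are assumptions)."""
    return {
        "mode": "whitelist",
        "configuration": {"target": "asn", "value": asn},
        "notes": f"Allow Google ASN {asn}",
    }


def build_request(zone_id: str, api_token: str, asn: str) -> urllib.request.Request:
    """Prepare, but do not send, the POST request for one ASN."""
    url = f"{API_BASE}/zones/{zone_id}/firewall/access_rules/rules"
    body = json.dumps(build_allow_rule(asn)).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholders: substitute your own zone ID and API token.
    for asn in GOOGLE_ASNS:
        req = build_request("YOUR_ZONE_ID", "YOUR_API_TOKEN", asn)
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
        # To actually create the rule, you would pass req to
        # urllib.request.urlopen(req) and inspect the JSON response.
```

Printing the requests first lets you review exactly which ASNs would be allowed. Given the caveat above about malicious traffic from Google Cloud, you may want to trim the list before sending anything.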
