
Apache2 behind Cloudflare reverse proxy - “URL not available to google” fetch failed: crawl anomaly

Google is unable to crawl my WordPress site behind a Cloudflare reverse proxy, even with all firewall settings turned off. I need Googlebot to be able to crawl it.

I'm hosting WordPress on a subdomain (blog.domain.com) and using a Cloudflare reverse proxy to serve the WordPress content under a subfolder (domain.com/resources). The main domain is hosted on AWS Elastic Beanstalk; requests for the blog are directed to the WordPress server via the reverse proxy, and this works as intended. Browsers load the content fine through the proxy, and the only agent that appears to have a problem is Googlebot. Google is not blocked when crawling blog.domain.com directly; it is blocked only when accessing content through the reverse proxy (domain.com/resources). All .htaccess and robots.txt files allow all bot traffic, and the reverse proxy has all firewall settings turned off. What is preventing Google from accessing my blog through the reverse proxy?

Apache2 .htaccess:

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>

# END WordPress

Apache2 robots.txt:

User-agent: *
Allow: /

I'm using stock Apache2 config settings.

Expected result: Googlebot can reach my pages under the domain subfolder (domain.com/resources), which is served through the reverse proxy, so that they are ultimately indexed by Google Search.
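One way to narrow this down outside Search Console is to request the same proxied URL with a browser User-Agent and with Googlebot's User-Agent string and compare the status codes. The following is a minimal sketch, not a definitive test: `PROXIED_URL` is a placeholder built from the question's example domain, and the helper names `build_request` and `fetch_status` are made up for illustration. A request that merely claims to be Googlebot from a non-Google IP may be treated differently from the real crawler, so a difference here is a hint rather than proof.

```python
import urllib.error
import urllib.request

# Placeholder taken from the question; replace with a real page behind the proxy.
PROXIED_URL = "https://domain.com/resources/"

USER_AGENTS = {
    # Generic desktop browser string (illustrative).
    "browser": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
               "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    # User-Agent string that Googlebot sends.
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                 "+http://www.google.com/bot.html)",
}


def build_request(url, agent_name):
    """Build a GET request that sends the chosen User-Agent header."""
    return urllib.request.Request(
        url, headers={"User-Agent": USER_AGENTS[agent_name]}
    )


def fetch_status(url, agent_name, timeout=10):
    """Return the HTTP status code; 4xx/5xx responses are returned, not raised."""
    request = build_request(url, agent_name)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


# Example (performs real network requests, so it is left commented out):
# for name in USER_AGENTS:
#     print(name, fetch_status(PROXIED_URL, name))
```

If the two status codes differ, the User-Agent is being filtered somewhere along the proxy path; if they match, the cause is more likely tied to the crawler's source IP addresses.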

Try whitelisting Google's AS numbers in your Cloudflare IP Access Rules. Here are some AS numbers that I found belong to Google; I'm not sure which of them are used for crawlers. Be mindful that if you whitelist a whole AS number and any IP address in it proves to be malicious (e.g. attackers using Google Cloud Compute instances to launch bot attacks), Cloudflare can no longer protect your site from it, since it will assume you want that traffic to reach your site.

Google ASN
https://ipinfo.io/AS396982
https://ipinfo.io/AS395973
https://ipinfo.io/AS36385
https://ipinfo.io/AS19527
https://ipinfo.io/AS16591
https://ipinfo.io/AS394699
https://ipinfo.io/AS36492
https://ipinfo.io/AS41264
https://ipinfo.io/AS36040
https://ipinfo.io/AS22577
https://ipinfo.io/AS45566
https://ipinfo.io/AS36384
https://ipinfo.io/AS15169
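Since it is unclear which of these AS numbers the crawlers use, it can help to check whether a given visitor IP really is a Google crawler before deciding what to whitelist. Google documents a reverse DNS lookup followed by a forward DNS lookup for this. Below is a minimal sketch of that check; the function name `is_google_crawler` and its injectable `reverse_lookup` / `forward_lookup` parameters are assumptions made for illustration, and the forward lookup shown covers IPv4 only.

```python
import socket

# Host name suffixes Google documents for its crawlers' reverse DNS names.
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def is_google_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the host suffix, then confirm
    that the host name resolves back to the same IP."""
    # The lookups are injectable so the logic can be exercised without DNS.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        host = reverse_lookup(ip)
    except OSError:  # no PTR record or resolver failure
        return False

    if not host.lower().rstrip(".").endswith(GOOGLE_CRAWLER_SUFFIXES):
        return False

    try:
        # gethostbyname_ex returns IPv4 addresses only.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

Behind Cloudflare, the origin server sees Cloudflare's addresses as the connecting peer, so the IP to check would be the visitor address that Cloudflare passes along in the `CF-Connecting-IP` request header.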


 