`requests` returns different results for the same script when run locally and on the cloud

Question

I am trying to send a request to this link and get its HTML content in a Python script. I am using the requests library. The script runs perfectly well locally and gets the HTML content.

This script is just a test to make sure everything works before I use another script to scrape the same link. The scraper would run for 2-3 days, so I wanted to run it on the cloud.

However, when I ran the test script on a Google Cloud Platform VM instance, it received the contents of a special page that the website serves to blacklisted IPs. This suggests that GCP's servers are blacklisted by the website. How can the IP be blacklisted if I have never used GCP for web scraping before?

How can I make sure that everything works as expected? Here is my code, which checks whether the HTML content is the correct page or the blacklist page (the blacklist page contains the word 'engineers' in the body, so I search for that word to identify the page type):

import requests

base = 'https://www.autotrader.com/car-values/'

headers = {}  ## my headers go here. Without them, even locally, the script won't work

r = requests.get(base, headers=headers)
html = r.text  # decoded body; str(r.content) would give the repr of a bytes object

if 'engineers' in html:
    print('You landed on blacklisted page')
else:
    print('Successful')

I also deployed the same script on Streamlit Cloud, and it did not work there either. So I am guessing the website uses some kind of protection that denies all cloud servers.

Please help me...

Answer 1

There are two possible reasons:

Either the website has already blacklisted the entire public IP range of every cloud provider,
or the IP you are using was previously used by another cloud customer for web scraping and has been blacklisted.
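
One way to see which situation applies is to check whether the VM's external IP falls inside a range that the cloud provider publishes. The following is only a minimal sketch: the helper names `parse_prefixes` and `ip_in_ranges` are hypothetical, the `prefixes` / `ipv4Prefix` / `ipv6Prefix` field names are assumed from the JSON format Google publishes for its Cloud ranges, and `SAMPLE_DOC` uses reserved documentation addresses rather than real provider data.

```python
import ipaddress

# Hypothetical sample shaped like a provider's published range list.
# The addresses are reserved documentation ranges, not real cloud ranges.
SAMPLE_DOC = {
    "prefixes": [
        {"ipv4Prefix": "203.0.113.0/24"},
        {"ipv6Prefix": "2001:db8::/32"},
    ]
}


def parse_prefixes(doc):
    """Collect CIDR strings from a dict with a 'prefixes' list (assumed format)."""
    cidrs = []
    for entry in doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            cidrs.append(cidr)
    return cidrs


def ip_in_ranges(ip, cidrs):
    """Return True if the address belongs to any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    for cidr in cidrs:
        net = ipaddress.ip_network(cidr, strict=False)
        # Only compare addresses and networks of the same IP version.
        if addr.version == net.version and addr in net:
            return True
    return False


cidrs = parse_prefixes(SAMPLE_DOC)
print(ip_in_ranges("203.0.113.7", cidrs))   # address inside the sample IPv4 block
print(ip_in_ranges("198.51.100.1", cidrs))  # address outside every sample block
```

In practice the range list would be loaded from the provider's published file and compared against the VM's external IP. A match only shows that the address is identifiable as a cloud address. It does not prove that the website blocks the whole range, and it does not rule out the second reason.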

2 2020-12-16 19:44:25
