`requests` returns different results for the same script when run locally and on the cloud
I am trying to send a request to this link and get its HTML content in a Python script, using the `requests` library. The script runs perfectly well locally and gets the HTML content.
This script is just a test to make sure everything works before I use another script to scrape the same link. The scraper would run for 2-3 days, so I wanted to run it on the cloud.
However, when I ran the test script on a VM instance of Google Cloud Platform, it received the contents of a special page that the website serves to blacklisted IPs. So it seems that GCP's servers are blacklisted by the website. How can that be if I have never used GCP for web scraping before?
How can I make sure that everything works as expected? Here is my code, which checks whether the HTML content is the correct page or the blacklist page (the blacklist page contains the word 'engineers' in the body, so I search for that word to identify the page type):
import requests

base = 'https://www.autotrader.com/car-values/'
headers = {}  # my headers go here; without them the script fails even locally

r = requests.get(base, headers=headers)
# r.text is the decoded body; str(r.content) would give the repr of a bytes object
html = r.text

if 'engineers' in html:
    print('You landed on blacklisted page')
else:
    print('Successful')
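When comparing the local run with the cloud run, it may help to look at the HTTP status code as well as the body, since a block page is often (though not always) served with a non-200 status. The following is a minimal sketch under that assumption; `classify_response` is a hypothetical helper, not part of `requests`, and the marker word is the one described in the question.

```python
def classify_response(status_code: int, html: str, marker: str = "engineers") -> str:
    """Label a response as 'blocked', 'http-<code>' or 'ok'.

    The marker check is case-insensitive, which is slightly looser than the
    original script's check (an assumption made for robustness).
    """
    if marker.lower() in html.lower():
        return "blocked"
    if status_code != 200:
        # Not the known block page, but not a normal success either
        return f"http-{status_code}"
    return "ok"


if __name__ == "__main__":
    # Made-up sample bodies, standing in for r.text from the real request
    print(classify_response(200, "<p>Our engineers have been notified</p>"))
    print(classify_response(200, "<h1>Car values</h1>"))
    print(classify_response(403, ""))
```

With the script above, this would be called as `classify_response(r.status_code, r.text)` in each environment, so the two results can be compared directly.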
I also deployed the same script on Streamlit Cloud, and it did not work there either. So I am guessing the website has some kind of protection that denies all cloud servers.
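One plausible mechanism behind this guess is filtering by address range: cloud providers generally publish the IP ranges they use, so a site can reject a whole range regardless of who is renting the VM. The sketch below only illustrates that kind of check with the standard `ipaddress` module. The CIDR blocks are placeholders taken from the RFC 5737 documentation ranges, not real GCP ranges, and `in_any_range` is a hypothetical helper; whether this particular website works this way is an assumption.

```python
import ipaddress
from typing import Iterable

# Placeholder blocks (RFC 5737 documentation ranges), NOT real cloud ranges
EXAMPLE_DATACENTER_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]


def in_any_range(ip: str, cidrs: Iterable[str]) -> bool:
    """Return True if ip falls inside at least one of the CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


if __name__ == "__main__":
    print(in_any_range("192.0.2.17", EXAMPLE_DATACENTER_RANGES))   # inside the first block
    print(in_any_range("203.0.113.5", EXAMPLE_DATACENTER_RANGES))  # outside both blocks
```

If the VM's external IP falls inside a range the provider publishes, that would be consistent with range-based blocking, which would also explain why a fresh account with no scraping history is affected.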
Please help me...
There are 2 reasons