`requests` returns different results for the same script when run locally and on the cloud

I am trying to send a request to this link and get its HTML content in a Python script. I am using the `requests` library. The script runs perfectly well locally and gets the HTML content.

This script is just a test to make sure everything works before I use another script to scrape that same link. The scraper would run for 2-3 days, so I wanted to run it on the cloud.

However, when I ran the test script on a VM instance on Google Cloud Platform, it got the contents of a special page that the website serves to blacklisted IPs. This suggests that GCP's servers are blacklisted by the website. How can my IP be blacklisted if I have never used GCP for web scraping before?

How can I make sure that everything works as expected? Here is my code, which checks whether the HTML content belongs to the correct page or to the blacklist page (the blacklist page contains the word 'engineers' in its body, so I search for that word to identify the page type):

import requests

base = 'https://www.autotrader.com/car-values/'

headers = {}  # my headers go here; without them the script fails even locally

r = requests.get(base, headers=headers)
html = r.text  # decoded body; str(r.content) would give the repr of the raw bytes

# The blacklist page contains the word 'engineers' in its body
if 'engineers' in html:
    print('You landed on blacklisted page')
else:
    print('Successful')
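
A slightly more defensive version of the check above might also look at the HTTP status code, since a block does not always come back as a normal page. This is only a sketch: the helper name `classify_page` and the choice to treat 403 and 429 as a block are assumptions for illustration, not something the website documents.

```python
BLOCK_MARKER = "engineers"  # word the blacklist page is reported to contain


def classify_page(status_code, body, marker=BLOCK_MARKER):
    """Return 'blocked', 'error', or 'ok' for a fetched page (hypothetical helper)."""
    # Assumption: 403 (Forbidden) and 429 (Too Many Requests) indicate a block.
    if status_code in (403, 429):
        return "blocked"
    # Any other 4xx/5xx is treated as a generic failure.
    if status_code >= 400:
        return "error"
    # A successful response can still be the blacklist page; check the marker.
    if marker.lower() in body.lower():
        return "blocked"
    return "ok"
```

With the script above it would be called as `classify_page(r.status_code, r.text)`.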

I also deployed the same script on Streamlit Cloud, and it did not work there either. So I am guessing the website has some kind of protection that denies all cloud servers.

Please help me...

There are two possible reasons:

  • Either the website has already blacklisted the entire public IP range of every cloud provider,
  • Or the IP that you are using was previously used by another cloud customer for web scraping, and it has been blacklisted.
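
To illustrate the first case: cloud providers generally publish their public address ranges as CIDR blocks, so a website can test every incoming address against such a list. A minimal sketch of that membership test, using only the standard `ipaddress` module, could look like this. The function name `ip_in_ranges` and the sample ranges are assumptions; the ranges are documentation-only blocks, not real cloud ranges.

```python
import ipaddress


def ip_in_ranges(ip, cidrs):
    """Return True if `ip` falls inside any CIDR block in `cidrs`."""
    addr = ipaddress.ip_address(ip)
    for cidr in cidrs:
        net = ipaddress.ip_network(cidr, strict=False)
        # Only compare addresses and networks of the same IP version.
        if addr.version == net.version and addr in net:
            return True
    return False


# Placeholder ranges (reserved for documentation), NOT real cloud provider ranges.
SAMPLE_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]

print(ip_in_ranges("192.0.2.15", SAMPLE_RANGES))   # True
print(ip_in_ranges("203.0.113.7", SAMPLE_RANGES))  # False
```

If the VM's external IP falls inside a provider's published ranges, the first explanation is plausible. Changing to another IP from the same provider would then be unlikely to help.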
