
`requests` returns different results for the same script when run locally and on the cloud

I am trying to send a request to this link and get its HTML content in a Python script, using the requests library. The script runs perfectly well locally and gets the HTML content.

This script is just a test to make sure everything works before I use another script to scrape that same link. The scraper would run for 2-3 days, so I want to run it on the cloud.

When I ran the test script on a VM instance on Google Cloud Platform, it received a special page that the website serves to blacklisted IPs. This suggests that GCP's servers are blacklisted by the website. How can my IP be blacklisted if I have never used GCP for web scraping before?

How can I make sure that everything works as expected? Here is my code, which checks whether the HTML content belongs to the correct page or to the blacklist page (the blacklist page contains the word 'engineers' in the body, so I search for that word to identify the page type):

import requests

base = 'https://www.autotrader.com/car-values/'

# My headers go here. Without them the script won't work, even locally.
headers = {}

r = requests.get(base, headers=headers)
# r.text is the decoded body; str(r.content) would give the bytes repr ("b'...'").
html = r.text

# The blacklist page contains the word 'engineers' in its body.
if 'engineers' in html:
    print('You landed on blacklisted page')
else:
    print('Successful')

I also deployed the same script on Streamlit Cloud, and it did not work there either. So I am guessing the website has some kind of protection that denies all cloud servers.

Please help me...

There are two possible reasons:

  • Either the website has already blacklisted the entire public IP ranges of all cloud providers,
  • or the IP you are using was previously used by another cloud customer for web scraping and has already been blacklisted.
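
To illustrate the difference between the two cases, here is a minimal sketch of how a site might decide to block a client. It is only an assumption about how such a filter could work, not the website's actual logic. The names `BLOCKED_RANGES`, `BLOCKED_ADDRESSES` and `block_reason` are hypothetical, and the addresses are documentation-only placeholders (RFC 5737), not any real provider's ranges.

```python
import ipaddress

# Placeholder ranges reserved for documentation (RFC 5737).
# Assumption: a real filter would load a cloud provider's published ranges instead.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Hypothetical set of single addresses flagged after earlier scraping.
BLOCKED_ADDRESSES = {ipaddress.ip_address("203.0.113.7")}


def block_reason(ip):
    """Return 'range', 'address', or None if the client would be allowed."""
    addr = ipaddress.ip_address(ip)
    # Case 1: the whole provider range is blacklisted.
    if any(addr in net for net in BLOCKED_RANGES):
        return "range"
    # Case 2: this specific IP was blacklisted because of a previous user.
    if addr in BLOCKED_ADDRESSES:
        return "address"
    return None
```

If the whole range is blocked, getting a new external IP from the same provider is unlikely to help. If only the single address is blocked, a different IP may work.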
