I am trying to send a request to this link and get its HTML content in a Python script. I am using the requests library. The script runs perfectly well locally and gets the HTML content.
This script is just a test to make sure everything is working before I use another script to scrape that same link. The scraper will run for 2-3 days, so I obviously want to run it in the cloud.
However, when I run that test script on a VM instance of Google Cloud Platform, it gets the contents of a special page that the website serves to blacklisted IPs. So it seems GCP's servers are blacklisted by the website. How can they be blacklisted when I have never used GCP for web scraping before?
How can I make sure that everything works as expected? Here is my code, which checks whether the HTML content belongs to the correct page or to the blacklist page (the blacklist page contains the word 'engineers' in its body, so I search for that word to identify the page type):
import requests

base = 'https://www.autotrader.com/car-values/'
headers = {}  # my headers go here; without them, even locally, the script won't work

r = requests.get(base, headers=headers)
HTML = str(r.content)

if 'engineers' in HTML:
    print('You landed on blacklisted page')
else:
    print('Successful')
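On the VM, it might also help to print a few more details about the response, just to confirm it really is the blacklist page and not a redirect or an error status. A quick sketch, reusing the r and HTML variables from the code above:

print('Status:', r.status_code)  # a block page may still return 200, which is why I check the body
print('Final URL:', r.url)       # shows whether the request got redirected somewhere else
print(HTML[:300])                # first part of the body, to eyeball which page was served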
I also deployed this same script on Streamlit Cloud, and it did not work there either. So I am guessing it is some kind of protection imposed by the website to block all cloud servers.
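One thing I was thinking of trying, to confirm the block is purely IP-based, is sending the same request through a proxy with a non-cloud IP. A rough sketch (the proxy URL and credentials below are just placeholders, not a real proxy I have):

import requests

base = 'https://www.autotrader.com/car-values/'
headers = {}  # same headers as above

# Placeholder host and credentials -- a real residential or rotating proxy would go here
proxies = {
    'http': 'http://USER:PASS@proxy.example.com:8080',
    'https': 'http://USER:PASS@proxy.example.com:8080',
}

r = requests.get(base, headers=headers, proxies=proxies, timeout=30)
print('You landed on blacklisted page' if 'engineers' in str(r.content) else 'Successful')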
Please help me...