
Can't download pictures from website using Python and requests

I'm practicing my web scraping skills in Python. I want to download images from a real estate website, www.immobilier.ch . I've done this successfully with other websites, but this time, when I save the content of the image URL, the saved file contains this:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access this resource.</p>
</body></html>

Does anyone know a way around this? As far as I understand, the website identifies me as a bot, but it's strange that I can scrape everything else except the pictures. I use the Requests library to download the pictures, os to save them to the right path, and the Selenium webdriver (Chrome). This is a sample of my code:

import os
import time
import requests

image_url = driver.find_element_by_class_name("im__col__content").find_element_by_tag_name("img").get_attribute("src")
path = "C:/Users/potek/Jupyter_projects/APARTMENTS"
with open(os.path.join(path, "Immobilier" + str(time.time()) + ".jpg"), "wb") as f:
    f.write(requests.get(image_url).content)  # was requests.get(i): i is undefined here, use the scraped URL

If you're using browser controllers like Selenium or Webbot, the headers sent to the server are valid, and the server won't be able to identify you as a bot unless your traffic is of much greater volume than would be expected, e.g. if you have 100 drivers open, all clicking ten times per second on images/links.

BUT, for the request you send directly to the image URL, you're not using the browser wrapper; you're using plain requests, which does not come with those headers for free. You have to set the headers manually to make the server think the request came from a legitimate browser, for example:

header = {'User-agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.2 Safari/605.1.15'}
res = requests.get(url = 'https://www.immobilier.ch/Medias/bory-cie-agence-immobiliere-sa-21/641557/images/NewThumbnail/20445175.jpg', headers = header)
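
Tying that back to your save logic, a minimal sketch might look like the following (the User-Agent string is just an example, any current browser UA works, and image_url can equally be the src you scraped via Selenium):

import os
import time
import requests

header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.2 Safari/605.1.15'}
path = "C:/Users/potek/Jupyter_projects/APARTMENTS"
image_url = "https://www.immobilier.ch/Medias/bory-cie-agence-immobiliere-sa-21/641557/images/NewThumbnail/20445175.jpg"  # or the src scraped by the driver

res = requests.get(image_url, headers=header)
res.raise_for_status()  # fail loudly on a 403 instead of silently writing the HTML error page to disk
with open(os.path.join(path, "Immobilier" + str(time.time()) + ".jpg"), "wb") as f:
    f.write(res.content)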

If the driver has a method to expose the headers it is already using, that would be a better solution, as some server-side legitimacy checks compare the number of different browser headers received from one IP address and temporarily block those as well. If you want to scrape a LOT of data for a long time, cycling through a dozen or so free proxy IP addresses, such as from https://free-proxy-list.net/uk-proxy.html , as well as a dozen or so headers will also help keep you undetected; see the sketch below.
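
Selenium doesn't expose the outgoing request headers directly, but you can read the browser's real User-Agent via JavaScript and copy its cookies into a requests session, then rotate proxies per request. A rough sketch, assuming an existing driver and a PROXIES list you've filled with live entries (the addresses below are placeholders):

import random
import requests

# Reuse the real browser identity from the existing Selenium driver.
user_agent = driver.execute_script("return navigator.userAgent")
session = requests.Session()
session.headers["User-Agent"] = user_agent
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"])

# Hypothetical proxy pool: replace with live entries, e.g. from free-proxy-list.net.
PROXIES = ["http://203.0.113.10:8080", "http://198.51.100.7:3128"]

def fetch_image(url):
    proxy = random.choice(PROXIES)  # pick a different exit IP per request
    return session.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)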
