
Can't download a file from a website using Python

I'm a web-scraping newbie and I'm having trouble downloading some Excel files from a website, despite trying every .get method imaginable. I have been able to easily parse the HTML to get the URLs for every link on the page, but I'm not experienced enough to understand why on earth I cannot download the files (cookies, sessions, etc., no idea).

Here is the website:

https://mlcu.org.eg/ar/3118/%D9%82%D9%88%D8%A7%D8%A6%D9%85-%D9%85%D8%AC%D9%84%D8%B3-%D8%A7%D9%84%D8%A7%D9%85%D9%86-%D8%B0%D8%A7%D8%AA-%D8%A7%D9%84%D8%B5%D9%84%D8%A9

If you scroll down you'll find the 5 Excel file links, none of which I've been able to download (just search for id="AutoDownload").

When I try to use the requests.get method and save the file using

import requests

res = requests.get(url).content
with open(filename) as f:
    f.write(res)

I get an error that res is a bytes object (being written to a file opened in text mode), and when I view res as a variable, the output is:

b'<html><head><title>Request Rejected</title></head><body>The requested URL was rejected. 
Please consult with your administrator.<br><br>Your support ID is: 11190392837244519859</body></html>

Been trying for a while now, would really appreciate any help. Thanks a lot.

If you're not experienced enough to set all the correct parameters manually in your HTTP requests to avoid the "Request Rejected" error you're getting (for my part, I wouldn't be able to), I would advise you to use a higher-level approach such as Selenium.

Selenium can automate actions performed by a browser installed on your computer, such as downloading files (which is why it is used to automate tests on web apps as well as for web scraping). The idea is that an HTTP request generated by a real browser is harder for the host to reject than one you write by hand.

Tutorials for downloading files with Selenium are easy to find; a minimal sketch of the approach follows.
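A minimal sketch of that approach, assuming Firefox and geckodriver are installed (Selenium 4's By API; the download directory is a placeholder you would change):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

options = Options()
# Save downloads to a custom directory without prompting (placeholder path)
options.set_preference("browser.download.folderList", 2)
options.set_preference("browser.download.dir", "/tmp/downloads")
options.set_preference("browser.helperApps.neverAsk.saveToDisk",
                       "application/vnd.ms-excel")

driver = webdriver.Firefox(options=options)
driver.get("https://mlcu.org.eg/ar/3118/%D9%82%D9%88%D8%A7%D8%A6%D9%85-%D9%85%D8%AC%D9%84%D8%B3-%D8%A7%D9%84%D8%A7%D9%85%D9%86-%D8%B0%D8%A7%D8%AA-%D8%A7%D9%84%D8%B5%D9%84%D8%A9")

# Click every link marked id="AutoDownload"; Firefox performs the downloads
for link in driver.find_elements(By.CSS_SELECTOR, "a#AutoDownload"):
    link.click()

driver.quit()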

In order to download the files, you need to set the "User-Agent" field in the headers of your Python request. This can be done by passing a dict to the get function:

file = session.get(url, headers=my_headers)

Apparently, this host does not respond to requests coming from Python's requests library, which identify themselves with the following default User-Agent:

'User-Agent': 'python-requests/2.24.0'

With this in mind, if you pass another value for that field in the header of your request, for example one from Firefox (see below), the host thinks the request comes from a Firefox user and will respond with the actual file.
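As a quick sanity check, you can print the User-Agent your installed requests version sends by default (the exact version number will vary):

import requests
print(requests.utils.default_user_agent())  # e.g. 'python-requests/2.24.0'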

Here is the full version of the code:

import requests

# Headers copied from a real Firefox request; the User-Agent is the decisive field
my_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
    'Accept-Encoding': 'gzip, deflate',
    'Accept': '*/*',
    'Connection': 'keep-alive'
}

# url is one of the file links from the page; filename is the local path to save to
session = requests.Session()
file = session.get(url, headers=my_headers)

with open(filename, 'wb') as f:
    f.write(file.content)

The latest Firefox user agent worked for me, but many more possible values for that field can be found online.

So I finally came up with a solution using only requests and the standard Python HTML parser.

From what I found, the "Request Rejected" error is generally difficult to trace back to a precise cause. In this case, it was due to the absence of a user agent in the HTTP request.

import requests
from html.parser import HTMLParser

# Custom parser that collects the href of every <a id="AutoDownload"> link
link_urls = []
class AutoDownloadLinksHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a' and ('id', 'AutoDownload') in attrs:
            href = [attr[1] for attr in attrs if attr[0] == 'href'][0]
            link_urls.append(href)

# Get the links to the files
url = 'https://mlcu.org.eg/ar/3118/%D9%82%D9%88%D8%A7%D8%A6%D9%85-%D9%85%D8%AC%D9%84%D8%B3-%D8%A7%D9%84%D8%A7%D9%85%D9%86-%D8%B0%D8%A7%D8%AA-%D8%A7%D9%84%D8%B5%D9%84%D8%A9'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
links_page = requests.get(url, headers=headers)
AutoDownloadLinksHTMLParser().feed(links_page.content.decode('utf-8'))

# Download the files (the hrefs are relative, so prepend the host)
host = 'https://mlcu.org.eg'
for i, link_url in enumerate(link_urls):
    file_content = requests.get(host + link_url, headers=headers).content
    with open('file' + str(i) + '.xls', 'wb') as f:
        f.write(file_content)
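One possible refinement, not part of the original solution: if the server sends a Content-Disposition response header, the loop could keep the server's own file name instead of numbering the files; the header is optional, hence the fallback.

import re

for i, link_url in enumerate(link_urls):
    resp = requests.get(host + link_url, headers=headers)
    # Content-Disposition is optional, so fall back to a numbered name
    cd = resp.headers.get('Content-Disposition', '')
    match = re.search(r'filename="?([^";]+)"?', cd)
    name = match.group(1) if match else 'file' + str(i) + '.xls'
    with open(name, 'wb') as f:
        f.write(resp.content)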
