
Downloading File in Python (With Requests?)

What I'm trying to do is build a simple crawler to help me download guitar tabs from Ultimate-Guitar. I'm able to feed it a URL for a band and it will grab links for all tabs that are listed as 'Guitar Pro' tabs.

A typical link looks like this:

https://tabs.ultimate-guitar.com/a/agalloch/you_were_but_a_ghost_in_my_arms_guitar_pro.htm

What I'm able to do with this link is find the tab_id using the following code:

for tabid in tab.findAll("input", {"type" : "hidden", "name" : "id", "id" : "tab_id"}):
        tabID = tabid.get("value")

What I'm trying to do is use this to build a link to the actual download. Where I run into an issue is here. The best link I can build looks like this:

https://tabs.ultimate-guitar.com/tabs/download?id=904610

Note that the id at the end of that URL is the tab_id I referred to before.

This link immediately triggers a download when entered into a browser. My problem is that I can't find any way to generate a link that includes the actual filename. The filename should look something like [song name here].gp5. Other acceptable file types are .gpx, .gp4, and .gp3.

What I would like to do is get the actual filename so that I can save the file properly. A download named after the ID is useless to me, and I obviously need the proper extension. Is there some way to take the link above and start the download properly, or am I out of luck on this one? I'm certain there's a way to do what I need; I just don't have enough experience with this sort of thing. I'm pretty ignorant when it comes to requests and whatnot, so perhaps it's possible to feed this URL to something and get the download in return?

Note: if it's too difficult to get the actual file name along with the extension, I do have ideas for a workaround there, but I obviously at least need the appropriate extension.

The filename is included in the response's Content-Disposition header. You can parse it out with cgi.parse_header() and use the result to save the file (note that the cgi module is deprecated since Python 3.11 and was removed in Python 3.13):

>>> import cgi
>>> import requests
>>> r = requests.get('https://tabs.ultimate-guitar.com/tabs/download?id=904610')
>>> r.headers['Content-Disposition']
'attachment; filename="Agalloch - You Were But A Ghost In My Arms (Pro).gp5"'
>>> cgi.parse_header(r.headers['Content-Disposition'])[-1]['filename']
'Agalloch - You Were But A Ghost In My Arms (Pro).gp5'
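On Python versions where cgi is no longer available, the same header can be parsed with the standard library's email package instead. The following is only a sketch of that alternative; the helper name filename_from_content_disposition is made up for illustration and is not part of requests or the original answer.

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None.

    Hypothetical helper: it stores the raw header value on an empty
    email Message and lets get_filename() do the parameter parsing.
    """
    message = Message()
    message['Content-Disposition'] = header_value
    return message.get_filename()


# Example with the header value shown above:
header = 'attachment; filename="Agalloch - You Were But A Ghost In My Arms (Pro).gp5"'
print(filename_from_content_disposition(header))
```

With a live response, the argument would be r.headers.get('Content-Disposition', '').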

A complete function to do the downloading could look like this:

import cgi
import os
import requests
import shutil

def download_url(url, directory):
    """Download file from url to directory

    URL is expected to have a Content-Disposition header telling us what
    filename to use.

    Returns filename of downloaded file.

    """
    # stream=True defers reading the body until we copy it to disk
    response = requests.get(url, stream=True)
    if response.status_code != 200:
        raise ValueError('Failed to download')

    # parse_header returns (value, params); the filename is in params
    params = cgi.parse_header(
        response.headers.get('Content-Disposition', ''))[-1]
    if 'filename' not in params:
        raise ValueError('Could not find a filename')

    # basename() drops any directory components sent by the server
    filename = os.path.basename(params['filename'])
    abs_path = os.path.join(directory, filename)
    with open(abs_path, 'wb') as target:
        # have urllib3 undo gzip/deflate transfer encoding while reading
        response.raw.decode_content = True
        shutil.copyfileobj(response.raw, target)

    return filename
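Since the question stresses that the saved file needs one of the Guitar Pro extensions, it may be worth checking the parsed filename before writing anything to disk. The sketch below is one way to do that; the helper name safe_tab_filename and the ALLOWED_EXTENSIONS set are assumptions for illustration, built only from the extensions listed in the question.

```python
import os

# Extensions the question lists as acceptable Guitar Pro formats.
ALLOWED_EXTENSIONS = {'.gp3', '.gp4', '.gp5', '.gpx'}


def safe_tab_filename(header_filename):
    """Return a bare filename with an allowed extension, or raise ValueError.

    Hypothetical helper: header_filename is the value parsed from the
    Content-Disposition header (it may be None or empty).
    """
    # Drop any directory components so the file stays in the target directory.
    name = os.path.basename(header_filename or '')
    extension = os.path.splitext(name)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError('Unexpected or missing extension: %r' % name)
    return name


print(safe_tab_filename('Agalloch - You Were But A Ghost In My Arms (Pro).gp5'))
```

In download_url, this check could replace the plain os.path.basename() call, so that an HTML error page served with a 200 status is rejected rather than saved.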
