Parsing the file name from a list of URL links

OK, so I am using a script that downloads files from the URLs listed in urls.txt.

import urllib.request

with open("urls.txt", "r") as file:
    linkList = file.readlines()
for link in linkList:
    urllib.request.urlretrieve(link)

Unfortunately they are saved as temporary files because my urllib.request.urlretrieve call lacks a second argument. As there are thousands of links in my text file, naming them separately is not an option. The thing is that the name of the file is contained in those links, i.e. /DocumentXML2XLSDownload.vm?firsttime=true&repengback=true&documentId=XXXXXX&xslFileName=rher2xml.xsl&outputFileName=XXXX_2017_06_25_4.xls, where the name of the file comes after outputFileName=

Is there an easy way to parse the file names and then pass them to urllib.request.urlretrieve as the second argument? I was thinking of extracting those names in Excel and placing them in another text file that would be read in the same way as urls.txt, but I'm not sure how to implement that in Python. Or is there a way to do it entirely in Python, without using Excel?

You could parse the link on the go.

Example using a regular expression:

import re
import urllib.request

# Match everything after "?outputFileName=" or "&outputFileName=" up to the next "&"
regexp = re.compile(r'(?<=[?&]outputFileName=)[^&]+')

with open("urls.txt", "r") as file:
    linkList = file.readlines()
for link in linkList:
    link = link.strip()  # drop the trailing newline left by readlines()
    match = regexp.search(link)

    if match is None:
        # Make the user aware that something went wrong, e.g. raise exception
        # and/or just print something
        print("WARNING: Couldn't find file name in link [" + link + "]. Skipping...")
    else:
        file_name = match.group(0)
        urllib.request.urlretrieve(link, file_name)

You can use urlparse and parse_qs to read the value from the query string. In Python 3, which the question uses, both live in urllib.parse:

from urllib.parse import urlparse, parse_qs

parse = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html?name=Python&version=2')
print(parse_qs(parse.query)['name'][0])  # prints Python
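Applied to the question's links, the same idea might look roughly like the sketch below. The helper names file_name_from_link and download_all are made up for illustration, and the code assumes every link in urls.txt is a complete URL that urlretrieve can fetch as-is. Taking os.path.basename of the value is an extra precaution that is not in the original answers; it keeps a value containing slashes from writing outside the current directory.

```python
import os
import urllib.request
from urllib.parse import urlparse, parse_qs


def file_name_from_link(link, param="outputFileName"):
    """Return the value of the `param` query parameter, or None if it is missing.

    `file_name_from_link` is a hypothetical helper name, not part of urllib.
    """
    query = urlparse(link.strip()).query
    values = parse_qs(query).get(param)
    if not values:
        return None
    # Keep only the last path component so the name stays in the current directory.
    return os.path.basename(values[0]) or None


def download_all(url_file="urls.txt"):
    """Download every link listed in `url_file`, naming each file after outputFileName."""
    with open(url_file, "r") as file:
        for line in file:
            link = line.strip()  # remove the trailing newline
            if not link:
                continue  # skip blank lines
            file_name = file_name_from_link(link)
            if file_name is None:
                print("WARNING: no outputFileName in [" + link + "]. Skipping...")
                continue
            urllib.request.urlretrieve(link, file_name)
```

Calling download_all() would then process urls.txt in one pass. Compared with the regular expression, parse_qs also decodes percent-escapes in the value, which may matter if a file name contains spaces or other encoded characters.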
