
Downloading files with complicated name structures from HTTP

When I try to download a file using this code:

import urllib
urllib.urlretrieve("http://e4ftl01.cr.usgs.gov/MOLT/MOD11A1.005/2012.07.11/MOD11A1.A2012193.h22v10.005.2012196013617.hdf","1.hdf")

the file is correctly downloaded.

But my objective is to build a function that downloads files based on inputs that are parts of the file name.

There are many files on the webpage. Some parts of the file names are the same for every file (e.g. "/MOLT/MOD11A1.005/"), so these are not a problem. Other parts change from file to file following well-defined rules (e.g. "h22v10"), and I have handled these with %s (e.g. h%sv%s), so they aren't a problem either. The problem is that some parts of the names change without any rule (e.g. "2012196013617"). These parts of the name do not matter, and I want to ignore them. So I want to download files whose names contain the first two parts (the part that does not change and the part that changes under a rule) and WHATEVER else.

I thought I could use wildcards for WHATEVER, so I tried this:

  import urllib

  def download(url, date, h, v):
      urllib.urlretrieve("%s/MOLT/MOD11A1.005/%s/MOD11A1.*.h%sv%s.005.*.hdf" %
        (url, date, h, v), "2.hdf")

  download("http://e4ftl01.cr.usgs.gov", "2012.07.11", "22", "10")

This does not download the requested file, but instead generates an error file that says:

 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
 <html>
   <head>
     <title>404 Not Found</title>
   </head>
   <body>
     <h1>Not Found</h1>
     <p>The requested URL /MOLT/MOD11A1.005/2012.07.11/MOD11A1.*.h22v10.005.*.hdf was not found on this server.</p>
   </body>
 </html>

It seems like wildcards do not work with HTTP. Do you have any idea how to solve this?

> The problem is that some parts of the names change without any rule (e.g. "2012196013617"). These parts of the name do not matter, and I want to ignore them.

That is not possible. HTTP URLs do not support 'wildcards'. You must provide an existing URL.
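A common workaround is to apply the wildcard on the client side: fetch a page that lists the available files, collect its links, and keep the ones whose file name matches the pattern. The sketch below assumes Python 3 and assumes that the server returns an HTML index page for the date directory; `LinkCollector` and `match_links` are hypothetical names introduced for illustration, not part of any library.

```python
import fnmatch
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def match_links(listing_html, base_url, pattern):
    """Return absolute URLs of links whose file name matches a shell-style pattern."""
    collector = LinkCollector()
    collector.feed(listing_html)
    matches = []
    for href in collector.links:
        # Compare only the last path component against the wildcard pattern.
        file_name = href.rsplit("/", 1)[-1]
        if fnmatch.fnmatchcase(file_name, pattern):
            matches.append(urljoin(base_url, href))
    return matches
```

With the listing HTML obtained from something like `urllib.request.urlopen(base_url).read().decode()`, calling `match_links(html, base_url, "MOD11A1.*.h22v10.005.*.hdf")` would give the concrete URLs, and each one can then be passed to `urllib.request.urlretrieve` (the Python 3 counterpart of `urllib.urlretrieve`). Because the pattern must match the whole file name, companion files such as `….hdf.xml` are not selected.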

Here is a solution. It assumes that PartialName is a string holding the start of the file name (as much as is known and constant), that URLtoSearch is a string holding the URL of the page where the file is listed, and that FileExtension is a string of the form ".ext", such as ".mp3" or ".zip". URLtoSearch should end with "/", because the file name is appended to it directly.

import urllib2

def findURLFile(PartialName, URLtoSearch, FileExtension):
    # fetch the page that lists the files (for example a directory index)
    sourceURL = urllib2.urlopen(URLtoSearch)
    try:
        readURL = sourceURL.read()
    finally:
        # release the connection once the page has been read
        sourceURL.close()

    # index of the first occurrence of PartialName (-1 if it is absent)
    fileIndexStart = readURL.find(PartialName)
    if fileIndexStart == -1:
        return None

    # index of the first occurrence of the extension at or after PartialName
    extensionIndex = readURL.find(FileExtension, fileIndexStart)
    if extensionIndex == -1:
        return None

    # slice out the file name, including the full extension
    fileIndexEnd = extensionIndex + len(FileExtension)
    fileName = readURL[fileIndexStart:fileIndexEnd]

    # output the URL to download the file from
    downloadURL = URLtoSearch + fileName
    return downloadURL

I am rather new to Python, and this could probably benefit from some exception handling and perhaps a while loop. It works for what I need, but I will likely refine the code and make it more elegant.
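As one possible reading of the "while loop" idea, the same string search can be repeated to collect every matching file name on the page rather than only the first. This is a minimal sketch in Python 3 that works on page text already downloaded; `find_all_file_names` and its parameter names are hypothetical, chosen for illustration.

```python
def find_all_file_names(page_text, partial_name, file_extension):
    """Return each distinct substring that starts with partial_name and ends
    at the next occurrence of file_extension, in order of appearance."""
    names = []
    search_from = 0
    while True:
        start = page_text.find(partial_name, search_from)
        if start == -1:
            break  # no further occurrence of the known prefix
        extension_at = page_text.find(file_extension, start + len(partial_name))
        if extension_at == -1:
            break  # prefix found, but no extension follows it
        end = extension_at + len(file_extension)
        name = page_text[start:end]
        # Index pages usually repeat a name in the href and in the link text.
        if name not in names:
            names.append(name)
        search_from = end
    return names
```

Like the original function, this is plain substring matching, so it does not check that the text between the prefix and the extension is really part of one file name. If the prefix is followed by a different extension, the slice may run on to a later occurrence of file_extension, so the results are worth validating before downloading.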
