
How to get the latest file using “urllib2” by reading an HTML directory listing in Python

I would like to read the latest file from an HTTP folder.

The 'releases' folder contains files named like 0001.tgz, 0002.tgz, 0003.tgz. How can I make sure 0003.tgz is the one selected?

import urllib2

url = "http://example.com/releases"
html = urllib2.urlopen(url).read()
...

Thanks. An example would help.

You can use BeautifulSoup or lxml to parse the directory index and find the latest file, which is presumably last in the index, based on your naming convention.

Something like this:

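from bs4 import BeautifulSoup
from urlparse import urljoin
import urllib2

url = "http://example.com/releases"
response = urllib2.urlopen(url)
html = response.read()

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")

# Assumes the newest file is the last link in the index.
last_link = soup.find_all('a', href=True)[-1]

# Index hrefs are usually relative, so resolve them against the final page URL.
latest_url = urljoin(response.geturl(), last_link['href'])
latest_content = urllib2.urlopen(latest_url).read()
# do stuff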

If that won't work, grab all of the links using find_all and do some more careful parsing based on the filenames.
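One way to do that more careful parsing is sketched below. It is only an illustration: the helper name pick_latest and the pattern RELEASE_RE are hypothetical, and the pattern assumes the files are named with digits followed by .tgz, as in the question. The function takes plain href strings, so it does not depend on any particular parser.

```python
import re

# Hypothetical pattern: names such as 0001.tgz, following the question's scheme.
RELEASE_RE = re.compile(r'^(\d+)\.tgz$')

def pick_latest(hrefs):
    """Return the href with the highest release number, or None if none match."""
    best_href, best_num = None, -1
    for href in hrefs:
        # Compare only the last path component, so entries such as "../"
        # or sort links like "?C=N;O=D" are ignored.
        name = href.rstrip('/').rsplit('/', 1)[-1]
        match = RELEASE_RE.match(name)
        if match is None:
            continue
        num = int(match.group(1))  # numeric compare, so 0010 beats 0009
        if num > best_num:
            best_href, best_num = href, num
    return best_href
```

With the soup object from the example above, it could be called as pick_latest([a['href'] for a in soup.find_all('a', href=True)]), and the result resolved against the index URL before downloading.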

If the .tgz files are numbered sequentially, count down from the maximum possible number and stop the loop at the first file that exists, which is the newest one.

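import urllib2

html = None
for counter in xrange(9999, 0, -1):
    fyle = str(counter).zfill(4)  # pad with zeros, e.g. 3 -> "0003"
    url = "http://example.com/releases/" + fyle + ".tgz"
    try:
        ret = urllib2.urlopen(url)
    except urllib2.HTTPError:
        continue  # urlopen raises on 404 and similar, so this file is missing
    if ret.code == 200:
        print "Exists:", fyle
        html = ret.read()  # contents of the newest file
        break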
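Counting down from 9999 can mean thousands of requests before the first hit. If the numbering really is sequential with no gaps (an assumption), a variant is to count up from 0001 and stop at the first missing file. The sketch below separates that loop from the network call so it can be tried without a server; find_latest_sequential, release_name and the exists callable are hypothetical names, not part of urllib2.

```python
def release_name(n):
    """Format a release number the way the question names files, e.g. 3 -> '0003.tgz'."""
    return str(n).zfill(4) + ".tgz"

def find_latest_sequential(exists, start=1, limit=9999):
    """Count up from `start` and return the last n for which exists(n) is true.

    `exists` is a hypothetical callable taking an int and returning True or
    False, for example a wrapper that requests the file URL and returns False
    when the server answers with an error. Returns None if `start` is missing.
    """
    latest = None
    for n in range(start, limit + 1):
        if not exists(n):
            break  # first gap: the previous number was the newest
        latest = n
    return latest
```

With three files on the server this needs four requests instead of several thousand, at the cost of returning the wrong answer if a number in the sequence was skipped.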


 