How to get the latest file using “urllib2” by reading an HTML directory in Python

Question

I would like to read the latest file from an HTTP folder.

The 'releases' folder contains files named like 0001.tgz, 0002.tgz, 0003.tgz. How can I make sure 0003.tgz is the one selected?

import urllib2

url = "http://example.com/releases"
html = urllib2.urlopen(url).read()
...

Thanks. An example would be appreciated.

Answer 1

You can use BeautifulSoup or lxml to parse the directory index and find the latest file, which is presumably last in the index, based on your naming convention.

Something like this:

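from urlparse import urljoin  # Python 2 standard library
import urllib2

from bs4 import BeautifulSoup

# Trailing slash so that relative links resolve inside the folder
url = "http://example.com/releases/"
html = urllib2.urlopen(url).read()

# Name the parser explicitly to avoid a warning and parser-dependent results
soup = BeautifulSoup(html, "html.parser")

# Assumes the index lists the files in ascending name order
last_link = soup.find_all('a', href=True)[-1]

# hrefs in a directory index are usually relative, so resolve against url
latest_content = urllib2.urlopen(urljoin(url, last_link['href'])).read()
# do stuff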

If that won't work, grab all of the links using find_all and do some more careful parsing based on the filenames.
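
The "more careful parsing" could look roughly like the sketch below. It assumes the names follow the NNNN.tgz pattern from the question; `latest_release` and `RELEASE_RE` are hypothetical names chosen for illustration, and the function works on a plain list of href strings so it does not depend on any particular parser.

```python
import re

# Assumed naming convention from the question: digits followed by ".tgz"
RELEASE_RE = re.compile(r"^(\d+)\.tgz$")


def latest_release(hrefs):
    """Return the href with the highest release number, or None if none match."""
    best = None
    best_num = -1
    for href in hrefs:
        # Keep only the last path component, so "/releases/0003.tgz" also matches
        name = href.rstrip("/").rsplit("/", 1)[-1]
        match = RELEASE_RE.match(name)
        if match and int(match.group(1)) > best_num:
            best_num = int(match.group(1))
            best = href
    return best
```

With BeautifulSoup, the list of hrefs can be built with `[a['href'] for a in soup.find_all('a', href=True)]`. Comparing the numbers rather than taking the last link means sort-order links, parent-directory links, and unrelated files in the index are ignored.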

Answer 2

If the .tgz files are numbered sequentially, count down from the maximum possible number and stop the loop at the first file that exists, which is the newest one.

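import urllib2

html = None
for counter in xrange(9999, 0, -1):
    fyle = str(counter).zfill(4)  # pad with zeros, e.g. 3 -> "0003"
    url = "http://example.com/releases/" + fyle + ".tgz"
    try:
        ret = urllib2.urlopen(url)
    except urllib2.HTTPError:
        continue  # urlopen raises on 404 and similar, so skip missing files
    print "Exists:", fyle
    html = ret.read()  # contents of the newest file
    break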
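
Counting down from 9999 can mean thousands of failed requests when only a few releases exist. A variation, not part of the answer above, is to count up from 0001 and stop at the first missing file. The sketch below assumes the numbering starts at 1 and has no gaps; `url_exists` and `newest_sequential` are hypothetical helper names, and the import fallback is there only so the same code runs on Python 2 and Python 3.

```python
try:
    from urllib2 import urlopen, HTTPError  # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3
    from urllib.error import HTTPError


def url_exists(url):
    """Return True if the server answers the request without an HTTP error."""
    try:
        urlopen(url).close()
        return True
    except HTTPError:
        return False


def newest_sequential(exists, base="http://example.com/releases/",
                      width=4, limit=9999):
    """Count up from 1 and return the last file name that exists, or None.

    `exists` is any callable taking a URL and returning True or False.
    Assumes the numbering starts at 1 and has no gaps.
    """
    newest = None
    for counter in range(1, limit + 1):
        name = str(counter).zfill(width) + ".tgz"
        if not exists(base + name):
            break
        newest = name
    return newest
```

Calling `newest_sequential(url_exists)` probes the server. Passing the check in as a parameter also makes it possible to try the loop against a fake set of URLs without any network access.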

2 answers

solution1
2 ACCEPTED 2014-02-11 09:27:13

solution2
0 2014-02-11 09:32:47
