How to get the latest file with "urllib2" by reading an HTML directory listing in Python

Question

I would like to read the latest file from an HTTP folder.

The 'releases' folder contains files such as 0001.tgz, 0002.tgz, 0003.tgz. How can I make sure 0003 is the one selected?

import urllib2

url = "http://example.com/releases"
html = urllib2.urlopen(url).read()
...

Thanks. Please give me an example.

Answer 1

You can use BeautifulSoup or lxml to parse the directory index and find the latest file, which is presumably the last one in the index given your naming convention.

Something like this:

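from bs4 import BeautifulSoup
from urlparse import urljoin
import urllib2

# Trailing slash so that relative hrefs resolve inside the directory.
url = "http://example.com/releases/"
html = urllib2.urlopen(url).read()

# Name the parser explicitly to avoid bs4's "no parser specified" warning.
soup = BeautifulSoup(html, "html.parser")

# Assumes the newest file is the last link in the index.
last_link = soup.find_all('a', href=True)[-1]

# hrefs in a directory index are usually relative, so resolve them against url.
latest_content = urllib2.urlopen(urljoin(url, last_link['href'])).read()
# do stuff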

If that doesn't work, grab all of the links using find_all and do some more careful parsing based on the filenames.
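
A more careful selection based on the filenames could look like the sketch below. It assumes the hrefs follow the numbered pattern from the question (0001.tgz, 0002.tgz, ...). `pick_latest`, `RELEASE_RE` and the sample list are hypothetical names made up for illustration; they are not part of BeautifulSoup or urllib2.

```python
import re

# Assumed naming convention from the question: digits followed by ".tgz".
RELEASE_RE = re.compile(r'(\d+)\.tgz$')

def pick_latest(hrefs):
    """Return the href with the highest release number, or None if none match."""
    best_href, best_num = None, -1
    for href in hrefs:
        match = RELEASE_RE.search(href)
        if match and int(match.group(1)) > best_num:
            best_href, best_num = href, int(match.group(1))
    return best_href

# Hypothetical hrefs, as they might come out of soup.find_all('a', href=True).
sample = ['../', '0001.tgz', '0003.tgz', '0002.tgz', '?C=M;O=D']
print(pick_latest(sample))
```

Comparing the numbers rather than relying on link order means the result does not depend on how the server sorts the index. The returned href is often relative, so it would still need to be joined with the directory URL before being opened.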

Answer 2

If the .tgz files are numbered sequentially, count down from the maximum and stop the loop at the first file that exists, which is the newest one.

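import urllib2

html = None
for counter in xrange(9999, 0, -1):
    fyle = str(counter).zfill(4)  # pad with zeros, e.g. 3 -> "0003"
    url = "http://example.com/releases/" + fyle + ".tgz"
    try:
        ret = urllib2.urlopen(url)
    except urllib2.HTTPError:
        continue  # urlopen raises on 404, so a missing file ends up here
    if ret.code == 200:
        print "Exists:", fyle
        html = ret.read()  # reuse the open response instead of requesting again
        break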
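
The countdown can also be separated from the network call, which makes it possible to try the logic without a server. In this sketch `find_newest` and `exists` are hypothetical names: `exists` stands for any function that reports whether a URL is there, for instance a small wrapper around urllib2.urlopen that returns False when HTTPError is raised. The `published` set is invented test data.

```python
def find_newest(exists, base_url, highest=9999):
    """Count down from `highest` and return the first file name that exists."""
    for counter in range(highest, 0, -1):
        name = str(counter).zfill(4) + ".tgz"  # e.g. 3 -> "0003.tgz"
        if exists(base_url + name):
            return name
    return None

# Stand-in for a real HTTP check: pretend only these three files are published.
published = set("http://example.com/releases/%04d.tgz" % n for n in (1, 2, 3))
print(find_newest(lambda url: url in published, "http://example.com/releases/"))
```

Note that with only a few releases this approach issues close to 9999 requests before it finds a hit, whereas reading the directory index needs a single request.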


Answer 1: score 2, accepted, 2014-02-11 09:27:13
Answer 2: score 0, 2014-02-11 09:32:47
