
How to get the latest file using "urllib2" by reading an HTML directory in Python

I would like to read the latest file from an HTTP folder.

The 'releases' folder contains files like 0001.tgz, 0002.tgz, 0003.tgz. How can I make sure 0003 is the one selected?

import urllib2

url = "http://example.com/releases"
html = urllib2.urlopen(url).read()
...

Thanks. Please give me an example.

You can use BeautifulSoup or lxml to parse the directory index and find the latest file, which, given your naming convention, is presumably the last one in the index.

Something like this:

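from bs4 import BeautifulSoup
from urlparse import urljoin
import urllib2

# The trailing slash matters: urljoin resolves relative links against it.
url = "http://example.com/releases/"
html = urllib2.urlopen(url).read()

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")

# Assumes the newest file is the last link in the index.
last_link = soup.find_all('a', href=True)[-1]

# Index links are usually relative, so resolve them against the directory URL.
latest_content = urllib2.urlopen(urljoin(url, last_link['href'])).read()
# do stuff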

If that doesn't work, grab all of the links using find_all and parse the filenames more carefully.
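As a rough illustration of that more careful parsing, the sketch below picks the highest-numbered archive from a list of href strings. The helper name `pick_latest` and the sample hrefs are made up for this example, and it assumes the file names are purely numeric with a `.tgz` extension, as in the question.

```python
import re

# Matches a purely numeric file name ending in .tgz, with or without a leading path.
TGZ_NAME = re.compile(r"(?:^|/)(\d+)\.tgz$")

def pick_latest(hrefs):
    """Return the href with the largest numeric file name, or None if none match."""
    best_href, best_num = None, -1
    for href in hrefs:
        match = TGZ_NAME.search(href)
        if match is None:
            continue  # skip "../", sort links, and other non-release entries
        num = int(match.group(1))
        if num > best_num:
            best_href, best_num = href, num
    return best_href

# Sample hrefs as they might appear in a directory index (hypothetical data).
hrefs = ["../", "?C=N;O=D", "0001.tgz", "0003.tgz", "0002.tgz"]
print(pick_latest(hrefs))  # prints 0003.tgz
```

With BeautifulSoup, the list could be built as `[a['href'] for a in soup.find_all('a', href=True)]`. Comparing the numbers rather than relying on link order means the result does not depend on how the server sorts its index.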

If the .tgz files are numbered sequentially, count down from the maximum and stop the loop at the first file that exists, which is the newest one.

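import urllib2

html = None
for counter in xrange(9999, 0, -1):
    fyle = str(counter).zfill(4)  # pad with zeros, e.g. 3 -> "0003"
    url = "http://example.com/releases/" + fyle + ".tgz"
    try:
        ret = urllib2.urlopen(url)
    except urllib2.HTTPError:
        continue  # urlopen raises on 404 and similar, so this file is missing
    print "Exists:", fyle
    html = ret.read()  # contents of the newest file
    break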
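Counting down from 9999 sends one request for every missing number before it reaches the newest file. If the numbering starts at 0001 and has no gaps (an assumption the question does not confirm), counting up and stopping at the first missing number needs far fewer requests. The sketch below shows only the search logic; `find_latest` and the `exists` callback are hypothetical names, and in practice `exists` would wrap the `urlopen` check.

```python
def find_latest(exists, max_number=9999):
    """Return the highest consecutive number for which exists(n) is true, or None."""
    latest = None
    for number in range(1, max_number + 1):
        if not exists(number):
            break  # first gap: the previous number was the newest file
        latest = number
    return latest

# Stand-in for a real HTTP check; here files 1 to 3 are pretended to exist.
published = {1, 2, 3}
latest = find_latest(lambda n: n in published)
print(str(latest).zfill(4) + ".tgz")  # prints 0003.tgz
```

The trade-off is that a single missing number in the sequence would make this stop early, whereas counting down finds the newest file even when there are gaps.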


 