scraping: download files from url

Python version that uses BeautifulSoup:
try:
    # Python 3.x
    from urllib.request import urlopen, urlretrieve
    from urllib.parse import urljoin, quote
except ImportError:
    # Python 2.x
    from urllib import urlopen, urlretrieve, quote
    from urlparse import urljoin

from bs4 import BeautifulSoup

url = 'http://oilandgas.ky.gov/Pages/ProductionReports.aspx'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, 'html.parser')
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    if href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]
    href = urljoin(url, quote(href))
    try:
        urlretrieve(href, filename)
    except Exception as e:
        print('failed to download {}: {}'.format(filename, e))
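The quote-then-urljoin step above is what turns the relative, space-containing hrefs scraped from the page into URLs the server will actually serve. A minimal sketch of just that step (the helper name report_url is mine, not from the answer):

```python
from urllib.parse import urljoin, quote

def report_url(page_url, href):
    """Percent-encode a scraped href and resolve it against the page URL."""
    return urljoin(page_url, quote(href))

# Spaces in the library path become %20, and the absolute path replaces
# the /Pages/... path of the page the link was scraped from.
print(report_url('http://oilandgas.ky.gov/Pages/ProductionReports.aspx',
                 '/Production Reports Library/2006 - gas Production.xls'))
```

This prints http://oilandgas.ky.gov/Production%20Reports%20Library/2006%20-%20gas%20Production.xls, which matches the URL pattern the R answer below builds by hand.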
This works for me:
getIt = function(what, when) {
    url = paste0("http://oilandgas.ky.gov/Production%20Reports%20Library/",
                 when, "%20-%20", what,
                 "%20Production.xls")
    destfile = paste0("/tmp/", what, when, ".xls")
    download.file(url, destfile)
}
for example:
> getIt("gas",2006)
trying URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2006%20-%20gas%20Production.xls'
Content type 'application/vnd.ms-excel' length 3490304 bytes (3.3 Mb)
opened URL
==================================================
downloaded 3.3 Mb
EXCEPT for the first one:
> getIt("oil",2010)
trying URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2010%20-%20oil%20Production.xls'
Error in download.file(url, destfile) :
cannot open URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2010%20-%20oil%20Production.xls'
In addition: Warning message:
In download.file(url, destfile) :
cannot open: HTTP status was '404 NOT FOUND'
although I can get 2010's gas data:
> getIt("gas",2010)
trying URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2010%20-%20gas%20Production.xls'
Content type 'application/vnd.ms-excel' length 4177408 bytes (4.0 Mb)
opened URL
==================================================
downloaded 4.0 Mb
So it looks like they changed the system for that one link. You can get that data by following the link and then looking for the download link in the cruddy Sharepoint HTML.

And this is why we hate Sharepoint, kiddies.
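That last manual step can be semi-automated with BeautifulSoup, as in the Python answer above. This is only a sketch: find_xls_links is a hypothetical helper, the page structure is an assumption, and you would still have to fetch the 2010 oil page's HTML yourself first.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def find_xls_links(html, base_url):
    """Return absolute URLs for every <a href> ending in .xls in the page."""
    soup = BeautifulSoup(html, 'html.parser')
    return [urljoin(base_url, a['href'])
            for a in soup.find_all('a', href=True)
            if a['href'].lower().endswith('.xls')]
```

Feed it the HTML of the Sharepoint page the broken link leads to, then download whichever URL comes back with urlretrieve as before.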