Scraping: download files from a URL

I want to automatically download files from this page.

I tried many methods, such as:

download.file
read.table
GET

None of them worked. I am not asking for code, just for any hint or idea on how to deal with this kind of situation.

A Python version that uses BeautifulSoup:

try:
    # Python 3.x
    from urllib.request import urlopen, urlretrieve, quote
    from urllib.parse import urljoin
except ImportError:
    # Python 2.x
    from urllib import urlopen, urlretrieve, quote
    from urlparse import urljoin

from bs4 import BeautifulSoup

url = 'http://oilandgas.ky.gov/Pages/ProductionReports.aspx'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

# Name the parser explicitly so bs4 does not guess one (and warn about it).
soup = BeautifulSoup(html, 'html.parser')
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    # Skip anchors without an href as well as javascript: pseudo-links.
    if not href or href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]
    href = urljoin(url, quote(href))
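    try:
        urlretrieve(href, filename)
    except Exception as exc:
        # Report which link failed and why, then move on to the next one.
        print('failed to download %s: %s' % (href, exc))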
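
The quoting and joining step in the loop above is easy to get wrong, so here is a minimal, Python 3 only sketch of just that step. The helper name `resolve_link` and the sample href are assumptions made for illustration; the href is merely shaped like the report links, not copied from the page.

```python
from urllib.parse import quote, urljoin

PAGE_URL = 'http://oilandgas.ky.gov/Pages/ProductionReports.aspx'


def resolve_link(page_url, href):
    """Return (absolute_url, local_filename) for an href found on page_url."""
    # The last path segment becomes the local file name.
    filename = href.rsplit('/', 1)[-1]
    # quote() keeps '/' by default and turns each space into %20;
    # urljoin() then resolves the site-relative path against the page URL.
    absolute_url = urljoin(page_url, quote(href))
    return absolute_url, filename


# Hypothetical href, assumed to look like the links on the report page.
sample_href = '/Production Reports Library/2006 - gas Production.xls'
print(resolve_link(PAGE_URL, sample_href))
```

Note that `quote()` would also escape the colon of an absolute `http://` href and the `%` of an already encoded one, so this sketch only suits unencoded, site-relative links.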

This works for me (in R):

getIt = function(what,when){ 
     url=paste0("http://oilandgas.ky.gov/Production%20Reports%20Library/",
                 when,"%20-%20",what,
                "%20Production.xls")
     destfile=paste0("/tmp/",what,when,".xls")
     download.file(url,destfile)
}

For example:

> getIt("gas",2006)
trying URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2006%20-%20gas%20Production.xls'
Content type 'application/vnd.ms-excel' length 3490304 bytes (3.3 Mb)
opened URL
==================================================
downloaded 3.3 Mb

Except for the first one:

> getIt("oil",2010)
trying URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2010%20-%20oil%20Production.xls'
Error in download.file(url, destfile) : 
  cannot open URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2010%20-%20oil%20Production.xls'
In addition: Warning message:
In download.file(url, destfile) :
  cannot open: HTTP status was '404 NOT FOUND'

I can still get the 2010 gas data, though:

> getIt("gas",2010)
trying URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2010%20-%20gas%20Production.xls'
Content type 'application/vnd.ms-excel' length 4177408 bytes (4.0 Mb)
opened URL
==================================================
downloaded 4.0 Mb
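
For readers working in Python, the URL pattern behind `getIt` can be sketched roughly as follows. The names `report_url` and `get_it` and the `dest_dir` parameter are hypothetical; the sketch assumes the same URL layout as the R function and simply skips a report when the server answers with an HTTP error, such as the 404 shown above.

```python
from urllib.error import HTTPError
from urllib.request import urlretrieve

BASE = 'http://oilandgas.ky.gov/Production%20Reports%20Library/'


def report_url(what, when):
    """Build the report URL the same way the R function getIt does."""
    return '{}{}%20-%20{}%20Production.xls'.format(BASE, when, what)


def get_it(what, when, dest_dir='/tmp'):
    """Try to download one report; return the local path, or None on an HTTP error."""
    url = report_url(what, when)
    dest = '{}/{}{}.xls'.format(dest_dir, what, when)
    try:
        urlretrieve(url, dest)
    except HTTPError as err:
        # For example, the 404 returned for the 2010 oil report.
        print('skipping {}: HTTP {}'.format(url, err.code))
        return None
    return dest
```

Calling `get_it('gas', 2006)` would attempt the same download as the first R example; nothing is fetched until the function is called.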

So it looks like they changed the system for that one link. You can get that data by following the link and then looking for the download link in the cruddy SharePoint HTML.
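
Looking for the download link in a page's HTML could be sketched like this. It uses only the standard-library `html.parser` rather than BeautifulSoup, and it assumes the link of interest is an `<a>` tag whose href ends in `.xls`. The class name, the function name, and the sample HTML are made up for illustration; the real SharePoint markup may well differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class XlsLinkFinder(HTMLParser):
    """Collect href values of <a> tags that point at .xls files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        # Ignore anchors without an href and anything that is not an .xls file.
        if href and href.lower().endswith('.xls'):
            self.links.append(href)


def find_xls_links(html, base_url):
    """Return absolute URLs of all .xls links found in html."""
    finder = XlsLinkFinder()
    finder.feed(html)
    return [urljoin(base_url, href) for href in finder.links]


# Made-up fragment standing in for the page behind the link.
SAMPLE = '''
<div><a href="javascript:void(0)">menu</a>
<a href="/Docs/report.xls">Download</a>
<a href="/Pages/Other.aspx">Other</a></div>
'''
print(find_xls_links(SAMPLE, 'http://example.com/Pages/Item.aspx'))
```

In practice the HTML would come from fetching the page the link leads to, and any href containing spaces would still need quoting before download.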

And this is why we hate SharePoint, kiddies.
