scraping: download files from url
Using BeautifulSoup, Python version:
try:
    # Python 3.x
    from urllib.request import urlopen, urlretrieve, quote
    from urllib.parse import urljoin
except ImportError:
    # Python 2.x
    from urllib import urlopen, urlretrieve, quote
    from urlparse import urljoin

from bs4 import BeautifulSoup

url = 'http://oilandgas.ky.gov/Pages/ProductionReports.aspx'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, 'html.parser')
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    if href.startswith('javascript:'):
        # skip SharePoint's javascript: pseudo-links
        continue
    filename = href.rsplit('/', 1)[-1]
    href = urljoin(url, quote(href))
    try:
        urlretrieve(href, filename)
    except IOError as e:
        print('failed to download %s: %s' % (href, e))
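The `quote()`/`urljoin()` pair in the loop above matters because the SharePoint hrefs contain spaces, which are not valid in a URL. A minimal check of that step (the href below is a hypothetical example of the pattern, not taken from the live page):

```python
from urllib.parse import urljoin, quote

base = 'http://oilandgas.ky.gov/Pages/ProductionReports.aspx'
# A root-relative href as it might appear in the page markup (hypothetical)
href = '/Production Reports Library/2006 - gas Production.xls'

# quote() percent-encodes the spaces (default safe='/' leaves slashes alone);
# urljoin() then resolves the root-relative path against the page URL.
full = urljoin(base, quote(href))
print(full)
# → http://oilandgas.ky.gov/Production%20Reports%20Library/2006%20-%20gas%20Production.xls
```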
This worked for me:
getIt <- function(what, when) {
  url <- paste0("http://oilandgas.ky.gov/Production%20Reports%20Library/",
                when, "%20-%20", what,
                "%20Production.xls")
  destfile <- paste0("/tmp/", what, when, ".xls")
  download.file(url, destfile)
}
For example:
> getIt("gas",2006)
trying URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2006%20-%20gas%20Production.xls'
Content type 'application/vnd.ms-excel' length 3490304 bytes (3.3 Mb)
opened URL
==================================================
downloaded 3.3 Mb
Except for the first one:
> getIt("oil",2010)
trying URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2010%20-%20oil%20Production.xls'
Error in download.file(url, destfile) :
cannot open URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2010%20-%20oil%20Production.xls'
In addition: Warning message:
In download.file(url, destfile) :
cannot open: HTTP status was '404 NOT FOUND'
Although I can get the gas data for 2010:
> getIt("gas",2010)
trying URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2010%20-%20gas%20Production.xls'
Content type 'application/vnd.ms-excel' length 4177408 bytes (4.0 Mb)
opened URL
==================================================
downloaded 4.0 Mb
So it looks like they changed the scheme for that one link. You can still reach that data by following the links on the page and then hunting for the download link in the cruddy SharePoint HTML.
And that, kids, is why we hate SharePoint.