scraping: download files from url

Question

I want to automatically download files from this page.

I tried many methods like:

download.file
read.table
GET

None of them worked. I am not asking for code, only for a hint or idea on how to deal with this kind of situation.

Answer 1

A Python version that uses BeautifulSoup.

try:
    # Python 3.x
    from urllib.request import urlopen, urlretrieve, quote
    from urllib.parse import urljoin
except ImportError:
    # Python 2.x
    from urllib import urlopen, urlretrieve, quote
    from urlparse import urljoin

from bs4 import BeautifulSoup

url = 'http://oilandgas.ky.gov/Pages/ProductionReports.aspx'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

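soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    # Skip anchors without an href and JavaScript pseudo-links.
    if not href or href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]
    # Percent-encode spaces but leave URL delimiters and existing escapes alone.
    href = urljoin(url, quote(href, safe='/:?=&%'))
    try:
        urlretrieve(href, filename)
    except Exception as exc:
        print('failed to download %s: %s' % (href, exc))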
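
The link-handling step in the loop above can be tried without touching the network. The following is only a sketch for Python 3: `resolve` is a hypothetical helper name, and the sample href is an assumption modelled on the report URLs, not something copied from the page.

```python
from urllib.parse import quote, unquote, urljoin

PAGE_URL = 'http://oilandgas.ky.gov/Pages/ProductionReports.aspx'

def resolve(href, base=PAGE_URL):
    """Return (absolute_url, local_filename), or None for unusable hrefs."""
    if not href or href.startswith('javascript:'):
        return None
    # Encode spaces; keep '/', ':' and existing '%xx' escapes untouched.
    absolute = urljoin(base, quote(href, safe='/:%'))
    # Decode the last path segment so the saved file gets a readable name.
    filename = unquote(href.rsplit('/', 1)[-1])
    return absolute, filename

# Hypothetical hrefs, for illustration only.
for sample in ['javascript:void(0)',
               '/Production Reports Library/2006 - gas Production.xls']:
    print(resolve(sample))
```

Decoding the filename with `unquote` is a design choice added here; the loop above keeps the last path segment exactly as it appears in the href.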

Answer 2

This works for me:

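getIt <- function(what, when) {
  # Build the report URL from the product ("oil" or "gas") and the year.
  url <- paste0("http://oilandgas.ky.gov/Production%20Reports%20Library/",
                when, "%20-%20", what,
                "%20Production.xls")
  destfile <- paste0("/tmp/", what, when, ".xls")
  # mode = "wb" keeps the binary .xls intact on Windows.
  download.file(url, destfile, mode = "wb")
}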

For example:

> getIt("gas",2006)
trying URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2006%20-%20gas%20Production.xls'
Content type 'application/vnd.ms-excel' length 3490304 bytes (3.3 Mb)
opened URL
==================================================
downloaded 3.3 Mb

It works for every file except the first one:

> getIt("oil",2010)
trying URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2010%20-%20oil%20Production.xls'
Error in download.file(url, destfile) : 
  cannot open URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2010%20-%20oil%20Production.xls'
In addition: Warning message:
In download.file(url, destfile) :
  cannot open: HTTP status was '404 NOT FOUND'

I can still get 2010's gas data, though:

> getIt("gas",2010)
trying URL 'http://oilandgas.ky.gov/Production%20Reports%20Library/2010%20-%20gas%20Production.xls'
Content type 'application/vnd.ms-excel' length 4177408 bytes (4.0 Mb)
opened URL
==================================================
downloaded 4.0 Mb

So it looks like they changed the system for that one link. You can get that data by following the link and then looking for the download link in the cruddy SharePoint HTML.

And this is why we hate SharePoint, kiddies.
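
As a rough idea of what "looking for the download link" could mean in practice, here is a small Python 3 sketch that uses only the standard library. `XlsLinkFinder` and `find_xls_links` are hypothetical names, and the sample markup is made up; the real SharePoint page will look different and may need a different filter.

```python
from html.parser import HTMLParser

class XlsLinkFinder(HTMLParser):
    """Collect href values of <a> tags that point at .xls files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and href.lower().endswith('.xls'):
            self.links.append(href)

def find_xls_links(html):
    """Return every .xls href found in the given HTML string."""
    finder = XlsLinkFinder()
    finder.feed(html)
    return finder.links

# Made-up markup for illustration; not taken from the real page.
sample = '''
<div><a href="javascript:void(0)">Menu</a>
<a href="/Some Library/report one.xls">Report one</a>
<a href="/Pages/Other.aspx">Other</a></div>
'''
print(find_xls_links(sample))
```

Each href found this way would still need to be joined with the page URL and percent-encoded before it is passed to a download function.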

2 answers

solution1
7 ACCEPTED 2014-01-08 04:39:57

solution2
5 2014-01-08 11:34:59
