
R getURL() returning empty string

Sorry about the title but I couldn't think how to phrase this one.

I am trying to scrape webpages for a study - they will be subjected to a battery of linguistic tests eventually.

In the meantime...

    require(RCurl)
    url1 <- "http://www.coindesk.com/bitinstants-charlie-shrem-sees-bitcoin-battles-ahead"  
    url2 <- "http://www.coindesk.com/terms-conditions/"

    html <- getURL(url1)   # read in page contents
    html
    [1] ""

    html <- getURL(url2)   # read in page contents
    html
    [1] "<!DOCTYPE html>\r\n<!--[if lt IE 7]> <html class=\"no-js ie ie6 oldie\" lang=\"en\"> <![endif]-->\r\n<!--[if IE 7]>    <html class=\"no-js ie ie7 oldie\" lang=\"en\"> <![endif]-->\r\n<!--[if IE 8]>......."

So, given two URLs for different pages on the same website, the request for url1 returns an empty string, but url2 works just fine.

I've tried adding a browser user agent as follows:

    html <- getURL(url1, .opts=list(useragent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13"))   # read in page contents

but that makes no difference, still an empty string.

I'm only on day two of learning R and now I AM STUMPED!

Can anyone suggest a reason why this is happening, or a solution?

To get this to work with RCurl, you need to use

    getURL(url1, .opts=curlOptions(followlocation = TRUE))

I wish I could tell you why. When looking at the requests in Chrome I don't see any redirects, but maybe I'm missing something.
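
One way to check from R whether a redirect is actually involved is to capture the response headers. Below is a minimal sketch using RCurl's basicHeaderGatherer; a 3xx status for url1 would explain why followlocation is needed:

    library(RCurl)
    h <- basicHeaderGatherer()                  # collects the response headers
    getURL(url1, headerfunction = h$update)     # body may still come back as ""
    h$value()[c("status", "statusMessage")]     # a 301/302 here would point to a redirect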

Note that you could also use the httr library

    library(httr)
    GET(url1)
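
httr follows redirects by default, which is probably why GET succeeds here. As a rough sketch (content() with as = "text" is standard httr usage), you can pull the page source out of the response like this:

    library(httr)
    resp <- GET(url1)
    status_code(resp)                    # should be 200 once any redirect is followed
    html <- content(resp, as = "text")   # the page source as a single character string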

I'm not exactly sure why getURL isn't working on that content, but htmlParse from package XML seems to get the content okay.

Try this:

    > library(XML)
    > htmlParse(url1)
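
If the goal is the linguistic tests you mention, here is a minimal follow-up sketch that pulls the visible paragraph text out of the parsed document with XPath (xpathSApply and xmlValue are part of the XML package; the "//p" selector is just an assumption about where the article text lives):

    > doc <- htmlParse(url1)
    > paragraphs <- xpathSApply(doc, "//p", xmlValue)   # text content of each <p> node
    > head(paragraphs)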
