
R getURL() returning empty string

Sorry about the title but I couldn't think how to phrase this one.

I am trying to scrape webpages for a study - they will be subjected to a battery of linguistic tests eventually.

In the meantime...

    require(RCurl)
    url1 <- "http://www.coindesk.com/bitinstants-charlie-shrem-sees-bitcoin-battles-ahead"  
    url2 <- "http://www.coindesk.com/terms-conditions/"

    html <- getURL(url1)   # read in page contents
    html
    [1] ""

    html <- getURL(url2)   # read in page contents
    html
    [1] "<!DOCTYPE html>\r\n<!--[if lt IE 7]> <html class=\"no-js ie ie6 oldie\" lang=\"en\"> <![endif]-->\r\n<!--[if IE 7]>    <html class=\"no-js ie ie7 oldie\" lang=\"en\"> <![endif]-->\r\n<!--[if IE 8]>......."

So, given two URLs for different pages on the same website, the request for url1 returns an empty string, but url2 works just fine.

I've tried adding a browser user agent as follows:

    html <- getURL(url1, .opts=list(useragent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13"))   # read in page contents

but that makes no difference, still an empty string.

I'm only on day two of learning R and now I AM STUMPED!

Can anyone suggest a reason why this is happening, or a solution?

To get this to work with RCurl, you need to use

    getURL(url1, .opts=curlOptions(followlocation = TRUE))

I wish I could tell you why. When looking at the requests in Chrome I don't see any redirects, but maybe I'm missing something.
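
One way to check from R whether a redirect is actually involved is to capture the response headers. Below is a minimal sketch using RCurl's basicHeaderGatherer; a 3xx status for url1 would explain why followlocation is needed:

    library(RCurl)
    h <- basicHeaderGatherer()                  # collects the response headers
    getURL(url1, headerfunction = h$update)     # body may still come back as ""
    h$value()[c("status", "statusMessage")]     # a 301/302 here would point to a redirect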

Note that you could also use the httr library

    library(httr)
    GET(url1)
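
httr follows redirects by default, which is probably why GET succeeds here. As a rough sketch (content() with as = "text" is standard httr usage), you can pull the page source out of the response like this:

    library(httr)
    resp <- GET(url1)
    status_code(resp)                    # should be 200 once any redirect is followed
    html <- content(resp, as = "text")   # the page source as a single character string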

I'm not exactly sure why getURL isn't working on that content, but htmlParse from package XML seems to get the content okay.

Try this:

    > library(XML)
    > htmlParse(url1)
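
If the goal is the linguistic tests you mention, here is a minimal follow-up sketch that pulls the visible paragraph text out of the parsed document with XPath (xpathSApply and xmlValue are part of the XML package; the "//p" selector is just an assumption about where the article text lives):

    > doc <- htmlParse(url1)
    > paragraphs <- xpathSApply(doc, "//p", xmlValue)   # text content of each <p> node
    > head(paragraphs)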
