R getURL() returns an empty string

Question

Sorry about the title, but I couldn't think how to phrase this one.

I am trying to scrape webpages for a study; they will eventually be subjected to a battery of linguistic tests.

In the meantime...

    require(RCurl)
    url1 <- "http://www.coindesk.com/bitinstants-charlie-shrem-sees-bitcoin-battles-ahead"  
    url2 <- "http://www.coindesk.com/terms-conditions/"

    html <- getURL(url1)   # read in page contents
    html
    [1] ""

    html <- getURL(url2)   # read in page contents
    html
    [1] "<!DOCTYPE html>\r\n<!--[if lt IE 7]> <html class=\"no-js ie ie6 oldie\" lang=\"en\"> <![endif]-->\r\n<!--[if IE 7]>    <html class=\"no-js ie ie7 oldie\" lang=\"en\"> <![endif]-->\r\n<!--[if IE 8]>......."

So, given two URLs for different pages on the same website, the request for url1 returns an empty string, but url2 works just fine.

I've tried adding a browser user agent:

    html <- getURL(url1, .opts=list(useragent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13"))   # read in page contents

but that makes no difference; I still get an empty string.

I'm only on day two of learning R and now I AM STUMPED!

Can anyone suggest a reason why this is happening, or a solution?

Answer 1

To get this to work with RCurl, you need to use:

    getURL(url1, .opts=curlOptions(followlocation = TRUE))

I wish I could tell you why. When looking at the requests in Chrome I don't see any redirects, but maybe I'm missing something.
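One plausible explanation, which the fix above suggests but this thread does not confirm, is that the server answers url1 with a redirect whose body is empty, and getURL does not follow redirects unless asked to. A minimal sketch for checking this is to request the page without following redirects and look at the status code and any Location header. The helper names `find_location` and `inspect_response` are made up for illustration and are not part of RCurl; `basicHeaderGatherer()` and `getURL()` are RCurl functions.

```r
# Hypothetical helpers for illustration; they are not part of RCurl.

# Pick the redirect target out of a named character vector of response headers.
# Header names are matched case-insensitively; returns NA if there is none.
find_location <- function(headers) {
  hit <- headers[tolower(names(headers)) == "location"]
  if (length(hit) == 0) NA_character_ else unname(hit[[1]])
}

# Request a URL WITHOUT following redirects and summarise what came back.
inspect_response <- function(url) {
  h <- RCurl::basicHeaderGatherer()
  body <- RCurl::getURL(url, headerfunction = h$update, followlocation = FALSE)
  headers <- h$value()
  list(
    status     = headers[["status"]],      # HTTP status code, as a string
    location   = find_location(headers),   # redirect target, or NA
    body_chars = nchar(body)               # 0 matches the empty string seen above
  )
}

# Usage (needs network access and the RCurl package):
# inspect_response(url1)
```

If the status turns out to be in the 3xx range with a Location header, that would account for both the empty string and the fact that `followlocation = TRUE` fixes it.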

Note that you could also use the httr package:

    library(httr)
    GET(url1)
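`GET()` returns a response object rather than the page text, so for scraping you would still need to pull the body out of it. A minimal sketch, assuming the page is UTF-8 encoded; `fetch_html` is a hypothetical wrapper, not an httr function, and the optional `agent` argument is only there to mirror the user-agent attempt in the question:

```r
# Hypothetical wrapper for illustration; it is not part of httr.
# Fetches a page and returns its body as a single character string.
fetch_html <- function(url, agent = NULL) {
  resp <- if (is.null(agent)) {
    httr::GET(url)
  } else {
    httr::GET(url, httr::user_agent(agent))
  }
  httr::stop_for_status(resp)  # raise an R error on 4xx/5xx responses
  # Assumption: the page is UTF-8; adjust the encoding if it is not.
  httr::content(resp, as = "text", encoding = "UTF-8")
}

# Usage (needs network access and the httr package):
# html <- fetch_html(url1)
# nchar(html)
```

httr follows redirects by default, which is presumably why it returns content here without any extra options.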

Answer 2

I'm not exactly sure why getURL isn't working on that content, but htmlParse from the XML package seems to get the content okay.

Try this:

    library(XML)
    htmlParse(url1)
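Since the pages are headed for linguistic tests, the parsed document is probably more useful once the text has been pulled out of it. A rough sketch, assuming the text of interest sits in `<p>` elements (that will depend on the site); `extract_paragraphs` is a hypothetical helper, and the sample HTML is invented so the example runs without network access:

```r
# Hypothetical helper for illustration; it is not part of the XML package.
# Returns the trimmed text of every non-empty <p> element in a parsed document.
extract_paragraphs <- function(doc) {
  paras <- as.character(unlist(XML::xpathSApply(doc, "//p", XML::xmlValue)))
  paras <- trimws(paras)
  paras[nzchar(paras)]
}

# Invented sample page, parsed from a string rather than fetched from a URL.
sample_html <- "<html><body><h1>Title</h1><p> First paragraph. </p><p>Second.</p><p> </p></body></html>"
doc <- XML::htmlParse(sample_html, asText = TRUE)

paragraphs <- extract_paragraphs(doc)
print(paragraphs)
```

The same helper should work on the result of `htmlParse(url1)`, although the XPath expression may need adjusting to the page's actual markup.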

2 answers

Answer 1: 3, 2014-08-22 20:03:32

Answer 2: 0, accepted, 2014-08-22 17:58:07