
R getURL() returning empty string

Sorry about the title, but I couldn't think how to phrase this one.

I am trying to scrape webpages for a study; they will eventually be subjected to a battery of linguistic tests.

In the meantime...

    require(RCurl)
    url1 <- "http://www.coindesk.com/bitinstants-charlie-shrem-sees-bitcoin-battles-ahead"  
    url2 <- "http://www.coindesk.com/terms-conditions/"

    html <- getURL(url1)   # read in page contents
    html
    [1] ""

    html <- getURL(url2)   # read in page contents
    html
    [1] "<!DOCTYPE html>\r\n<!--[if lt IE 7]> <html class=\"no-js ie ie6 oldie\" lang=\"en\"> <![endif]-->\r\n<!--[if IE 7]>    <html class=\"no-js ie ie7 oldie\" lang=\"en\"> <![endif]-->\r\n<!--[if IE 8]>......."

So given two URLs for different pages on the same website, the request for url1 returns an empty string, but url2 works just fine.

I've tried adding a browser user agent:

    html <- getURL(url1, .opts=list(useragent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13"))   # read in page contents

but that makes no difference; I still get an empty string.

I'm only on day two of learning R and now I AM STUMPED!

Can anyone suggest a reason why this is happening, or a solution?

To get this to work with RCurl, you need to use:

    getURL(url1, .opts=curlOptions(followlocation = TRUE))

I wish I could tell you why. When looking at the requests in Chrome I don't see any redirects, but maybe I'm missing something.
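One way to check whether a redirect is involved is to look at the status and headers that RCurl itself receives, rather than what the browser shows. The following is only a sketch: it assumes the server answers with a 3xx status and a `Location` header, which nothing in this thread confirms, and `check_redirect` is a hypothetical helper name.

```r
library(RCurl)

# Hypothetical helper: fetch a URL WITHOUT following redirects and report
# the status code, any Location header, and the size of the body received.
check_redirect <- function(url) {
  h <- basicHeaderGatherer()                      # collects response headers
  body <- getURL(url, headerfunction = h$update)
  headers <- h$value()                            # named character vector
  # Header names are case-insensitive, so match on lower case.
  idx <- match("location", tolower(names(headers)))
  list(
    status      = headers[["status"]],
    location    = if (is.na(idx)) NA_character_ else headers[[idx]],
    body_length = nchar(body)
  )
}

# Usage (needs network access):
# check_redirect(url1)
```

If `status` comes back as 301 or 302 with a zero-length body, that would explain why `followlocation = TRUE` makes the difference.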

Note that you could also use the httr library:

    library(httr)
    GET(url1)
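Note that `GET()` returns a response object rather than a string, so the page text still has to be extracted. A minimal sketch, assuming the page is UTF-8 encoded (an assumption, not something checked here); `fetch_html` is a hypothetical helper name. As far as I know, httr follows redirects by default, which would be why no extra option is needed.

```r
library(httr)

# Hypothetical helper: fetch a page with httr and return its body as text.
fetch_html <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)                           # raise an error on 4xx/5xx
  content(resp, as = "text", encoding = "UTF-8")  # body as a character string
}

# Usage (needs network access):
# html <- fetch_html(url1)
# nchar(html)
```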

I'm not exactly sure why getURL isn't working on that content, but htmlParse from the XML package seems to get the content okay.

Try this:

    library(XML)
    htmlParse(url1)
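Since the goal is to run linguistic tests on the text, the parsed document can then be queried with XPath. This sketch parses an inline string so that it runs without network access; the markup is made up for illustration, and the `//p` query is an assumption about where the article text lives on the real page.

```r
library(XML)

# Stand-in for a downloaded page; this markup is invented for the example.
html <- "<html><body><h1>Title</h1><p>First paragraph.</p><p>Second one.</p></body></html>"

doc <- htmlParse(html, asText = TRUE)            # parse a string, not a URL
paragraphs <- xpathSApply(doc, "//p", xmlValue)  # text content of every <p> node
paragraphs
```

The same `htmlParse(..., asText = TRUE)` call should also work on a string returned by `getURL()` once the download itself succeeds.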
