R getURL() returning empty string
Sorry about the title, but I couldn't think how to phrase this one.
I am trying to scrape webpages for a study; they will eventually be subjected to a battery of linguistic tests.
In the meantime...
require(RCurl)
url1 <- "http://www.coindesk.com/bitinstants-charlie-shrem-sees-bitcoin-battles-ahead"
url2 <- "http://www.coindesk.com/terms-conditions/"
html <- getURL(url1) # read in page contents
html
[1] ""
html <- getURL(url2) # read in page contents
html
[1] "<!DOCTYPE html>\r\n<!--[if lt IE 7]> <html class=\"no-js ie ie6 oldie\" lang=\"en\"> <![endif]-->\r\n<!--[if IE 7]> <html class=\"no-js ie ie7 oldie\" lang=\"en\"> <![endif]-->\r\n<!--[if IE 8]>......."
So given two URLs, each for a different page on the same website, the request for url1 returns an empty string. But url2 works just fine.
I've tried adding a browser user agent, like this:
html <- getURL(url1, .opts=list(useragent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13")) # read in page contents
but that makes no difference; it is still an empty string.
I'm only on day two of learning R and now I AM STUMPED!
Can anyone suggest a reason why this is happening, or a solution?
To get this to work with RCurl, you need to use
getURL(url1, .opts=curlOptions(followlocation = TRUE))
I wish I could tell you why. When looking at the requests in Chrome I don't see any redirects, but maybe I'm missing something.
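One way to check whether a redirect is involved is to ask libcurl what it received, rather than relying on the browser. The sketch below is a guess at a diagnosis, not a confirmed one: inspect_fetch is a hypothetical helper name, and it reads the status code and final URL from RCurl's getCurlInfo(). Without followlocation, a 301 or 302 response often arrives with an empty body, which would be consistent with the empty string seen here.

```r
library(RCurl)

# Hypothetical helper: fetch a URL and report what libcurl saw.
inspect_fetch <- function(url, follow = FALSE) {
  h <- getCurlHandle()  # keep the handle so the transfer can be queried afterwards
  body <- getURL(url, curl = h, followlocation = follow)
  info <- getCurlInfo(h)
  list(
    status        = info$response.code,          # HTTP status of the last response
    effective_url = info$effective.url,          # last URL libcurl actually requested
    body_bytes    = nchar(body, type = "bytes")  # 0 means an empty body came back
  )
}
```

Calling inspect_fetch(url1) and then inspect_fetch(url1, follow = TRUE) should show whether the status code and the effective URL change once redirects are followed.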
Note that you could also use the httr library
library(httr)
GET(url1)
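GET() returns a response object rather than a string, so getting the page text takes one more step. A minimal sketch, assuming the same url1 as in the question; fetch_html is a hypothetical helper name. As far as I know, httr follows redirects by default, which would explain why no extra option is needed here.

```r
library(httr)

# Hypothetical helper: return the page body as a single character string.
fetch_html <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)  # turn HTTP 4xx/5xx responses into R errors
  content(resp, as = "text", encoding = "UTF-8")
}
```

With this in place, html <- fetch_html(url1) gives a string comparable to what getURL() returns.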
I'm not exactly sure why getURL isn't working on that content, but htmlParse from package XML seems to get the content okay.
Try this:
> library(XML)
> htmlParse(url1)
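Since the goal is linguistic analysis, the parsed document can be queried for text directly. The following is only a sketch: paragraph_text is a hypothetical helper, the choice of p elements is an assumption about where the article text lives, and the demo HTML string is made up so the example runs offline.

```r
library(XML)

# Hypothetical helper: return the text of every <p> element in a parsed page.
paragraph_text <- function(doc) {
  xpathSApply(doc, "//p", xmlValue)
}

# Offline demonstration on an inline HTML string (made-up content).
demo <- htmlParse("<html><body><p>First.</p><p>Second.</p></body></html>",
                  asText = TRUE)
paragraph_text(demo)
```

For the real page, paragraph_text(htmlParse(url1)) would apply the same query; the XPath expression may need adjusting to the site's markup.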