Scraping https website using getURL
I had a nice little package to scrape Google Ngram data, but I have discovered that they have switched to SSL and my package has broken. Switching from readLines to getURL gets me some of the way there, but some of the script included in the page is missing. Do I need to get fancy with user agents or something?
Here is what I have tried so far (pretty basic):
library(RCurl)
myurl <- "https://books.google.com/ngrams/graph?content=hacker&year_start=1950&year_end=2000"
getURL(myurl)
Comparing the results with the page source shown after entering the URL in a browser shows that the crucial content is missing from the results returned to R. In the browser, the source includes content that looks like this:
<script type="text/javascript">
var data = [{"ngram": "hacker", "type": "NGRAM", "timeseries": [9.4930387994907051e-09,
1.1685493106483591e-08, 1.0784501440023556e-08, 1.0108472218003532e-08,
etc.
Any suggestions would be greatly appreciated!
Sorry, not a direct solution, but it doesn't seem to be a user-agent problem. When you open your URL in a browser, you can see that there is a redirection that adds a parameter at the end of the address: direct_url=t1%3B%2Chacker%3B%2Cc0
If you use getURL() to download this new URL, complete with the new parameter, then the JavaScript you mention is present in the result.
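As a rough sketch of that idea, the snippet below builds the redirected URL by hand and passes it to getURL(). The helper names ngram_url and fetch_ngram_page are hypothetical, and the direct_url pattern is inferred only from the single "hacker" example above, so it is an assumption that it holds for other single-word queries; multi-word or otherwise special queries would probably need different encoding.

```r
# Hypothetical helper: appends the direct_url parameter seen in the browser redirect.
# ASSUMPTION: the "t1%3B%2C<term>%3B%2Cc0" pattern is inferred from one example
# and is only expected to work for a single plain-word query.
ngram_url <- function(term, year_start = 1950, year_end = 2000) {
  paste0(
    "https://books.google.com/ngrams/graph",
    "?content=", term,
    "&year_start=", year_start,
    "&year_end=", year_end,
    "&direct_url=t1%3B%2C", term, "%3B%2Cc0"
  )
}

# Hypothetical wrapper: downloads the page source for a term.
# Requires the RCurl package and network access.
fetch_ngram_page <- function(term, ...) {
  RCurl::getURL(ngram_url(term, ...))
}

# Example (not run here):
# page <- fetch_ngram_page("hacker")
# grepl("var data", page, fixed = TRUE)  # check whether the script block is present
```

Building the URL in a separate function keeps the assumed parameter format in one place, so it is easy to adjust if the site changes the redirect.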
Another solution could be to try to access the data via Google BigQuery, as mentioned in this SO question: