Scraping an https website using getURL

Question

I had a nice little package to scrape Google Ngram data, but I have discovered that Google has switched to SSL and my package has broken. Switching from readLines to getURL gets me some of the way there, but some of the script included in the page is missing. Do I need to get fancy with user agents or something?

Here is what I have tried so far (pretty basic):

library(RCurl)
myurl <- "https://books.google.com/ngrams/graph?content=hacker&year_start=1950&year_end=2000"
getURL(myurl)

Comparing the results with the page source shown after entering the URL in a browser shows that the crucial content is missing from the results returned to R. In the browser, the source includes content that looks like this:

<script type="text/javascript">
 var data = [{"ngram": "hacker", "type": "NGRAM", "timeseries": [9.4930387994907051e-09,
  1.1685493106483591e-08, 1.0784501440023556e-08, 1.0108472218003532e-08,

etc.

Any suggestions would be greatly appreciated!

Answer 1

Sorry, not a direct solution, but it doesn't seem to be a user-agent problem. When you open your URL in a browser, you can see that there is a redirection that adds a parameter at the end of the address: direct_url=t1%3B%2Chacker%3B%2Cc0.

If you use getURL() to download this new URL, complete with the new parameter, then the JavaScript you mention is present in the result.
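A minimal sketch of that idea, in case it helps. The helper names (ngram_url, fetch_ngram_page, extract_ngram_json) are made up for illustration. The direct_url pattern "t1;,<word>;,c0" is only inferred from the single "hacker" example above, and the assumption that the script assignment ends with "];" is a guess, since the snippet in the question is truncated.

```r
# Build the Ngram URL including the direct_url parameter that the browser
# redirect appends. The "t1;,<word>;,c0" pattern is an assumption inferred
# from the one example in this thread; it may not hold for other queries.
ngram_url <- function(word, year_start = 1950, year_end = 2000) {
  direct <- URLencode(paste0("t1;,", word, ";,c0"), reserved = TRUE)
  paste0("https://books.google.com/ngrams/graph",
         "?content=", URLencode(word, reserved = TRUE),
         "&year_start=", year_start,
         "&year_end=", year_end,
         "&direct_url=", direct)
}

# Download the page with RCurl (needs the RCurl package and network access).
fetch_ngram_page <- function(word, ...) {
  RCurl::getURL(ngram_url(word, ...))
}

# Pull the array assigned to "var data" out of the page source.
# Assumes the assignment is terminated by "];" (not confirmed by the
# truncated snippet in the question). Returns NA if nothing matches.
extract_ngram_json <- function(html) {
  # (?s) lets "." match newlines, because the array spans several lines.
  hit <- regmatches(html, regexpr("(?s)var data = \\[.*?\\];", html, perl = TRUE))
  if (length(hit) == 0) return(NA_character_)
  sub("^var data = ", "", sub(";$", "", hit))
}

# Example usage (commented out because it needs network access):
# html <- fetch_ngram_page("hacker")
# json <- extract_ngram_json(html)
```

The extracted string should then be parseable with whichever JSON package you prefer, provided the page layout still matches these assumptions.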

Another solution could be to try to access the data via Google BigQuery, as mentioned in this SO question:

Google N-Gram Web API

Answer 1: accepted, 2013-10-19 14:12:48
