How to scrape information from a website's source code/HTML using R?

Question

I am pretty new to web scraping, and I am trying to build a scraper in R that reads information from a website's source code/HTML.

Specifically, I want to determine whether one or more websites contain an id that includes a certain text: "google_ads_iframe". The id will always be longer than this, so I think I will have to use a wildcard.

I have tried several options (see below), but so far nothing has worked.

1st method:

doc <- htmlTreeParse("http://www.funda.nl/") 

data <- xpathSApply(doc, "//div[contains(@id, 'google_ads_iframe')]", xmlValue, trim = TRUE)

The error message reads:

Error in UseMethod("xpathApply") : 
  no applicable method for 'xpathApply' applied to an object of class "XMLDocumentContent"

2nd method:

scrapestuff <- scrape(url = "http://www.funda.nl/", parse = T, headers = T)

x <- xpathSApply(scrapestuff[[1]],"//div[contains(@class, 'google_ads_iframe')]",xmlValue)

x is returned as an empty list.

3rd method:

scrapestuff <- read_html("http://www.funda.nl/")
hh <- htmlParse(scrapestuff, asText=T)
x <- xpathSApply(hh,"//div[contains(@id, 'google_ads_iframe')]",xmlValue)

Again, x is returned as an empty list.

I can't figure out what I am doing wrong, so any help would be really great!

Answer 1

My ad blocker is probably preventing me from seeing Google ads iframes, but you don't have to waste cycles on additional R functions to test for the presence of something. Let the optimized C functions in libxml2 (which underpins rvest and the xml2 package) do the work for you and just wrap your XPath with boolean():

library(xml2)

pg <- read_html("http://www.funda.nl/")

xml_find_lgl(pg, "boolean(.//div[contains(@class, 'featured')])")
## [1] TRUE

xml_find_lgl(pg, "boolean(.//div[contains(@class, 'futured')])")
## [1] FALSE

One other issue you'll need to deal with is that the Google ads iframes are most likely generated with JavaScript after the page loads. Since read_html() only sees the HTML the server sends, that means using RSelenium to grab the rendered page source (you can then use this method on the resulting page source).
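
To make the static-versus-rendered point concrete, here is a minimal sketch that wraps the boolean() check in a function and applies it to several documents. The helper name has_id_containing and the two inline HTML strings are made up for illustration (they stand in for a page as served and the same page after JavaScript has run); in practice you would pass page source obtained from read_html() or RSelenium.

```r
library(xml2)

# Hypothetical helper (not part of xml2): TRUE if any element in the
# document has an id attribute containing `pattern`.
has_id_containing <- function(html, pattern = "google_ads_iframe") {
  doc <- read_html(html)
  xpath <- sprintf("boolean(//*[contains(@id, '%s')])", pattern)
  xml_find_lgl(doc, xpath)
}

# Illustrative inline pages: "static" mimics the HTML as served,
# "rendered" mimics the DOM after the ad script has inserted its iframe.
pages <- c(
  static   = "<html><body><div id='div-gpt-ad-1'><script>/* ad loader */</script></div></body></html>",
  rendered = "<html><body><div id='div-gpt-ad-1'><iframe id='google_ads_iframe_/1/Slot_0'></iframe></div></body></html>"
)

# One logical per page, named after the entries of `pages`
result <- vapply(pages, has_id_containing, logical(1))
print(result)
```

The check uses //* rather than //div so that it matches the id wherever it appears; narrow it to a specific element if you only care about one tag.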

UPDATE

I found an example page with google_ads_iframe in it:

pg <- read_html("http://codepen.io/anon/pen/Jtizx.html")

xml_find_lgl(pg, "boolean(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] TRUE

xml_find_num(pg, "count(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] 3

That's a rendered page, though, and I suspect you'll still need to use RSelenium to do the page grabbing. Here's how to do that (if you're on a reasonable operating system and have phantomjs installed; otherwise use it with Firefox):

library(RSelenium)
RSelenium::startServer()
phantom_js <- phantom(pjs_cmd='/usr/local/bin/phantomjs', extras=c("--ssl-protocol=any"))
capabilities <- list(phantomjs.page.settings.userAgent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.3")
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities=capabilities)
remDr$open()

remDr$navigate(URL)
raw_html <- remDr$getPageSource()[[1]]

pg <- read_html(raw_html)
...

# eventually (when done)
phantom_js$stop()

NOTE

The XPath I used with the codepen example (since it has a Google ads iframe) was necessary. Here's the snippet where the iframe exists:

<div id="div-gpt-ad-1379506098645-3" style="width:720px;margin-left:auto;margin-right:auto;display:none;">
  <script type="text/javascript">
  googletag.cmd.push(function() { googletag.display('div-gpt-ad-1379506098645-3'); });
  </script>
  <iframe id="google_ads_iframe_/16833175/SmallPS_0" name="google_ads_iframe_/16833175/SmallPS_0" width="723" height="170" scrolling="no" marginwidth="0" marginheight="0" frameborder="0" src="javascript:&quot;<html><body style='background:transparent'></body></html>&quot;" style="border: 0px; vertical-align: bottom;"></iframe></div>

The iframe tag is a child of the div, so if you want to target the div first, you have to add the child as a nested predicate in order to test an attribute on it.
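
As a rough illustration of that difference, the sketch below runs three XPath variants against a trimmed-down copy of the snippet (most attributes and the script tag are dropped for brevity, which is an assumption made for the example). The variable names are illustrative only.

```r
library(xml2)

# Trimmed copy of the snippet: a div whose child iframe carries the id
snippet <- '<div id="div-gpt-ad-1379506098645-3"><iframe id="google_ads_iframe_/16833175/SmallPS_0"></iframe></div>'
doc <- read_html(snippet)

# Tests the div's own id: no match, because the text is on the iframe
div_by_own_id <- xml_find_all(doc, "//div[contains(@id, 'google_ads_iframe')]")

# Tests the id of an iframe child of the div: matches the div
div_by_child <- xml_find_all(doc, "//div[iframe[contains(@id, 'google_ads_iframe')]]")

# Targets the iframe directly, skipping the div
iframes <- xml_find_all(doc, "//iframe[contains(@id, 'google_ads_iframe')]")

length(div_by_own_id)    # 0
length(div_by_child)     # 1
xml_attr(iframes, "id")  # "google_ads_iframe_/16833175/SmallPS_0"
```

This is likely why the attempts in the question that searched for a div with that id came back empty even where the iframe is present: the id sits on the iframe, not on the div.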

Answer 1: 2 votes, accepted, 2016-08-23 11:18:00