How can I scrape information from a website's source code/HTML using R?

Question

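I'm new to web scraping, and I'm trying to build a scraper in R that reads information from a website's source code/HTML.

Specifically, I want to determine whether one or more websites contain an id that includes a specific piece of text: "google_ads_iframe". The id is always longer than this, so I assume I need some kind of wildcard.

I have tried several options (see below), but nothing has worked so far.

First approach: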

doc <- htmlTreeParse("http://www.funda.nl/") 

data <- xpathSApply(doc, "//div[contains(@id, 'google_ads_iframe')]", xmlValue, trim = TRUE)

The error message reads:

Error in UseMethod("xpathApply") : 
  no applicable method for 'xpathApply' applied to an object of class "XMLDocumentContent"

Second approach:

scrapestuff <- scrape(url = "http://www.funda.nl/", parse = T, headers = T)

x <- xpathSApply(scrapestuff[[1]],"//div[contains(@class, 'google_ads_iframe')]",xmlValue)

x comes back as an empty list.

Third approach:

scrapestuff <- read_html("http://www.funda.nl/")
hh <- htmlParse(scrapestuff, asText=T)
x <- xpathSApply(hh,"//div[contains(@id, 'google_ads_iframe')]",xmlValue)

Again, x comes back as an empty list.

I can't figure out what I'm doing wrong, so any help would be great!

Answer 1

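My ad blocker is probably preventing me from seeing the google ads iframes, but you don't need extra R functions to test whether something exists. Let the optimized C functions in libxml2 (which underpins the rvest and xml2 packages) do the work for you, and wrap your XPath in boolean():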

library(xml2)

pg <- read_html("http://www.funda.nl/")

xml_find_lgl(pg, "boolean(.//div[contains(@class, 'featured')])")
## [1] TRUE

xml_find_lgl(pg, "boolean(.//div[contains(@class, 'futured')])")
## [1] FALSE
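
Since the question asks about checking one or more sites, the same boolean() test can be wrapped in a small helper and applied to several documents. This is only a minimal sketch: `has_id_containing` is a hypothetical helper name, not an xml2 function, and the inline HTML strings are made-up stand-ins for downloaded pages (read_html() accepts literal HTML as well as a URL).

```r
library(xml2)

# Hypothetical helper (not part of xml2): TRUE if any element in `doc`
# has an id attribute containing `fragment`.
# Assumes `fragment` contains no single quotes, since it is pasted into the XPath.
has_id_containing <- function(doc, fragment) {
  xpath <- sprintf("boolean(.//*[contains(@id, '%s')])", fragment)
  xml_find_lgl(doc, xpath)
}

# Made-up inline HTML stands in for real pages so the sketch runs offline.
pages <- list(
  with_ad = read_html('<html><body><div><iframe id="google_ads_iframe_/1/x_0"></iframe></div></body></html>'),
  without_ad = read_html('<html><body><div id="content">Hello</div></body></html>')
)

# One logical value per page, named after the list entries.
result <- vapply(pages, has_id_containing, logical(1), fragment = "google_ads_iframe")
print(result)
```

For live sites, each list entry would presumably come from read_html(url) instead, and the caveat about JavaScript-generated iframes still applies.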

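The other issue you will need to deal with is that the google ads iframes are most likely generated by JavaScript after the page loads, which means using RSelenium to grab the page source (you can then apply this method to the resulting page source).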

UPDATE

I found an example page that contains google_ads_iframe:

pg <- read_html("http://codepen.io/anon/pen/Jtizx.html")

xml_find_lgl(pg, "boolean(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] TRUE

xml_find_num(pg, "count(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] 3
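
Beyond a yes/no answer or a count, the matching ids themselves can be pulled out, which addresses the "wildcard" part of the question. The following is a sketch under stated assumptions: the HTML is an invented offline stand-in (only the first id is taken from the Codepen snippet), and starts-with() is used as a prefix match where contains() would also work.

```r
library(xml2)

# Invented test document; the second and third ids are illustrative only.
html <- '<html><body>
  <div id="a"><iframe id="google_ads_iframe_/16833175/SmallPS_0"></iframe></div>
  <div id="b"><iframe id="google_ads_iframe_/16833175/SmallPS_1"></iframe></div>
  <div id="c"><iframe id="other_frame"></iframe></div>
</body></html>'
doc <- read_html(html)

# Select every iframe whose id begins with the known prefix.
ad_frames <- xml_find_all(doc, ".//iframe[starts-with(@id, 'google_ads_iframe')]")

# Extract the full ids of the matched iframes.
ad_ids <- xml_attr(ad_frames, "id")

# xml_find_num() evaluates an XPath expression that yields a number.
n_ads <- xml_find_num(doc, "count(.//iframe[starts-with(@id, 'google_ads_iframe')])")

print(ad_ids)
print(n_ads)
```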

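That is a rendered page, though, and I suspect you will still need RSelenium to grab the page. Here is how to do that (if you are on a reasonable operating system and have phantomjs installed; otherwise use it with Firefox):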

library(RSelenium)
RSelenium::startServer()
phantom_js <- phantom(pjs_cmd='/usr/local/bin/phantomjs', extras=c("--ssl-protocol=any"))
capabilities <- list(phantomjs.page.settings.userAgent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.3")
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities=capabilities)
remDr$open()

remDr$navigate(URL)
raw_html <- remDr$getPageSource()[[1]]

pg <- read_html(raw_html)
...

# eventually (when done)
phantom_js$stop()

Note

The XPath I used with the Codepen example (since it has the google ads iframes) was necessary. Here is the snippet where the iframe lives:

<div id="div-gpt-ad-1379506098645-3" style="width:720px;margin-left:auto;margin-right:auto;display:none;">
  <script type="text/javascript">
  googletag.cmd.push(function() { googletag.display('div-gpt-ad-1379506098645-3'); });
  </script>
  <iframe id="google_ads_iframe_/16833175/SmallPS_0" name="google_ads_iframe_/16833175/SmallPS_0" width="723" height="170" scrolling="no" marginwidth="0" marginheight="0" frameborder="0" src="javascript:&quot;<html><body style='background:transparent'></body></html>&quot;" style="border: 0px; vertical-align: bottom;"></iframe></div>

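The iframe tag is a child of the div, so if you want to target the div first and then find the attribute inside it, you have to add the child target.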
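
The difference can be checked on a cut-down copy of that snippet. In this sketch (attributes other than id are dropped for brevity, and the variable names are illustrative), the first XPath tests the div's own id, as in the question, while the other two look at the iframe child or at any element:

```r
library(xml2)

# Cut-down version of the snippet above: the matching id is on the iframe.
snippet <- '<div id="div-gpt-ad-1379506098645-3">
  <iframe id="google_ads_iframe_/16833175/SmallPS_0"></iframe>
</div>'
doc <- read_html(snippet)

# Tests the div's own id, which does not contain the text.
div_only <- xml_find_lgl(doc, "boolean(.//div[contains(@id, 'google_ads_iframe')])")

# Requires the div to have an iframe child whose id contains the text.
div_with_child <- xml_find_lgl(doc, "boolean(.//div[iframe[contains(@id, 'google_ads_iframe')]])")

# Matches any element by id, independent of tag name.
any_element <- xml_find_lgl(doc, "boolean(.//*[contains(@id, 'google_ads_iframe')])")

print(c(div_only = div_only, div_with_child = div_with_child, any_element = any_element))
```

Only the div-only test should come back FALSE here, which is consistent with the empty results in the question's attempts that matched on div attributes.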

Answer 1: accepted (2), 2016-08-23 11:18:00
