I am fairly new to web scraping and am trying to build a scraper in R that accesses information in a website's source code/HTML.
Specifically, I want to determine whether a website (or a number of websites) has an element whose id contains a certain text: "google_ads_iframe". The full id will always be longer than this, so I think I need some kind of wildcard/partial match.
I have tried several options (see below), but so far nothing has worked.
1st method:
doc <- htmlTreeParse("http://www.funda.nl/")
data <- xpathSApply(doc, "//div[contains(@id, 'google_ads_iframe')]", xmlValue, trim = TRUE)
Error message reads:
Error in UseMethod("xpathApply") :
no applicable method for 'xpathApply' applied to an object of class "XMLDocumentContent"
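(Side note on this error: htmlTreeParse() returns an R-level tree of class "XMLDocumentContent" by default, which xpathSApply() can't work with; passing useInternalNodes = TRUE gives an internal C-level document that it can. A sketch of that fix, assuming the XML package:)

```r
library(XML)

# useInternalNodes = TRUE makes htmlTreeParse() return an internal
# (C-level) document that xpathSApply() accepts, instead of the default
# R-level "XMLDocumentContent" tree that triggers the error above
doc <- htmlTreeParse("http://www.funda.nl/", useInternalNodes = TRUE)
data <- xpathSApply(doc, "//div[contains(@id, 'google_ads_iframe')]", xmlValue, trim = TRUE)
```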
2nd method:
scrapestuff <- scrape(url = "http://www.funda.nl/", parse = T, headers = T)
x <- xpathSApply(scrapestuff[[1]],"//div[contains(@class, 'google_ads_iframe')]",xmlValue)
x returns as an empty list.
3rd method:
scrapestuff <- read_html("http://www.funda.nl/")
hh <- htmlParse(scrapestuff, asText=T)
x <- xpathSApply(hh,"//div[contains(@id, 'google_ads_iframe')]",xmlValue)
Again, x is returned as an empty list.
I can't figure out what I am doing wrong, so any help would be really great!
My ad blocker is probably preventing me from seeing the Google ads iframes, but you don't have to waste cycles with additional R functions to test for the presence of something. Let the optimized C functions in libxml2 (which underpins the rvest and xml2 packages) do the work for you and just wrap your XPath in boolean():
library(xml2)
pg <- read_html("http://www.funda.nl/")
xml_find_lgl(pg, "boolean(.//div[contains(@class, 'featured')])")
## [1] TRUE
xml_find_lgl(pg, "boolean(.//div[contains(@class, 'futured')])")
## [1] FALSE
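For a self-contained illustration (using a made-up snippet rather than funda.nl), the same boolean() wrapper works on any parsed document, including a contains(@id, ...) partial match:

```r
library(xml2)

# Made-up HTML snippet for illustration only
snippet <- '<html><body><div id="google_ads_iframe_demo_0"></div></body></html>'
pg <- read_html(snippet)

# TRUE: there is a div whose id contains the target text
xml_find_lgl(pg, "boolean(.//div[contains(@id, 'google_ads_iframe')])")
## [1] TRUE

# FALSE: no div id contains this text
xml_find_lgl(pg, "boolean(.//div[contains(@id, 'something_else')])")
## [1] FALSE
```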
One other issue you'll need to deal with is that the Google ads iframes are most likely generated after page load by JavaScript, which means you'll need RSelenium to grab the page source (you can then apply this method to the resulting source).
UPDATE
I found an example page with google_ads_iframe in it:
pg <- read_html("http://codepen.io/anon/pen/Jtizx.html")
xml_find_lgl(pg, "boolean(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] TRUE
xml_find_num(pg, "count(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] 3
That's a rendered page, though, and I suspect you'll still need to use RSelenium to do the page grabbing. Here's how to do that (if you're on a reasonable operating system and have phantomjs installed, otherwise use it with Firefox):
library(RSelenium)
RSelenium::startServer()
phantom_js <- phantom(pjs_cmd='/usr/local/bin/phantomjs', extras=c("--ssl-protocol=any"))
capabilities <- list(phantomjs.page.settings.userAgent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.3")
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities=capabilities)
remDr$open()
remDr$navigate(URL)
raw_html <- remDr$getPageSource()[[1]]
pg <- read_html(raw_html)
...
# eventually (when done)
phantom_js$stop()
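Putting the pieces together, here's a small convenience wrapper (the function name is mine, purely for illustration): it takes raw HTML text — e.g. the value of remDr$getPageSource()[[1]] — and reports whether any element's id contains the target text:

```r
library(xml2)

# Hypothetical helper: given raw HTML text (for example the string returned
# by remDr$getPageSource()[[1]]), report whether any element has an id
# containing 'google_ads_iframe'
has_google_ads_iframe <- function(raw_html) {
  pg <- read_html(raw_html)
  xml_find_lgl(pg, "boolean(.//*[contains(@id, 'google_ads_iframe')])")
}

has_google_ads_iframe('<iframe id="google_ads_iframe_/16833175/SmallPS_0"></iframe>')
## [1] TRUE
```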
NOTE
The more involved XPath I used with the codepen example was necessary because of how that page is structured. Here's the snippet where the iframe exists:
<div id="div-gpt-ad-1379506098645-3" style="width:720px;margin-left:auto;margin-right:auto;display:none;">
<script type="text/javascript">
googletag.cmd.push(function() { googletag.display('div-gpt-ad-1379506098645-3'); });
</script>
<iframe id="google_ads_iframe_/16833175/SmallPS_0" name="google_ads_iframe_/16833175/SmallPS_0" width="723" height="170" scrolling="no" marginwidth="0" marginheight="0" frameborder="0" src="javascript:"<html><body style='background:transparent'></body></html>"" style="border: 0px; vertical-align: bottom;"></iframe></div>
The iframe tag is a child of the div, so if you want to target the div first, you then have to add the child target if you want to find an attribute in it.
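To make that concrete (a minimal offline sketch using a trimmed copy of that snippet), you can match the iframe directly, or match the div by way of its iframe child — both succeed here:

```r
library(xml2)

# Trimmed copy of the snippet above, for illustration
snippet <- '<div id="div-gpt-ad-1379506098645-3">
  <iframe id="google_ads_iframe_/16833175/SmallPS_0"></iframe>
</div>'
pg <- read_html(snippet)

# Target the iframe itself...
xml_find_lgl(pg, "boolean(.//iframe[contains(@id, 'google_ads_iframe')])")
## [1] TRUE

# ...or target the div via its iframe child
xml_find_lgl(pg, "boolean(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] TRUE
```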