
Problems scraping web page in R

I'm trying to scrape a specific location on a web page, using XPath to find it. The node seems to be "hidden": other parts of the page are easily reachable, but this section returns NULL.

I've tried several packages, but I'm not an expert on the subject, so I can't really assess what's going on or whether there is a way to solve it.

This is what I've tried:

require("XML")
require("scrapeR")
require("httr")

url <- "http://www.claro.com.ar/portal/ar/pc/personas/movil/eq-new/?eq=537"
xp <- '//*[@id="dv_MainContainerEquiposResumen"]/div[1]/h1'

page <- scrape(url)
xpathApply(page[[1]], xp, xmlValue)
# NULL

url.get = GET(url)
xpathSApply(content(url.get), xp)
# NULL

webpage = getURL(url)
doc = htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
xpathSApply(doc, xp)
# NULL

You can scrape the page using Selenium and the RSelenium package:

url <- "http://www.claro.com.ar/portal/ar/pc/personas/movil/eq-new/?eq=537"
xp <- '//*[@id="dv_MainContainerEquiposResumen"]/div[1]/h1'
require(RSelenium)
# Note: startServer() is defunct in recent RSelenium releases, where rsDriver() replaces it
RSelenium::startServer()
remDr <- remoteDriver()
remDr$open()
remDr$navigate(url)
webElem <- remDr$findElement(value = xp)
webElem$getElementAttribute("outerHTML")[[1]]
# [1] "<h1>Samsung Galaxy Core</h1>"
webElem$getElementAttribute("innerHTML")[[1]]
# [1] "Samsung Galaxy Core"
remDr$close()
remDr$closeServer()
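
Because the heading is filled in by a script after the page loads, the element may not exist yet at the moment findElement is called. A minimal sketch of a polling helper follows; wait_for and its arguments are hypothetical names introduced here for illustration and are not part of RSelenium.

```r
# Poll `find` until it returns a non-empty result or `timeout` seconds elapse.
# `find` is any zero-argument function; `wait_for` is a hypothetical helper.
wait_for <- function(find, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- find()
    if (length(result) > 0) return(result)   # found something: stop polling
    if (Sys.time() >= deadline) return(NULL) # gave up: nothing appeared in time
    Sys.sleep(interval)
  }
}

# Assumed usage with the session above (findElements returns an empty list
# when nothing matches, instead of raising an error):
# elems <- wait_for(function() remDr$findElements(using = "xpath", value = xp))
# if (!is.null(elems)) elems[[1]]$getElementText()[[1]]
```

The helper returns NULL on timeout, so the caller can tell "not rendered yet" apart from a successful match.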

That part of the page appears to be added later via JavaScript. It does not exist in the page source, and I don't think scrapeR evaluates JavaScript.

The data appears to come from an AJAX call to http://www.claro.com.ar/portal/ar/ceq/js/ceq.js?ver=1.0.0 (the server may be looking at the referer to decide what data to send).

It appears that this will work to get that data:
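
One way to check this is to run the same XPath against the raw HTML text and see whether it matches anything. The sketch below uses two made-up HTML strings as stand-ins for the static source and the rendered page; xpath_in_source and both strings are assumptions for illustration, not the real content of the site.

```r
library(XML)

# Returns TRUE if the XPath matches at least one node in the given HTML text.
# `xpath_in_source` is a hypothetical helper name.
xpath_in_source <- function(html_text, xpath) {
  doc <- htmlParse(html_text, asText = TRUE)
  length(getNodeSet(doc, xpath)) > 0
}

xp <- '//*[@id="dv_MainContainerEquiposResumen"]/div[1]/h1'

# Made-up stand-in for the source as downloaded: the container is empty.
static_html <- '<html><body><div id="dv_MainContainerEquiposResumen"></div></body></html>'

# Made-up stand-in for the page after the script has filled the container.
rendered_html <- '<html><body><div id="dv_MainContainerEquiposResumen"><div><h1>Samsung Galaxy Core</h1></div></div></body></html>'

xpath_in_source(static_html, xp)
# [1] FALSE
xpath_in_source(rendered_html, xp)
# [1] TRUE
```

Against the real page, passing the result of RCurl::getURL(url) as html_text would show whether the node is present in what the server actually sends.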

library(RCurl)
getURL("http://www.claro.com.ar/portal/ar/ceq/js/ceq.js?ver=1.0.0",
    .opts=curlOptions(referer="http://www.claro.com.ar/portal/ar/pc/personas/movil/eq-new/?eq=537"))
