
RSelenium web scraping: apply as a function

I've been trying to resolve this all day and I can't figure out a solution. Please help! To learn web scraping, I've been practicing on this website:

https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi

The goal is to scrape the price of EVERY PRODUCT. So, thanks to the resources on this website and other internet users, I wrote this code, which works perfectly:

# select the "view all" option so that every product is shown on the page
option <- remDr$findElement(using = 'xpath', "//*/option[@value = 'view_all']")
option$clickElement()

# grab every price node, then clean the text into numeric values
priceNodes <- remDr$findElements(using = 'css selector', ".price")
price <- unlist(lapply(priceNodes, function(x) { x$getElementText() }))
price <- gsub("€", "", price)
price <- gsub(",", "", price)
price <- as.numeric(price)
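
For context, this assumes an RSelenium session is already running and has navigated to the Fendi page; a minimal sketch of that setup (the exact setup may differ, rsDriver() with Chrome is only assumed here) would be:

library(RSelenium)

# start a local Selenium server plus a browser, then point it at the page
rD <- rsDriver(browser = "chrome")
remDr <- rD$client
remDr$navigate("https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi")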

With this I get the result that I want, which is a list of 204 values (prices). Now I'd like to turn this entire process into a function, in order to apply it to a list of addresses (in this case, other brands). And obviously it did not work...:

FPrice <- function(x) {
  url1 <- x
  remDr <- rD$client
  remDr$navigate(url1)
  iframe <- remDr$findElement("css", value = ".view-more-less")
  option <- remDr$findElement(using = 'xpath', "//*/option[@value = 'view_all']")
  option$clickElement()
  priceNodes <- remDr$findElements(using = 'css selector', ".price")
  price <- unlist(lapply(priceNodes, function(x) { x$getElementText() }))
}

When I apply it like this:

    FPrice("https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi")

An error message came up and I don't get the data I'm looking for:

Selenium message:stale element reference: element is not attached to the page document
      (Session info: chrome=61.0.3163.100)
      (Driver info: chromedriver=2.33.506106 (8a06c39c4582fbfbab6966dbb1c38a9173bfb1a2),platform=Mac OS X 10.12.6 x86_64)

I think it is because there is a function inside the function... Can anyone please help me resolve the problem? Thanks.

P.S. With rvest I wrote another function:

Price <- function(x) {
  url1 <- x
  webpage <- read_html(url1)
  price_data_html <- html_nodes(webpage, ".price")
  price_data <- html_text(price_data_html)
  price_data <- gsub("€", "", price_data)
  price_data <- gsub(",", "", price_data)
  price_data <- as.numeric(price_data)
  return(price_data)
}

And it worked fine. I even applied it to a vector containing a list of addresses. However, with rvest I cannot configure the browser so that it selects the "show all" option. Thus I only get 60 observations, while some brands offer more than 200 products, as is the case with Fendi.
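
For reference, applying it to a vector of addresses was just something along these lines (the second URL is only a hypothetical example, not a page I checked):

urls <- c("https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi",
          "https://www.net-a-porter.com/fr/fr/Shop/Designers/Gucci")  # second URL is hypothetical
prices_by_brand <- lapply(urls, Price)  # one numeric price vector per brand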

Thank you very much for your patience. Hope to read from you very soon!

Astoundingly (I verified this), the site does not explicitly prevent scraping in its Terms & Conditions, and they left the /fr/fr path out of their robots.txt exclusions, i.e. you got lucky. This is likely an oversight on their part.

However, there is a non-Selenium approach to this. The main page loads the product <div>s via XHR calls, so find that call via the browser Developer Tools "Network" tab and you can scrape away, either page by page or all at once. Here are the required packages:

library(httr)
library(rvest)
library(purrr)
library(xml2)  # for xml_integer()

For the paginated approach, we set up a function:

get_prices_on_page <- function(pg_num = 1) {

  Sys.sleep(5) # be kind

  # fetch one page of results via the same XHR the site itself uses
  GET(
    url = "https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi",
    query = list(
      view = "jsp",
      sale = "0",
      exclude = TRUE,
      pn = pg_num,
      npp = 60,
      image_view = "product",
      dScroll = "0"
    )
  ) -> res

  pg <- content(res, as = "parsed")

  # return the pagination metadata along with the cleaned prices on this page
  list(
    total_pgs = html_node(pg, "div.data_totalPages") %>% xml_integer(),
    total_items = html_node(pg, "div.data_totalItems") %>% xml_integer(),
    prices_on_page = html_nodes(pg, "span.price") %>% 
      html_text() %>% 
      gsub("[^[:digit:]]", "", .) %>% 
      as.numeric()
  )

}

Then get the first page:

prices <- get_prices_on_page(1)

and then iterate over the remaining pages until we're done, smushing everything together:

c(prices$prices_on_page, map(2:prices$total_pgs, get_prices_on_page) %>%
  map("prices_on_page") %>% 
  flatten_dbl()) -> all_prices

all_prices
##   [1]   601  1190  1700  1480  1300   590   950  1590  3200   410   950   595  1100   690
##  [15]   900   780  2200   790  1300   410  1000  1480   750   495   850   850   900   450
##  [29]  1600  1750  2200   750   750  1550   750   850  1900  1190  1200  1650  2500   580
##  [43]  2000  2700  3900  1900   600  1200   650   950   600   800  1100  1200  1000  1100
##  [57]  2500  1000   500  1645   550  1505   850  1505   850  2000   400   790   950   800
##  [71]   500  2000   500  1300   350   550   290   550   450  2700  2200   650   250   200
##  [85]  1700   250   250   300   450   800   800   800   900   600   900   375  5500  6400
##  [99]  1450  3300  2350  1390  2700  1500  1790  2200  3500  3100  1390  1850  5000  1690
## [113]  2700  4800  3500  6200  3100  1850  1950  3500  1780  2000  1550  1280  3200  1350
## [127]  2700  1350  1980  3900  1580 18500  1850  1550  1450  1600  1780  1300  1980  1450
## [141]  1320  1460   850  1650   290   190   520   190  1350   290   850   900   480   450
## [155]   850   780  1850   750   450  1100  1550   550   495   850   890   850   590   595
## [169]   650   650   495   595   330   480   400   220   130   130   290   130   250   230
## [183]   210   900   380   340   430   380   370   390   460   255   300   480   550   410
## [197]   350   350   280   190   350   550   450   430

Or, we can get them all in one fell swoop by using the "view all on one page" feature the site has:

pg <- read_html("https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi?view=jsp&sale=0&exclude=true&pn=1&npp=view_all&image_view=product&dScroll=0")
html_nodes(pg, "span.price") %>% 
  html_text() %>% 
  gsub("[^[:digit:]]", "", .) %>% 
  as.numeric() -> all_prices

all_prices
# same result as above

Please keep the crawl delay in if you use the paginated approach, and please don't misuse the content. While they don't disallow scraping, the T&C says it is for personal, product-choosing use only.
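
To come back to the original goal of covering several brands, here is a sketch of how the one-shot "view_all" request could be wrapped and mapped over designers. The designer slugs below are assumptions, and it is assumed (not verified) that the same query string works on other designer pages:

# Sketch only: the designer slugs are assumptions; keep the crawl delay here too.
get_designer_prices <- function(designer) {
  Sys.sleep(5)
  url <- sprintf(
    "https://www.net-a-porter.com/fr/fr/Shop/Designers/%s?view=jsp&sale=0&exclude=true&pn=1&npp=view_all&image_view=product&dScroll=0",
    designer
  )
  read_html(url) %>%
    html_nodes("span.price") %>%
    html_text() %>%
    gsub("[^[:digit:]]", "", .) %>%
    as.numeric()
}

designers <- c("Fendi", "Gucci", "Prada")   # hypothetical list of brands
prices_by_designer <- set_names(designers) %>% map(get_designer_prices)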
