RSelenium web scraping: applying it as a function

Question

I have been trying to solve this all day and cannot find a solution. Please help! To learn web scraping, I have been practicing on this website:

https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi

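The goal is to scrape the price of every product. Thanks to the resources on this site and from other internet users, I got this code working well: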

option <- remDr$findElement(using = 'xpath', "//*/option[@value = 'view_all']")
option$clickElement()
priceNodes <- remDr$findElements(using = 'css selector', ".price")
price<-unlist(lapply(priceNodes, function(x){x$getElementText()}))
price<-gsub("€","",price)
price<-gsub(",","",price)
price <- as.numeric(price)

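This gives me the result I want: a list of 204 values (prices). Now I would like to turn the whole process into a function so that I can apply it to a list of addresses (in this case, other brands). Obviously, it does not work: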

FPrice <- function(x) {
  url1 <- x
  remDr <- rD$client
  remDr$navigate(url1)
  iframe <- remDr$findElement("css", value=".view-more-less")
  option <- remDr$findElement(using = 'xpath', "//*/option[@value = 'view_all']")
  option$clickElement()
  priceNodes <- remDr$findElements(using = 'css selector', ".price")
  price<-unlist(lapply(priceNodes, function(x){x$getElementText()}))
  }

When I apply it like this:

    FPrice("https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi")

I get an error message instead of the data I need:

Selenium message:stale element reference: element is not attached to the page document
      (Session info: chrome=61.0.3163.100)
      (Driver info: chromedriver=2.33.506106 (8a06c39c4582fbfbab6966dbb1c38a9173bfb1a2),platform=Mac OS X 10.12.6 x86_64)

I think it is because there is a function inside the function... Can anyone help me solve this? Thank you.

PS: Using rvest, I wrote another piece of code:

Price <- function(x) {
  url1 <- x
webpage <- read_html(url1)
price_data_html <- html_nodes(webpage,".price")
price_data <- html_text(price_data_html)
price_data<-gsub("€","",price_data)
price_data<-gsub(",","",price_data)
price_data <- as.numeric(price_data)
return(price_data)
}

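And it works well. I even applied it to a vector containing a list of addresses. With rvest, however, I cannot drive the browser so that it selects the "show all" option. So I only get 60 observations, whereas some brands, such as Fendi, offer more than 200 products.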

Thank you very much for your patience. I hope to hear from you soon!

Answer 1

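Surprisingly (I verified this), the site does not explicitly prohibit scraping in its Terms & Conditions, and they left the /fr/fr path out of their robots.txt exclusions. In other words, you got lucky. This is likely an oversight on their part.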

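However, there is a non-Selenium approach. The main page loads the product <div>s via XHR calls; inspecting the "Network" tab of the browser's Developer Tools reveals that request, and you can then scrape either page by page or all at once. These are the required packages: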

library(httr)
library(rvest)
library(xml2)  # provides xml_integer(); recent rvest versions no longer attach xml2
library(purrr)

For the page-by-page approach, we set up a function:

get_prices_on_page <- function(pg_num = 1) {

  Sys.sleep(5) # be kind 

  GET(
    url = "https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi",
    query = list(
      view = "jsp",
      sale = "0",
      exclude = TRUE,
      pn = pg_num,
      npp=60,
      image_view = "product",
      dScroll = "0"
    )
  ) -> res

  pg <- content(res, as="parsed")

  list(
    total_pgs = html_node(pg, "div.data_totalPages") %>% xml_integer(),
    total_items = html_node(pg, "div.data_totalItems") %>% xml_integer(),
    prices_on_page = html_nodes(pg, "span.price") %>% 
      html_text() %>% 
      gsub("[^[:digit:]]", "", .) %>% 
      as.numeric()
  )

}
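
The price-cleaning step inside the function can be tried on its own. This is a minimal sketch in base R; `clean_prices` is a hypothetical helper name and the input strings are made up for illustration, not taken from the site.

```r
# Hypothetical helper: strip everything that is not a digit, then convert to numeric.
clean_prices <- function(x) {
  as.numeric(gsub("[^[:digit:]]", "", x))
}

# Made-up sample strings in a few plausible formats
clean_prices(c("€1,190", "1 700 €", "€601"))
```

Because every non-digit character is removed, this only suits whole-euro prices: a string such as "12,50" would come out as 1250. If a listing shows cents, the decimal separator needs to be handled before the conversion.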

Then get the first page:

prices <- get_prices_on_page(1)

Then iterate until done, combining everything:

c(prices$prices_on_page, map(2:prices$total_pgs, get_prices_on_page) %>%
  map("prices_on_page") %>% 
  flatten_dbl()) -> all_prices

all_prices
##   [1]   601  1190  1700  1480  1300   590   950  1590  3200   410   950   595  1100   690
##  [15]   900   780  2200   790  1300   410  1000  1480   750   495   850   850   900   450
##  [29]  1600  1750  2200   750   750  1550   750   850  1900  1190  1200  1650  2500   580
##  [43]  2000  2700  3900  1900   600  1200   650   950   600   800  1100  1200  1000  1100
##  [57]  2500  1000   500  1645   550  1505   850  1505   850  2000   400   790   950   800
##  [71]   500  2000   500  1300   350   550   290   550   450  2700  2200   650   250   200
##  [85]  1700   250   250   300   450   800   800   800   900   600   900   375  5500  6400
##  [99]  1450  3300  2350  1390  2700  1500  1790  2200  3500  3100  1390  1850  5000  1690
## [113]  2700  4800  3500  6200  3100  1850  1950  3500  1780  2000  1550  1280  3200  1350
## [127]  2700  1350  1980  3900  1580 18500  1850  1550  1450  1600  1780  1300  1980  1450
## [141]  1320  1460   850  1650   290   190   520   190  1350   290   850   900   480   450
## [155]   850   780  1850   750   450  1100  1550   550   495   850   890   850   590   595
## [169]   650   650   495   595   330   480   400   220   130   130   290   130   250   230
## [183]   210   900   380   340   430   380   370   390   460   255   300   480   550   410
## [197]   350   350   280   190   350   550   450   430
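
One thing to watch when reusing the loop for other designers: in R, `2:1` counts backwards, so `2:prices$total_pgs` would request pages 2 and 1 if a listing has only one page. The sketch below shows one way to guard against that in base R. `fake_page` is a stub that returns invented data so the example runs offline, and `collect_prices` is a hypothetical name; neither is part of the code above.

```r
# Stub standing in for get_prices_on_page(); returns invented data so the sketch runs offline.
fake_page <- function(pg_num, total_pgs) {
  list(total_pgs = total_pgs, prices_on_page = c(100, 200) * pg_num)
}

# Fetch page 1, then only the pages after it (none when total_pgs is 1).
collect_prices <- function(fetch_page) {
  first <- fetch_page(1)
  remaining <- seq_len(first$total_pgs)[-1]  # integer(0) when there is a single page
  rest <- lapply(remaining, function(i) fetch_page(i)$prices_on_page)
  c(first$prices_on_page, unlist(rest))
}

collect_prices(function(i) fake_page(i, total_pgs = 1))
collect_prices(function(i) fake_page(i, total_pgs = 3))
```

With the real function, the call would be `collect_prices(get_prices_on_page)`.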

Alternatively, we can use the site's "view all on one page" feature and get them all in one go:

pg <- read_html("https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi?view=jsp&sale=0&exclude=true&pn=1&npp=view_all&image_view=product&dScroll=0")
html_nodes(pg, "span.price") %>% 
  html_text() %>% 
  gsub("[^[:digit:]]", "", .) %>% 
  as.numeric() -> all_prices

all_prices
# same result as above
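
Since the question asks about applying this to a list of addresses, here is a rough sketch of how the single-page approach might be wrapped in a function. It assumes other designer listings accept the same query parameters as the Fendi page, which I have not checked. `view_all_url` and `scrape_prices` are hypothetical helper names, and `scrape_prices` is only defined here, not called, because it needs rvest and network access.

```r
# Hypothetical helper: append the "view all" query string shown above to a listing URL.
view_all_url <- function(base_url) {
  params <- c(view = "jsp", sale = "0", exclude = "true", pn = "1",
              npp = "view_all", image_view = "product", dScroll = "0")
  paste0(base_url, "?", paste(names(params), params, sep = "=", collapse = "&"))
}

# Hypothetical helper: fetch one listing and return its prices as numbers.
# Needs rvest and network access, so it is defined but not called in this sketch.
scrape_prices <- function(base_url, delay = 5) {
  Sys.sleep(delay)  # keep a crawl delay between requests
  pg <- rvest::read_html(view_all_url(base_url))
  price_text <- rvest::html_text(rvest::html_nodes(pg, "span.price"))
  as.numeric(gsub("[^[:digit:]]", "", price_text))
}

designer_urls <- c(
  "https://www.net-a-porter.com/fr/fr/Shop/Designers/Fendi"
  # add further designer listing URLs here
)

# Offline check of the URL that would be requested
view_all_url(designer_urls[1])

# all_prices <- lapply(designer_urls, scrape_prices)  # one numeric vector per designer
```

Returning the prices explicitly from the function also avoids relying on the value of the last assignment, as the original `FPrice` did.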

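If you use the paginated approach, please keep the crawl delay in place and do not misuse the content. While they do not prohibit scraping, the T&C says the site is for personal product-selection use only.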

Answer 1 posted 2017-10-21 17:03:49