Web scraping with R: I want to extract some table-like data from a website

Question

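I'm having some trouble scraping data from a website, and I don't have much experience with web scraping. My plan is to use R to scrape some data from the following site: https://www.shipserv.com/supplier/profile/s/ww-grainger-inc-59787/brands

More precisely, I want to extract the brands listed on the right-hand side.

What I have tried so far: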

library(rvest)

brands <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>%
  html_nodes(xpath = '/html/body/div[1]/div/div[2]/div[2]/div[2]/div[4]/div/div/div[3]/div/div[1]/div') %>%
  html_text()

But this does not return the information I expected. Any help would be much appreciated. Thanks!

Answer 1

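The data is pulled in dynamically from a script tag, which is why the XPath into the rendered layout finds nothing. You can extract the contents of that script tag and parse it as JSON. Then subset the returned list to just the items of interest and extract the brand names: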

library(rvest)
library(jsonlite)
library(stringr)

# The data sits as JSON inside the element with id "__NEXT_DATA__"
data <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>% 
  html_node('#__NEXT_DATA__') %>% html_text() %>% 
  jsonlite::parse_json()

data <- data$props$pageProps$apolloState
# Keep only the entries whose key starts with "Brand:"
# (str_detect is vectorised, so purrr::map is not needed here)
mask <- str_detect(names(data), '^Brand:')
data <- subset(data, mask)
brands <- lapply(data, function(x){x$name})
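
To try the approach without hitting the live site, here is a minimal offline sketch. The HTML string is a hypothetical, heavily simplified stand-in for the real page: the brand and supplier names are made up, and only the props > pageProps > apolloState path and the "Brand:" key prefix are taken from the code above. It also collapses the result into a plain character vector, which may be more convenient than a named list.

```r
library(rvest)
library(jsonlite)

# Hypothetical stand-in for the page; the real JSON is far larger.
html <- '<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props":{"pageProps":{"apolloState":{
  "Brand:1":{"name":"Acme"},
  "Brand:2":{"name":"Globex"},
  "Supplier:59787":{"name":"Example Supplier"}
}}}}
</script>
</body></html>'

# Pull the JSON text out of the script tag and parse it into nested lists
state <- read_html(html) %>%
  html_node('#__NEXT_DATA__') %>%
  html_text() %>%
  parse_json()
state <- state$props$pageProps$apolloState

# Keep the "Brand:" entries and collect their names as a character vector
is_brand <- grepl('^Brand:', names(state))
brands <- unname(vapply(state[is_brand], function(x) x$name, character(1)))
print(brands)
```

Using vapply with character(1) makes the extraction fail loudly if an entry has no single string name, which seems safer than lapply when the structure of the JSON is not guaranteed.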

I find the step-by-step version above easier to read, but you could try other approaches, such as a single pipeline:

library(rvest)
library(jsonlite)
library(stringr)

brands <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>% 
  html_node('#__NEXT_DATA__') %>% html_text() %>% 
  jsonlite::parse_json() %>% 
  # Braces let `.` be used freely inside an expression
  {.$props$pageProps$apolloState} %>% 
  # Keep only the entries whose key starts with "Brand:"
  subset(., {str_detect(names(.), '^Brand:')}) %>% 
  lapply(., function(x){x$name})

Wrapping a call in {} so that it is treated as an expression rather than a function is something I picked up from a comment by @asachet.
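
As a small illustration of what the braces change, the sketch below uses made-up values (x and the numeric vector are hypothetical, not from the page). Without braces, magrittr still inserts the left-hand side as the first argument when `.` only appears nested inside another call; with braces, the block is evaluated as a plain expression with `.` bound to the left-hand side.

```r
library(magrittr)

x <- list(props = list(a = 1, b = 2))

# Braces: the block is an expression, so `.` can be indexed directly
inner <- x %>% {.$props}

# No braces: `.` is nested inside length(), so the vector is ALSO passed
# as the first argument, i.e. paste(c(1, 2, 3), length(c(1, 2, 3)))
without_braces <- c(1, 2, 3) %>% paste(length(.))

# Braces: only the expression is evaluated, i.e. paste(length(c(1, 2, 3)))
with_braces <- c(1, 2, 3) %>% {paste(length(.))}

print(without_braces)
print(with_braces)
```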

Solution 1
0 Accepted 2021-03-17 21:42:00
