Scraping data from within <div> tags with R

Question

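I want to scrape product names and ratings from a web page. After inspecting the elements, I know I need to get the data from product__title and attraqt-star-rating-stars__bar. But I don't know how to do this, because the data is embedded in several layers of tags. I tried the following to no avail; any suggestions are welcome.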

library(rvest)
library(dplyr)
url = 'https://www.chemistwarehouse.com.au/shop-online/159/oral-hygiene-and-dental-care'
stores <- read_html(url) 

stores %>% html_nodes('body') %>% 
  html_nodes('.product__title') %>% 
  rvest::html_text()

stores %>% html_nodes('body') %>% 
  html_nodes('attraqt-star-rating-stars__bar') %>% 
  rvest::html_text()

Answer 1

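The data is pulled in dynamically from an API call, so it is not present in the HTML that read_html() downloads. Because the JSON returned by the API is nested, you need to extract the information you want yourself, for example by writing a couple of user-defined functions.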
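To illustrate what working with nested JSON looks like, here is a minimal sketch that inspects a small, made-up JSON string before any extraction code is written. The mock_json content is an assumption for illustration only: it imitates the path used in this answer but is not the real API response.

```r
library(jsonlite)

# Hypothetical, heavily simplified stand-in for the API response.
mock_json <- '{
  "universes": {"universe": [{
    "items-section": {"items": {"item": [
      {"attribute": [
        {"name": "id",   "value": [{"value": "1"}]},
        {"name": "name", "value": [{"value": "Toothpaste A"}]}
      ]}
    ]}}
  }]}
}'

# simplifyVector = FALSE keeps everything as nested lists,
# which is also what read_json() returns by default.
mock <- fromJSON(mock_json, simplifyVector = FALSE)

# Show only the top levels of the structure to find the path to the items.
str(mock, max.level = 3)

# Walk down to the first item and list the names of its attributes.
first_item <- mock$universes$universe[[1]]$`items-section`$items$item[[1]]
attribute_names <- vapply(first_item$attribute, function(a) a$name, character(1))
print(attribute_names)
```

Running str() with a small max.level on the real response is a quick way to find where the product list sits before committing to a path.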

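I first extract the listings (the list of products). A function, get_info, then takes a single product listing, extracts the title and rating, and returns a tibble. Because the index at which the rating appears can vary between products, I have an additional helper function, get_rating_index, which looks up the correct index for the rating dynamically and passes it back to get_info.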

I apply get_info over the list of product listings, listings, using map_dfr to combine the individual tibbles into the final data frame.

library(jsonlite)
library(purrr)
library(dplyr)

data <- jsonlite::read_json("https://www.chemistwarehouse.com.au/searchapi/webapi/search/category?category=159&index=0&sort=")

listings <- data$universes$universe[[1]]$`items-section`$items$item

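# Find the position of the 'bv_star_rating' attribute, which varies between products.
get_rating_index <- function(attribute) {
  match(TRUE, map(attribute, ~ .x$name == 'bv_star_rating'))
}

# Extract the title and rating of a single product listing as a one-row tibble.
get_info <- function(listing) {
  rating_index <- get_rating_index(listing$attribute)
  tibble(
    title = listing$attribute[[2]]$value[[1]]$value,
    rating = as.numeric(listing$attribute[[rating_index]]$value[[1]]$value)
  )
}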

dental_product_ratings <- purrr::map_dfr(listings, get_info)
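The same approach can be tried offline on a hand-built list. The sketch below assumes each listing has the shape this answer relies on: the title in the second attribute, and the rating under a 'bv_star_rating' attribute at a varying position. The mock_listings data and the make_attr helper are invented for illustration. Returning NA when a product has no rating attribute is an added assumption, not part of the code above.

```r
library(purrr)
library(dplyr)

# Hypothetical helper: build one attribute in the nested shape used above.
make_attr <- function(name, value) {
  list(name = name, value = list(list(value = value)))
}

# Made-up listings; the rating attribute sits at a different index in each,
# and the third listing has no rating at all.
mock_listings <- list(
  list(attribute = list(make_attr("id", "1"), make_attr("name", "Toothpaste A"),
                        make_attr("bv_star_rating", "4.5"))),
  list(attribute = list(make_attr("id", "2"), make_attr("name", "Floss B"),
                        make_attr("brand", "X"), make_attr("bv_star_rating", "3.9"))),
  list(attribute = list(make_attr("id", "3"), make_attr("name", "Mouthwash C")))
)

# Position of the rating attribute, or NA if the listing has none.
get_rating_index <- function(attribute) {
  match(TRUE, map_lgl(attribute, ~ identical(.x$name, "bv_star_rating")))
}

get_info <- function(listing) {
  idx <- get_rating_index(listing$attribute)
  tibble(
    title = listing$attribute[[2]]$value[[1]]$value,
    # Assumed fallback: NA_real_ when no rating attribute exists.
    rating = if (is.na(idx)) NA_real_ else as.numeric(listing$attribute[[idx]]$value[[1]]$value)
  )
}

# Row-bind the one-row tibbles into a single data frame.
mock_ratings <- map_dfr(mock_listings, get_info)
print(mock_ratings)
```

Looking the rating up by name rather than by a fixed position is what keeps the extraction stable when products carry different sets of attributes.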

Solution 1
1 2021-07-19 08:37:39