
Web scraping using R: I want to extract some table-like data from a website

I'm having some problems scraping data from a website, and I don't have much experience with web scraping. My plan is to use R to scrape some data from the following website: https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands

More precisely, I want to extract the brands on the right-hand side.

My idea so far:

library(rvest)

brands <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>%
  html_nodes(xpath = '/html/body/div[1]/div/div[2]/div[2]/div[2]/div[4]/div/div/div[3]/div/div[1]/div') %>%
  html_text()

But this doesn't return the intended information. Some help would be really appreciated here! Thanks!

That data is dynamically pulled from a script tag, so it is not present in the HTML nodes your XPath targets. You can pull the content of that script tag and parse it as JSON. Then subset the returned list to just the items of interest and extract the brand names:

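library(rvest)
library(jsonlite)
library(stringr)

# Read the page and parse the JSON embedded in the __NEXT_DATA__ script tag
data <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>% 
  html_node('#__NEXT_DATA__') %>% html_text() %>% 
  jsonlite::parse_json()

# Keep only the entries whose names start with "Brand:"
# (str_detect is vectorised, so no map() from purrr is needed)
data <- data$props$pageProps$apolloState
mask <- str_detect(names(data), '^Brand:')
data <- subset(data, mask)

# Extract the name of each brand
brands <- lapply(data, function(x){x$name})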
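
To try the idea without hitting the live site, here is a minimal sketch on a made-up page. The HTML string, the brand names, and the `Supplier:59787` entry are all hypothetical stand-ins; only the `__NEXT_DATA__` id and the `props$pageProps$apolloState` path are taken from the code above, and the real page's JSON may well be shaped differently.

```r
library(rvest)
library(jsonlite)
library(stringr)

# Hypothetical stand-in for the page: a script tag holding JSON
# (the names and values below are invented for illustration)
html <- '<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props":{"pageProps":{"apolloState":{
  "Brand:1":{"name":"Acme"},
  "Brand:2":{"name":"Globex"},
  "Supplier:59787":{"name":"Example Supplier"}
}}}}
</script>
</body></html>'

# Pull the script tag's text and parse it as JSON
state <- read_html(html) %>%
  html_node('#__NEXT_DATA__') %>%
  html_text() %>%
  parse_json()

# Keep only the "Brand:" entries and collect their names as a character vector
state <- state$props$pageProps$apolloState
brand_entries <- state[str_detect(names(state), '^Brand:')]
brands <- unname(vapply(brand_entries, function(x) x$name, character(1)))
```

Here `vapply(..., character(1))` is used instead of `lapply` only so the result is a plain character vector rather than a list; the non-brand entry is dropped by the `^Brand:` filter.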

I find the step-by-step version easier to read, but the same steps can also be written as a single pipe chain:

library(rvest)
library(jsonlite)
library(stringr)

brands <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>% 
  html_node('#__NEXT_DATA__') %>% html_text() %>% 
  jsonlite::parse_json() %>% 
  {.$props$pageProps$apolloState} %>%              # drill into the parsed list
  subset(., {str_detect(names(.), '^Brand:')}) %>% # keep only the "Brand:" entries
  lapply(., function(x){x$name})                   # extract each brand name

Wrapping a pipeline step in {} so that it is treated as an expression, rather than as a function call that receives the piped value as its first argument, is something I read in a comment by @asachet.
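
As a small illustration of that brace behaviour with the magrittr pipe, here is a sketch on a made-up nested list (the `nested` object and its contents are hypothetical):

```r
library(magrittr)

# Made-up nested list, for illustration only
nested <- list(props = list(items = list(a = 1, b = 2, c = 3)))

# Inside braces, the step is evaluated as an expression where `.` is the
# piped value; nothing is inserted as a first argument.
items <- nested %>% {.$props$items}

# The same idea lets `.` appear more than once in a single step.
big <- items %>% {.[unlist(.) > 1]}
```

After running this, `items` holds the inner list and `big` keeps only the elements greater than 1.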
