
How to scrape specific information from a website with several pages in R

I have just started with web scraping in R, and I am having trouble figuring out how to scrape specific information from a website with several pages without having to run the code for each individual URL. So far I have managed to do it for the first page using this example: https://towardsdatascience.com/tidy-web-scraping-in-r-tutorial-and-resources-ac9f72b4fe47 .

I have also managed to generate the URLs based on page number with this code:


# str_c() comes from the stringr package (loaded with the tidyverse)
list_of_pages <- str_c(url, '?page=', 1:32)

The problem is to integrate this and use the generated URLs to get the information I need with one function, storing the results in a data frame. This is the code I have for scraping the information:

library(rvest)
library(xml2)

hot100page <- "https://www.billboard.com/charts/hot-100"
hot100 <- read_html(hot100page)

rank <- hot100 %>% 
  rvest::html_nodes('body') %>% 
  xml2::xml_find_all("//span[contains(@class, 'chart-element__rank__number')]") %>% 
  rvest::html_text()

This is an example of the structure of the website I plan to use the function for: https://www.amazon.com/s?k=statistics&ref=nb_sb_noss_2 .

I suggest you use RSelenium.

Below is a possible solution.

# Load the library
library(RSelenium) 

# Start a Selenium server and browser (you have to select one)
driver <- rsDriver(browser = c("firefox"), port = 4567L)

# Define the client part
remote_driver <- driver[["client"]]

# Send the website address to Firefox
remote_driver$navigate("https://www.amazon.com/s?k=statistics&ref=nb_sb_noss_2")

# An empty list to save the data
all_books <- list()
# A loop to click "next"
for (i in 1:20) {
  # Sleep to wait until the page is available
  Sys.sleep(3)
  # Find the body element via CSS
  scroll_d <- remote_driver$findElement(using = "css", value = "body")
  # Tell the browser to scroll to the end of the page
  scroll_d$sendKeysToElement(list(key = "end"))
  # Get all books (title, price, ranking, etc.) as one text block per page
  all_books[[i]] <- remote_driver$findElement(using = 'css selector', value = 'span.s-latency-cf-section:nth-child(4)')$getElementText()[[1]]
  # Push the "next" button
  next_button <- remote_driver$findElement(using = 'css selector', value = '.a-last')
  next_button$clickElement()
}

head(all_books)
[[1]]
[1] "1\nNew\nLife Goes On\nBTS\n-\n1\n1\n2\nFailing\nMood\n24kGoldn Featuring iann dior

Here's a way to do it using rvest. Keep in mind that this particular website (hot100) doesn't actually use pagination, so the ?page=1 etc. part of the URL is meaningless (it just keeps loading the homepage). But for sites with pagination, this would work.

library(tidyverse)
library(rvest)

hot100page <- "https://www.billboard.com/charts/hot-100"
hot100 <- read_html(hot100page)

rank <- c()

for(i in 1:32) {

  print(paste0("Scraping page ", i))

  temp <- paste0(hot100page, '?page=', i) %>% 
    read_html() %>% 
    rvest::html_nodes('body') %>% 
    xml2::xml_find_all("//span[contains(@class, 'chart-element__rank__number')]") %>% 
    rvest::html_text()

  rank <- c(rank, temp)
}

# Build the data frame from the collected vector
# (assigning to a column of an empty data frame would fail because the lengths differ)
df <- data.frame(rank = rank)
df
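If you also want other fields (song title, artist, and so on), the same idea extends to several node sets per page, collected with purrr::map_dfr. The CSS class names below are assumptions following the naming scheme of the rank selector above, so verify them in the page source before relying on them:

library(tidyverse)
library(rvest)

# Sketch: scrape one page into a tibble, then bind all pages together.
# The song/artist class names are assumed, not verified.
scrape_page <- function(i) {
  page <- read_html(paste0(hot100page, '?page=', i))
  tibble(
    rank   = page %>% html_nodes("span.chart-element__rank__number") %>% html_text(),
    song   = page %>% html_nodes("span.chart-element__information__song") %>% html_text(),
    artist = page %>% html_nodes("span.chart-element__information__artist") %>% html_text()
  )
}

df <- map_dfr(1:32, scrape_page)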
