
Using rvest to scrape text and a table, and combine the two, from multiple pages

I want to scrape one table from each of several URLs. I managed to scrape a single page, but my function fails when I try to scrape across pages and stack the tables into a single data frame or list.

library(rvest)
library(tidyverse)
library(purrr)

index <- 225:227
urls <- paste0("https://lsgkerala.gov.in/en/lbelection/electdmemberdet/2010/", index)

get_gram <- function(url){
  urls %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/a[2]') %>%
    html_text() -> temp
  urls %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/table') %>%
    html_table() %>%
    as.data.frame() %>% add_column(newcol=str_c(temp))
}
# results <- map_df(urls, get_gram)
# Commented out, but this is what I used to get the table when the index
# had just one element, and it worked.

results <- list()
results[[i]] <- map_df(urls,get_gram)

I think I am going wrong at the step where I stack the map_df output. Thank you in advance for your help!

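Your function takes an argument named url, but its body uses the global vector urls, so each call hands read_html the whole vector of URLs rather than the single URL it was given. Try this version, which also reads each page only once: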

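library(rvest)
library(dplyr)
library(tibble)  # provides add_column()

index <- 225:227
urls <- paste0("https://lsgkerala.gov.in/en/lbelection/electdmemberdet/2010/", index)

get_gram <- function(url){
  # Read the page once and reuse it for both the text and the table
  webpage <- url %>% read_html()
  # Text of the second link in the content block
  webpage %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/a[2]') %>%
    html_text() -> temp
  # The table, with that text added as an extra column
  webpage %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/table') %>%
    html_table() %>%
    as.data.frame() %>% add_column(newcol = temp)
}
# Apply get_gram to every URL and row-bind the results into one data frame
result <- purrr::map_df(urls, get_gram)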
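
To see the effect of the argument/global mix-up without touching the network, here is a minimal sketch. The functions fake_scrape and fake_scrape_buggy are hypothetical stand-ins for a scraper (they are not part of rvest), and the tables they return are made up. It also shows that map_df already stacks the per-page tables, so no separate results list is needed.

```r
library(purrr)
library(tibble)

ids <- 225:227  # same indices as above; nothing is downloaded here

# Hypothetical stand-in for a scraper: returns a two-row table for ONE id.
fake_scrape <- function(id) {
  tibble(page_id = id, ward = 1:2)
}

# Buggy variant: the argument is `id`, but the body uses the global `ids`,
# so every call produces rows for all three ids.
fake_scrape_buggy <- function(id) {
  tibble(page_id = ids, ward = 1L)
}

stacked <- map_df(ids, fake_scrape)        # one stacked table, 2 rows per id
buggy   <- map_df(ids, fake_scrape_buggy)  # every id repeated in every call

nrow(stacked)
nrow(buggy)
```

In this toy version the bug only duplicates rows. In the real function the whole URL vector is passed to read_html, which is meant to read a single document.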

Consider this approach. Because your code suggests that there is only one table per page, html_node (which returns a single match) is enough; html_nodes is not needed.

library(tidyverse)
library(rvest)

get_title <- . %>% html_node(xpath = '//*[@id="block-zircon-content"]/a[2]') %>% html_text()
get_table <- . %>% html_node(xpath = '//*[@id="block-zircon-content"]/table') %>% html_table()

urls <- paste0("https://lsgkerala.gov.in/en/lbelection/electdmemberdet/2010/", 225:227)

tibble(urls) %>% 
  mutate(
    page = map(urls, read_html), 
    newcol = map_chr(page, get_title), 
    data = map(page, get_table), 
    page = NULL, urls = NULL
  ) %>% 
  unnest(data)

Output

# A tibble: 52 x 7
   newcol                                           `Ward No.` `Ward Name`      `Elected Members` Role      Party  Reservation
   <chr>                                                 <int> <chr>            <chr>             <chr>     <chr>  <chr>      
 1 Thiruvananthapuram - Chemmaruthy Grama Panchayat          1 VANDIPPURA       BABY P            Member    CPI(M) Woman      
 2 Thiruvananthapuram - Chemmaruthy Grama Panchayat          2 PALAYAMKUNNU     SREELATHA D       Member    INC    Woman      
 3 Thiruvananthapuram - Chemmaruthy Grama Panchayat          3 KOVOOR           KAVITHA V         Member    INC    Woman      
 4 Thiruvananthapuram - Chemmaruthy Grama Panchayat          4 SIVAPURAM        ANIL. V           Member    INC    General    
 5 Thiruvananthapuram - Chemmaruthy Grama Panchayat          5 MUTHANA          JAYALEKSHMI S     Member    INC    Woman      
 6 Thiruvananthapuram - Chemmaruthy Grama Panchayat          6 MAVINMOODU       S SASIKALA NATH   Member    CPI(M) Woman      
 7 Thiruvananthapuram - Chemmaruthy Grama Panchayat          7 NJEKKADU         P.MANILAL         Member    INC    General    
 8 Thiruvananthapuram - Chemmaruthy Grama Panchayat          8 CHEMMARUTHY      SASEENDRA         President INC    Woman      
 9 Thiruvananthapuram - Chemmaruthy Grama Panchayat          9 PANCHAYAT OFFICE PRASANTH PANAYARA Member    INC    General    
10 Thiruvananthapuram - Chemmaruthy Grama Panchayat         10 VALIYAVILA       SANJAYAN S        Member    INC    General    
# ... with 42 more rows
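
The difference between html_node and html_nodes can be checked offline with a small sketch. The HTML string below is a made-up page that only imitates the structure assumed by the XPath expressions above; it is not the real site's markup.

```r
library(rvest)

# Hypothetical page: one content block with two links and a single table.
html <- '<html><body><div id="block-zircon-content">
  <a href="#">Home</a><a href="#">Sample Panchayat</a>
  <table>
    <tr><th>Ward No.</th><th>Role</th></tr>
    <tr><td>1</td><td>Member</td></tr>
    <tr><td>2</td><td>President</td></tr>
  </table>
</div></body></html>'

page <- read_html(html)  # read_html also accepts a literal HTML string

# html_node returns a single node, so html_table gives one table directly.
one <- page %>%
  html_node(xpath = '//*[@id="block-zircon-content"]/table') %>%
  html_table()

# html_nodes returns a node set, so html_table gives a list of tables.
many <- page %>%
  html_nodes(xpath = '//*[@id="block-zircon-content"]/table') %>%
  html_table()

# Text of the second link, used as the extra column in the answers above.
title <- page %>%
  html_node(xpath = '//*[@id="block-zircon-content"]/a[2]') %>%
  html_text()

is.data.frame(one)
length(many)
title
```

With html_node the table can go straight into unnest, whereas the list returned by html_nodes would first need to be indexed or bound together.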


 