
Web Scraping Education Data in R

I was presented with a problem at work and am trying to think / work my way through it. However, I am very new to web scraping and need some help, or at least some good starting points.

I have a website from the education commission.

http://ecs.force.com/mbdata/mbprofgroupall?Rep=DEA

This site contains 50 tables, one for each state, with two columns in a question / answer format. My first attempt has been this...

library(tidyverse)
library(httr)
library(XML)

tibble(url = "http://ecs.force.com/mbdata/mbprofgroupall?Rep=DEA") %>% 
  # Request the page for each URL (only one here)
  mutate(get_data = map(.x = url,
                        ~GET(.x))) %>% 
  # Parse every HTML table found in the response text
  mutate(list_data = map(.x = get_data,
                         ~readHTMLTable(doc = content(.x, "text")))) %>% 
  pull(list_data)

My first thought was to create multiple dataframes, one for each state, in a list format.

This idea does not seem to have worked as anticipated. I was expecting a list of 50, but it seems to be a list with a single element rather than 50. It appears that this one response read every line but did not differentiate one table from the next. I'm confused about next steps; does anyone have any ideas? Web scraping is odd to me.

My second attempt was to copy and paste the tables into R as a tribble, one state at a time. This sort of worked, but not every column is formatted the same way. I attempted to use tidyr::separate() to break up the columns on "\t", and that worked for some columns, but not all.
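What I attempted looked roughly like this; the pasted values and column names are from memory, so treat it as a sketch:

library(tidyverse)

# Two hand-pasted rows, with (I believe) a tab between question and answer
pasted <- tribble(
  ~raw,
  "Statewide policy in place\tYes",
  "Definition or title of program\tDual Enrollment"
)

pasted %>% 
  separate(raw, into = c("question", "answer"), sep = "\t")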

Any help on this problem, or even just pointers to where I can learn more about web scraping, would be very helpful. This did not seem all that difficult at first, but it seems like there are a couple of things I am missing. Maybe rvest? I have never used it, but I know it is common for web scraping tasks.

Thanks in advance!

As you already guessed, rvest is a very good choice for web scraping. Using rvest you can get the table from your desired website in just two steps. With some additional data wrangling this can be transformed into a nice data frame.

library(rvest)
#> Loading required package: xml2
library(tidyverse)

html <- read_html("http://ecs.force.com/mbdata/mbprofgroupall?Rep=DEA")

df <- html %>% 
  html_table(fill = TRUE, header = FALSE) %>% 
  .[[1]] %>% 
  # Remove empty rows and rows containing the table header
  filter(!(X1 == "" & X2 == ""), !(grepl("^Dual", X1) & grepl("^Dual", X2))) %>% 
  # Create state column 
  mutate(is_state = X1 == X2, state = ifelse(is_state, X1, NA_character_)) %>% 
  fill(state) %>% 
  filter(!is_state) %>% 
  select(-is_state)
head(df, 2)
#>                               X1
#> 1      Statewide policy in place
#> 2 Definition or title of program
#>                                                                                                                                                                  X2
#> 1                                                                                                                                                               Yes
#> 2 Dual Enrollment – Postsecondary Institutions. High school students are allowed to take college courses for credit either at a high school or on a college campus.
#>     state
#> 1 Alabama
#> 2 Alabama
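
If you still want the list with one data frame per state from your first attempt, you can split df on the state column; the question / answer column names below are just a suggested renaming:

state_list <- df %>% 
  rename(question = X1, answer = X2) %>% 
  split(.$state)

# state_list$Alabama, for example, then holds only the Alabama rows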
