
Extracting multiple tables from webpage containing hyperlinks with R

This is my first attempt at web scraping. I am trying to extract a list of tables (column name: Oil & Gas Tables) from this webpage: Oil and Gas Data. It was easy to extract the data for one state by using that state's link (Alabama Data). However, I want a program that can extract the data for all states, keeping them year-wise as shown in the HTML data. I have loaded the packages RCurl, XML, rlist and purrr based on previous similar posts I came across.

The solution in "How can I use R (Rcurl/XML packages ?!) to scrape this webpage?" looks complete; however, the webpage in that question has probably changed since it was posted (I tried to emulate it, but couldn't).

I also looked at "R: XPath expression returns links outside of selected element". How can I use XPath to extract the tables I need, given that the state links all follow the pattern "stateinitials_table.html" (e.g. "al_table.html" for Alabama; view source)?

library(RCurl)
library(XML)
library(rlist)

theurl <- getURL("https://www.eia.gov/naturalgas/archive/petrosystem/al_table.html", .opts = list(ssl.verifypeer = FALSE))
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
# every second table from the third onward holds the year-wise data
berilium <- tables[seq(3, length(tables), 2)]

This is the output for "al_table.html": a list of 15 data frames, one for each of the 15 years.
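A quick way to confirm that structure (a sketch using the objects above):

length(berilium)              # 15, one data frame per year
str(berilium, max.level = 1)  # list the data frames without printing their contents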

So I need to:

make a function (XPath vs. readHTMLTable, preferably XPath) that extracts all the tables from the main web link, with the data 'marked' by state and year as shown on the web pages. (I am currently not concerned about cleaning up useless columns and rows.)

This is more of a blog post or tutorial than an SO answer, but I can appreciate the desire to learn; I am also working on a book on this very topic, and this seems like a good example.

library(rvest)
library(tidyverse)

We'll start with the top-level page:

pg <- read_html("https://www.eia.gov/naturalgas/archive/petrosystem/petrosysog.html")

Now, we'll use an XPath expression that gets us only the table rows that have state data in them. Compare the XPath expression to the tags in the HTML and it should make sense: find all <tr>s without colspan attributes, then keep only those <tr>s that have both the right class and a link to a state:

states <- html_nodes(pg, xpath=".//tr[td[not(@colspan) and 
                     contains(@class, 'links_normal') and a[@name]]]") 

data_frame(
  state = html_text(html_nodes(states, xpath=".//td[1]")),
  link = html_attr(html_nodes(states, xpath=".//td[2]/a"), "href")
) -> state_tab

It's in a data frame to keep it tidy and handy.
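A quick peek confirms the shape; each row pairs a state name with its relative link (the Alabama link, "al_table.html", is the one from the question):

head(state_tab)
# two columns: state (e.g. "Alabama") and link (e.g. "al_table.html")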

In your script, the next bit has to go below the function that comes after it, but the iteration is easier to explain before showing the function.

We need to iterate over each link. In each iteration, we:

  • pause since your needs aren't more important than EIA's server load
  • find all "branch" <div> s since they hold two pieces of information we need (the state+year and the data table for said state+year).
  • wrap it all up in a nice data frame

Rather than clutter up the anonymous function, we'll put that functionality in another function (again, which needs to be defined before this iterator will work):

pb <- progress_estimated(nrow(state_tab))
map_df(state_tab$link, ~{

  pb$tick()$print()

  pg <- read_html(sprintf("https://www.eia.gov/naturalgas/archive/petrosystem/%s", .x))

  Sys.sleep(5) # scrape responsibly

  html_nodes(pg, xpath=".//div[@class='branch']") %>% 
    map_df(extract_table)

}) -> og_df

This is the hard worker of the bunch. We need to find all the state + year labels on the page (each is in a <table>), then we need to find the tables with the data in them. I take the liberty of removing the explanatory blurb at the bottom of each, and also turn each into a tibble (but that's just my class preference):

extract_table <- function(pg) {

  # the "State Year" label sits in a row with a 'SystemTitle' cell
  t1 <- html_nodes(pg, xpath=".//../tr[td[contains(@class, 'SystemTitle')]][1]")
  # the data table for that state+year
  t2 <- html_nodes(pg, xpath=".//table[contains(@summary, 'Report')]")

  state_year <- (html_text(t1, trim=TRUE) %>% strsplit(" "))[[1]]

  # drop the explanatory blurb cell at the bottom of the table
  xml_find_first(t2, ".//td[@colspan]") %>% xml_remove()

  html_table(t2, header=FALSE)[[1]] %>% 
    mutate(state=state_year[1], year=state_year[2]) %>% 
    tbl_df()

}

Re-pasting the earlier code, just so it's clear that it has to come after the function definition:

pb <- progress_estimated(nrow(state_tab))
map_df(state_tab$link, ~{

  pb$tick()$print()

  pg <- read_html(sprintf("https://www.eia.gov/naturalgas/archive/petrosystem/%s", .x))

  Sys.sleep(5) # scrape responsibly

  html_nodes(pg, xpath=".//div[@class='branch']") %>% 
    map_df(extract_table)

}) -> og_df

And, it works (you said you'd do the final cleanup separately):

glimpse(og_df)
## Observations: 14,028
## Variables: 19
## $ X1    <chr> "", "Prod.RateBracket(BOE/Day)", "0 - 1", "1 - 2", "2 - 4", "4 - 6", "...
## $ X2    <chr> "", "||||", "|", "|", "|", "|", "|", "|", "|", "|", "|", "|", "|", "|"...
## $ X3    <chr> "Oil Wells", "# ofOilWells", "26", "19", "61", "61", "47", "36", "250"...
## $ X4    <chr> "Oil Wells", "% ofOilWells", "5.2", "3.8", "12.1", "12.1", "9.3", "7.1...
## $ X5    <chr> "Oil Wells", "AnnualOilProd.(Mbbl)", "4.1", "7.8", "61.6", "104.9", "1...
## $ X6    <chr> "Oil Wells", "% ofOilProd.", "0.1", "0.2", "1.2", "2.1", "2.2", "2.3",...
## $ X7    <chr> "Oil Wells", "OilRateper Well(bbl/Day)", "0.5", "1.4", "3.0", "4.9", "...
## $ X8    <chr> "Oil Wells", "AnnualGasProd.(MMcf)", "1.5", "3.5", "16.5", "19.9", "9....
## $ X9    <chr> "Oil Wells", "GasRateper Well(Mcf/Day)", "0.2", "0.6", "0.8", "0.9", "...
## $ X10   <chr> "", "||||", "|", "|", "|", "|", "|", "|", "|", "|", "|", "|", "|", "|"...
## $ X11   <chr> "Gas Wells", "# ofGasWells", "365", "331", "988", "948", "867", "674",...
## $ X12   <chr> "Gas Wells", "% ofGasWells", "5.9", "5.4", "16.0", "15.4", "14.1", "10...
## $ X13   <chr> "Gas Wells", "AnnualGasProd.(MMcf)", "257.6", "1,044.3", "6,360.6", "1...
## $ X14   <chr> "Gas Wells", "% ofGasProd.", "0.1", "0.4", "2.6", "4.2", "5.3", "5.4",...
## $ X15   <chr> "Gas Wells", "GasRateper Well(Mcf/Day)", "2.2", "9.2", "18.1", "30.0",...
## $ X16   <chr> "Gas Wells", "AnnualOilProd.(Mbbl)", "0.2", "0.6", "1.6", "2.0", "2.4"...
## $ X17   <chr> "Gas Wells", "OilRateper Well(bbl/Day)", "0.0", "0.0", "0.0", "0.0", "...
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala...
## $ year  <chr> "2009", "2009", "2009", "2009", "2009", "2009", "2009", "2009", "2009"...
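
Because every row carries its state and year, slicing out one state-year block is a simple filter; for example, the Alabama 2009 rows shown above:

filter(og_df, state == "Alabama", year == "2009")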

Hope this helps!

library(rvest)
library(dplyr)

main_page <- read_html("https://www.eia.gov/naturalgas/archive/petrosystem/petrosysog.html") 
state <- main_page %>%
  html_nodes(xpath='//td[contains(@width, "110")]') %>%
  html_children() %>%
  html_text()
state_link <- main_page %>%
  html_nodes(xpath='//td[contains(@width, "160")]') %>%
  html_children() %>%
  html_attr('href')

final_data <- list()
for (i in seq_along(state)) {
  child_page <- read_html(paste0("https://www.eia.gov/naturalgas/archive/petrosystem/",state_link[i]))
  Sys.sleep(5)

  child_page_stateAndYear <- child_page %>%
    html_nodes(xpath = '//td[@class="c SystemTitle" and @style=" font-size: 14pt; color: #CC0000;"]') %>%
    html_text
  child_page_table <- lapply(
    (child_page %>%
       html_nodes(xpath = '//table[contains(@class, "Table")]') %>%
       html_table()), 
    function(x) x[-nrow(x),])
  final_data[[state[i]]] <- setNames(child_page_table, child_page_stateAndYear)

  print(paste('Scraped data for', state[i], '...'))
  flush.console()
}
print('Congratulations! You have finished scraping the required data!')

names(final_data)
names(final_data[[1]])

final_data has 34 elements (i.e., the states available on the main webpage), and each element is a list of tables (i.e., the year-wise tabular data for that state). So if you want the data belonging to 'Alabama 2009', just type

final_data[['Alabama']][['Alabama 2009']]

(Note: you may need to do a little bit of data cleaning.)
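
If you would rather address the tables through one flat list instead of the nested one, a minimal sketch is to unlist() one level (R names the entries with its outer.inner convention):

flat_data <- unlist(final_data, recursive = FALSE)
flat_data[["Alabama.Alabama 2009"]]  # the same table as above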

Edit: added logic to sleep for some time before scraping the next webpage, as suggested by @hrbrmstr.
