
How can I scrape a table from a website in R?

I'm trying to extract the bottom table ('Daily Observations') from https://www.wunderground.com/history/daily/us/dc/washington/KDCA/date/2011-1-1 . I copied the full XPath of the table element, but the output is {xml_nodeset (0)}. What am I doing wrong here? I used the following code:

library(rvest)
single <- read_html('https://www.wunderground.com/history/daily/us/dc/washington/KDCA/date/2011-1-1')  
single %>%
  html_nodes(xpath = '/html/body/app-root/app-history/one-column-layout/wu-header/sidenav/mat-sidenav-container/mat-sidenav-content/div/section/div[2]/div/div[5]/div/div/lib-city-history-observation/div/div[2]/table')

It seems that the table component is empty.

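This is a dynamic page: the table is generated by JavaScript after the page loads, so it is not present in the static HTML that read_html() downloads. rvest alone will not suffice. Nonetheless, you can get the underlying data from the JSON API that the page calls.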
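
As a quick illustration of the diagnosis, here is a minimal sketch of what rvest sees on a page whose table is only created by script. The HTML string is a made-up stand-in, not the real Weather Underground markup; it only assumes that the static source has an empty app-root element and no table.

```r
library(rvest)

# Made-up stand-in for the static source of a JavaScript-rendered page:
# the container element exists, but no <table> has been rendered into it.
static_html <- '<html><body>
  <app-root></app-root>
  <script id="app-root-state">{}</script>
</body></html>'

page <- read_html(static_html)

# Any XPath that targets a table finds nothing in the static document,
# which is why the question's code returns an empty nodeset.
tables <- html_nodes(page, xpath = "//table")
n_tables <- length(tables)

# The script node, in contrast, is part of the static source and can be read.
state_node <- html_node(page, xpath = '//script[@id="app-root-state"]')
state_text <- html_text(state_node)
```

If n_tables is 0 for the real page too, the data has to come from somewhere other than the HTML, which is what the code below does.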

library(tidyverse)
library(rvest)
library(lubridate)
library(jsonlite)

# Read static html. It won't create the table, but it holds the API key
# we need to retrieve the source JSON.

htm_obj <- 
  read_html('https://www.wunderground.com/history/daily/us/dc/washington/KDCA/date/2011-1-1')

# Retrieve the API key. This key is stored in a node with javascript content.
str_apikey <- 
  html_node(htm_obj, xpath = '//script[@id="app-root-state"]') %>%
  html_text() %>% gsub("^.*SUN_API_KEY&q;:&q;|&q;.*$", "", . )

# Create a URI pointing to the API, with the API key as the first key-value pair of the query
url_apijson <- paste0(
  "https://api.weather.com/v1/location/KDCA:9:US/observations/historical.json?apiKey=",
  str_apikey,
  "&units=e&startDate=20110101&endDate=20110101")
# Capture the JSON
json_obj <- fromJSON(txt = url_apijson)

# Wrangle the JSON's contents into the table you need
tbl_daily <- 
  json_obj$observations %>% as_tibble() %>% 
  mutate(valid_time_gmt = as_datetime(valid_time_gmt) %>%
                          with_tz("America/New_York")) %>% # The time zone this airport (KDCA) is located in.
  select(valid_time_gmt, temp, dewPt, rh, wdir_cardinal, gust, pressure, precip_hrly) # The equivalent variables of your HTML table

Results: A nice table

# A tibble: 34 x 8
   valid_time_gmt       temp dewPt    rh wdir_cardinal gust  pressure precip_hrly
   <dttm>              <int> <int> <int> <chr>         <lgl>    <dbl>       <dbl>
 1 2010-12-31 23:52:00    38    NA    79 CALM          NA        30.1          NA
 2 2011-01-01 00:52:00    35    31    85 CALM          NA        30.1          NA
 3 2011-01-01 01:52:00    36    31    82 CALM          NA        30.1          NA
 4 2011-01-01 02:52:00    37    31    79 CALM          NA        30.1          NA
 5 2011-01-01 03:52:00    36    30    79 CALM          NA        30.1          NA
 6 2011-01-01 04:52:00    37    30    76 NNE           NA        30.1          NA
 7 2011-01-01 05:52:00    36    30    79 CALM          NA        30.1          NA
 8 2011-01-01 06:52:00    34    30    85 CALM          NA        30.1          NA
 9 2011-01-01 07:52:00    37    31    79 CALM          NA        30.1          NA
10 2011-01-01 08:52:00    44    38    79 CALM          NA        30.1          NA
# ... with 24 more rows
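
To see how the key extraction and the URL construction work without hitting the network, here is a small offline sketch. The sample text and the key value "abc123" are invented, and build_obs_url() is a hypothetical helper, not part of any package. It assumes the endpoint accepts other dates in the same startDate/endDate format as above, which I have not verified.

```r
# Invented excerpt of the script node's text; the real one is much longer.
state_txt <- "{&q;other&q;:&q;x&q;,&q;SUN_API_KEY&q;:&q;abc123&q;,&q;more&q;:&q;y&q;}"

# Same regex as above: the first alternative removes everything up to and
# including 'SUN_API_KEY&q;:&q;', the second removes everything from the
# next '&q;' to the end of the string, leaving only the key.
api_key <- gsub("^.*SUN_API_KEY&q;:&q;|&q;.*$", "", state_txt)

# Hypothetical helper: build the API URL for one day of observations.
# The station string and query parameters are copied from the code above.
build_obs_url <- function(api_key, date, station = "KDCA:9:US") {
  ymd <- format(as.Date(date), "%Y%m%d")  # e.g. "2011-01-01" -> "20110101"
  paste0(
    "https://api.weather.com/v1/location/", station,
    "/observations/historical.json?apiKey=", api_key,
    "&units=e&startDate=", ymd, "&endDate=", ymd
  )
}

url_example <- build_obs_url(api_key, "2011-01-01")
```

With a real key, url_example could be passed to fromJSON() in place of url_apijson, and calling the helper in a loop over dates would be one way to collect several days.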

