
Web Scraping with rvest and xml2

I'm trying to scrape the date and policy type for COVID-related announcements from this URL: https://covid19.healthdata.org/united-states-of-america/alabama

The first date I'm trying to pull is the "April 4th, 2020" date for Alabama's Stay at Home Order.

As far as I can tell (I am new to this), it has the XPath:

 "//*[@id="root"]/div/main/div[3]/div[1]/div[2]/div[1]/div[1]/div/div/span"

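I've been using the following lines to try to retrieve it:

library(rvest)
url <- "https://covid19.healthdata.org/united-states-of-america/alabama"

# Attempt 1: CSS selector
data <- read_html(url) %>% 
  html_nodes("span.ant-statistic-content-value")

# Attempt 2: XPath
data <- read_html(url) %>%
  html_nodes(xpath = "//*[@id='root']/div/main/div[3]/div[1]/div[2]/div[1]/div[1]/div/div/span")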

Neither is able to pull the information I'm looking for. Any help would be appreciated!

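The data for this page is stored in a series of JSON files that the page loads after the initial HTML arrives, which is most likely why the selectors find nothing in what read_html() returns. If you open your browser's developer tools and filter the Network tab for requests of type XHR, you should see a list of those files (the original answer included a Safari screenshot here).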

Right-click a file name to copy its URL.
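To see why both selectors in the question come back empty, here is a minimal sketch. It assumes the server sends only an empty "root" container and leaves the rest to JavaScript; the HTML string and the script file name are made up to stand in for the real page, so nothing here is fetched from the site.

```r
library(rvest)

# Hypothetical stand-in for the static HTML, before any JavaScript has run
shell_html <- '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

page <- read_html(shell_html)

# CSS selector from the question: no matching nodes in the static HTML
spans <- html_nodes(page, "span.ant-statistic-content-value")
length(spans)

# XPath from the question: also no matching nodes
xp <- html_nodes(page, xpath = "//*[@id='root']/div/main/div[3]/div[1]/div[2]/div[1]/div[1]/div/div/span")
length(xp)
```

Both lengths are 0 for this stand-in. rvest only parses the HTML it is given and does not execute scripts, so anything a page adds later through JavaScript is not visible to it.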

This script should get you started:

library(jsonlite)
# Obtain the list of locations
locations <- fromJSON("https://covid19.healthdata.org/api/metadata/location?v=7", flatten = TRUE)

head(locations[, 1:9])
# Get the list of US locations
US <- locations$children[locations$location_name =="United States of America"]
head(US[[1]])

# Get the data frame of interventions
# Build the link with the desired location_id (569 is Virginia), for example:
# paste0("https://covid19.healthdata.org/api/data/intervention?location=", "569")
Interventions <- fromJSON("https://covid19.healthdata.org/api/data/intervention?location=569", flatten = TRUE)

Interventions
# date_reported covid_intervention_id location_id covid_intervention_measure_id   covid_intervention_measure_name
# 1 2020-03-30 00:00:00                   110         569                             1 People instructed to stay at home
# 2 2020-03-16 00:00:00                   258         569                             2     Educational facilities closed
# 3 2020-04-19 00:00:00                   437         569                             7          Assumed_implemented_date

# Repeat for other links of interest
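The question asks only for the date and the policy type, so the result can be trimmed to those two columns. The sketch below assumes the endpoint returns a JSON array of records with the column names shown in the printed output above. `intervention_url()` and `sample_json` are hypothetical names introduced here, and the sample text copies the three Virginia rows so that the snippet runs without a network connection.

```r
library(jsonlite)

# Hypothetical helper: build the intervention URL for a given location_id
intervention_url <- function(location_id) {
  paste0("https://covid19.healthdata.org/api/data/intervention?location=", location_id)
}

# Stand-in for the live response (assumed shape: a JSON array of records)
sample_json <- '[
  {"date_reported": "2020-03-30 00:00:00", "covid_intervention_id": 110,
   "location_id": 569, "covid_intervention_measure_id": 1,
   "covid_intervention_measure_name": "People instructed to stay at home"},
  {"date_reported": "2020-03-16 00:00:00", "covid_intervention_id": 258,
   "location_id": 569, "covid_intervention_measure_id": 2,
   "covid_intervention_measure_name": "Educational facilities closed"},
  {"date_reported": "2020-04-19 00:00:00", "covid_intervention_id": 437,
   "location_id": 569, "covid_intervention_measure_id": 7,
   "covid_intervention_measure_name": "Assumed_implemented_date"}
]'

# With a live endpoint this would be fromJSON(intervention_url(569), flatten = TRUE)
interventions <- fromJSON(sample_json, flatten = TRUE)

# Keep the first 10 characters (YYYY-MM-DD) and convert them to Date
interventions$date_reported <- as.Date(substr(interventions$date_reported, 1, 10))

# Keep only the date and the policy type
policies <- interventions[, c("date_reported", "covid_intervention_measure_name")]
policies
```

For Alabama, pass the location_id you find for it in the US list to `intervention_url()` in place of 569.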


 