Scraping HTML header with R

Question

My objective

I'm attempting to use R to scrape text from a web page (https://tidesandcurrents.noaa.gov/stationhome.html?id=8467150). For the purposes of this question, my goal is to access the header text that contains the station number ("Bridgeport, CT - Station ID: 8467150"). Below is a screenshot of the page. I've highlighted the text that I'm trying to verify is present, and the text is also highlighted in the inspect element pane.

My old approach was to access the full text of the site with readLines(). A recent update to the website has made the text more difficult to access, and the station name/number is no longer visible to readLines():

url <- "https://tidesandcurrents.noaa.gov/stationhome.html?id=8467150"
stn <- "8467150"

webpage <- readLines(url, warn = FALSE)

### grep indicates that the station number is not present in the scraped text
grep(x = webpage, pattern = stn, value = TRUE)

Potential solutions

I am therefore looking for a new way to access my target text. I have tried using httr, but the text I retrieve still does not contain all of the HTML that I see in the browser. The XML and rvest packages also seem promising, but I am not sure how to identify the relevant CSS selector or XPath expression.

### an attempt using httr
hDat <- httr::RETRY("GET", url, times = 10)
txt  <- httr::content(hDat, "text")

### grep indicates that the station number is still not present
grep(x = txt, pattern = stn, value = TRUE)


### a partial attempt using XML
### (parses the text downloaded with httr above)
h2 <- XML::htmlTreeParse(txt, useInternalNodes = TRUE, asText = TRUE)

### this may end up working, but I'm not sure how to identify the correct path
### (xpathApply() needs an XPath expression; "div.span8" is a CSS selector)
html.parse <- XML::xpathApply(h2, path = "//div[contains(@class, 'span8')]", XML::xmlValue)

Regardless of the approach, I would welcome any suggestions that can help me access the header text containing the station name/number.

Answer 1

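Unless you use a browser-automation tool such as Selenium, it will be very hard: the station name and number are most likely filled in by JavaScript after the page loads, so they never appear in the raw HTML that readLines() or httr download. NOAA encourages you to access their free RESTful JSON APIs instead, and it goes to great lengths to discourage HTML scraping.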

That said, the following code will get what you want (station name, state, and ID) from NOAA's JSON metadata API as a data frame.

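library(tidyverse)
library(jsonlite)

# Download and parse the metadata for station 8467150 from the NOAA metadata API
j1 <- fromJSON(txt = 'https://api.tidesandcurrents.noaa.gov/mdapi/prod/webapi/stations/8467150.json', simplifyDataFrame = TRUE, flatten = TRUE)

# Keep only the station name, state, and ID
j1$stations %>% as_tibble() %>% select(name, state, id)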

Results

# A tibble: 1 x 3
  name       state id     
  <chr>      <chr> <chr>  
1 Bridgeport CT    8467150
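
If the goal is still to check for the header text itself, the three fields can be pasted back together into the string shown on the page. This is only a minimal sketch: it assumes the API response keeps a `stations` data frame with the `name`, `state`, and `id` columns shown in the results, and `station_header()` and `fetch_station_header()` are hypothetical helper names, not functions from any package.

```r
# Hypothetical helper: rebuild the header text shown on the station page
# from the station name, state, and ID.
station_header <- function(name, state, id) {
  sprintf("%s, %s - Station ID: %s", name, state, id)
}

# Hypothetical helper: fetch the metadata for one station ID and return the
# header string. Assumes the response contains a `stations` data frame with
# `name`, `state`, and `id` columns, as in the results above.
fetch_station_header <- function(stn) {
  api_url <- sprintf(
    "https://api.tidesandcurrents.noaa.gov/mdapi/prod/webapi/stations/%s.json",
    stn
  )
  j <- jsonlite::fromJSON(txt = api_url, simplifyDataFrame = TRUE, flatten = TRUE)
  s <- j$stations[1, ]
  station_header(s$name, s$state, s$id)
}

# Offline check using the values from the results above
station_header("Bridgeport", "CT", "8467150")
# [1] "Bridgeport, CT - Station ID: 8467150"
```

With a network connection, `grepl(stn, fetch_station_header(stn), fixed = TRUE)` would then stand in for the `grep()` check from the question.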
