
Scraping html header with R

My objective

I'm attempting to use R to scrape text from a web page (https://tidesandcurrents.noaa.gov/stationhome.html?id=8467150). For the purposes of this question, my goal is to access the header text that contains the station number ("Bridgeport, CT - Station ID: 8467150"). Below is a screenshot of the page. I've highlighted the text whose presence I'm trying to verify, and the same text is highlighted in the inspect element pane.

[Screenshot of the target web page. The text I am looking for is highlighted in yellow and selected in the inspect element pane.]

My old approach was to access the full text of the site with readLines(). A recent update to the website has made the text more difficult to access, and the station name/number is no longer visible to readLines():

url <- "https://tidesandcurrents.noaa.gov/stationhome.html?id=8467150"
stn <- "8467150"

webpage <- readLines(url, warn = FALSE)

### grep indicates that the station number is not present in the scraped text
grep(x = webpage, pattern = stn, value = TRUE)

Potential solutions

I am therefore looking for a new way to access my target text. I have tried using httr, but still cannot get all the html text to be included in what I scrape from the web page. The XML and rvest packages also seem promising, but I am not sure how to identify the relevant CSS selector or XPath expression.

### an attempt using httr
hDat <- httr::RETRY("GET", url, times = 10)
txt  <- httr::content(hDat, "text")

### grep indicates that the station number is still not present
grep(x = txt, pattern = stn, value = TRUE)


### a partial attempt using XML
h  <- xml2::read_html(url)
h2 <- XML::htmlTreeParse(as.character(h), useInternalNodes = TRUE, asText = TRUE)  # asText = TRUE expects a character string

### this may end up working, but I'm not sure how to identify the correct path
### ("div.span8" is a CSS selector; the path argument needs XPath syntax)
html.parse <- XML::xpathApply(h2, path = "//div[contains(@class, 'span8')]", XML::xmlValue)

Regardless of the approach, I would welcome any suggestions that can help me access the header text containing the station name/number.

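Answer

The header text is evidently not in the HTML that the server returns, which is why readLines() and httr do not find it. Short of driving a real browser with Selenium, scraping it will be very hard. NOAA encourages you to use its free RESTful JSON APIs instead, and it goes to great lengths to discourage HTML scraping.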
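On the selector part of the question: "div.span8" is a CSS selector, while xpathApply() and xml2's xml_find_*() functions expect XPath. Here is a minimal sketch of the XPath form, run against a made-up HTML string rather than the real page. The div/class layout below is an assumption for illustration only; I have not confirmed that the NOAA page is structured this way.

```r
library(xml2)

# Hypothetical static HTML standing in for a page whose header IS in the source.
# The markup is invented for this example.
html <- '<html><body><div class="span8">Bridgeport, CT - Station ID: 8467150</div></body></html>'
doc <- read_html(html)

# XPath counterpart of the CSS selector "div.span8"
# (contains() is a loose match; it would also match a class such as "span80")
xp <- "//div[contains(@class, 'span8')]"

header <- xml_text(xml_find_first(doc, xp), trim = TRUE)
header
```

A selector like this only helps when the text is present in the downloaded HTML. The grep() results in the question suggest that it is not, so neither CSS nor XPath will find it in what readLines() or httr return.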

The following code gets what you want from NOAA's JSON metadata API and returns it as a data frame (a tibble).

library(tidyverse)
library(jsonlite)

j1 <- fromJSON(txt = 'https://api.tidesandcurrents.noaa.gov/mdapi/prod/webapi/stations/8467150.json', simplifyDataFrame = TRUE, flatten = TRUE)

j1$stations %>% as_tibble() %>% select(name, state, id)

Results

# A tibble: 1 x 3
  name       state id     
  <chr>      <chr> <chr>  
1 Bridgeport CT    8467150
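If you need the same string that appears in the page header, you can rebuild it from those three columns. Below is a rough sketch. The helper names format_station_header() and fetch_station_header() are my own, not part of any package, and the header layout is copied from the text quoted in the question.

```r
# Hypothetical helper: build the header string from station metadata
# (a data frame or list with name, state and id, as in the result above).
format_station_header <- function(stations) {
  sprintf("%s, %s - Station ID: %s", stations$name, stations$state, stations$id)
}

# Hypothetical helper: fetch the metadata for one station ID from the
# endpoint used above and format it. Requires jsonlite and network access.
fetch_station_header <- function(stn) {
  api_url <- sprintf(
    "https://api.tidesandcurrents.noaa.gov/mdapi/prod/webapi/stations/%s.json",
    stn
  )
  j <- jsonlite::fromJSON(api_url, flatten = TRUE)
  format_station_header(j$stations)
}

# Offline check using the values shown in the results above
example <- data.frame(name = "Bridgeport", state = "CT", id = "8467150",
                      stringsAsFactors = FALSE)
format_station_header(example)
# [1] "Bridgeport, CT - Station ID: 8467150"
```

With network access, fetch_station_header("8467150") should give the same string, provided the API keeps returning the name, state and id fields shown above. To verify the station as in the original grep() approach, test the result with grepl(stn, header, fixed = TRUE).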


 