
Dropped rows using readHTMLTable in R

I am attempting to extract model data from NOAA using readHTMLTable. The table I am trying to get has multiple subtitle rows, where each subtitle consists of a single cell spanning all columns, as far as I can tell from the HTML. For some reason, this causes readHTMLTable to omit the row immediately following each subtitle. Here's code that reproduces the issue:

library(XML)

url <- "http://nomads.ncep.noaa.gov/"
ncep.tables = readHTMLTable(url, header=TRUE)

#Find the list of real time models
for(ncep.table in ncep.tables) {
    if("grib filter" %in% names(ncep.table) && "gds-alt" %in% names(ncep.table)) {
        rt.tbl <- ncep.table
    }
}

#Here's where the problem is:
cat(paste(rt.tbl[["Data Set"]][15:20], collapse = "\n"))

#On the website, there is a model called "AQM Daily Maximum"
#between Regional Models and AQM Hourly Surface Ozone
#but it's missing now...
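The dropped row can plausibly be reproduced offline with a hand-built table that imitates the NOAA markup. This is a sketch: `bad_html` below is a made-up fragment, not the actual page source, and the exact recovery behaviour depends on the libxml2 version.

```r
library(XML)

# Made-up fragment imitating the NOAA table: a subtitle row whose single cell
# spans all columns, followed by a data row that is missing its opening <tr>.
bad_html <- '<table>
<tr><th>Data Set</th><th>freq</th></tr>
<tr><td colspan="2">Regional Models</td></tr>
<td>AQM Daily Maximum</td><td>06Z, 12Z</td></tr>
<tr><td>AQM Hourly Surface Ozone</td><td>06Z, 12Z</td></tr>
</table>'

# libxml2 has to guess how to attach the orphaned <td> cells, and
# readHTMLTable can lose that row in the process
readHTMLTable(htmlParse(bad_html), header = TRUE)[[1]]
```

If the "AQM Daily Maximum" row disappears here too, the bug is in how the malformed markup is recovered during parsing, not in readHTMLTable's table logic itself.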

So, if you go to http://nomads.ncep.noaa.gov/ and look at the central table (the one with "Data Set" in the top right cell), you'll see a subtitle called "Regional Models." The AQM Daily Maximum model immediately below that subtitle is skipped during the extraction in the code above.

I maintain the rNOMADS package in R, so if I can get this working it will save me loads of time maintaining the package, as well as keep it accurate and up to date for my users. Thank you for your help!

By golly, I think I got it. You won't be able to use readHTMLTable (and I now know the XML package code far better than I did before… some serious R-fu in that code). I'm using rvest simply because I mix XPath and CSS selectors (though I ended up thinking mostly in XPath). dplyr is only for glimpse.

library(XML)
library(dplyr)
library(rvest)

trim <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x)

# neither rvest::html nor rvest::html_session liked it, hence using XML::htmlParse
doc <- htmlParse("http://nomads.ncep.noaa.gov/")

ds <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
                                            descendant::td[contains(., 'http')]/
                                            preceding-sibling::td[3]")

data_set <- ds %>% html_text() %>% trim()
data_set_descr_link <- ds %>% html_nodes("a") %>% html_attr("href")

freq <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
                           descendant::td[contains(., 'hourly') or
                                          contains(., 'hours') or
                                          contains(., 'daily') or
                                          contains(., '06Z')]") %>%
  html_text() %>% trim()

grib_filter <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
                                  descendant::td[contains(., 'http')]/preceding-sibling::td[1]") %>%
  sapply(function(x) {
    # some rows have no grib filter link; return NA for those
    ifelse(x %>% xpathApply("boolean(./a)"),
           x %>% html_node("a") %>% html_attr("href"),
           NA)
  })

http_link <- doc %>% html_nodes("a[href^='/pub/data/']") %>% html_attr("href")

gds_alt <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
                              descendant::td[contains(., 'http')]/following-sibling::td[1]") %>%
  sapply(function(x) {
    ifelse(x %>% xpathApply("boolean(./a)"),
           x %>% html_node("a") %>% html_attr("href"),
           NA)
  })

nom <- data.frame(data_set,
                  data_set_descr_link,
                  freq,
                  grib_filter,
                  gds_alt)

glimpse(nom)

## Variables:
## $ data_set            (fctr) FNL, GFS 1.0x1.0 Degree, GFS 0.5x0.5 Degr...
## $ data_set_descr_link (fctr) txt_descriptions/fnl_doc.shtml, txt_descr...
## $ freq                (fctr) 6 hours, 6 hours, 6 hours, 12 hours, 6 ho...
## $ grib_filter         (fctr) cgi-bin/filter_fnl.pl, cgi-bin/filter_gfs...
## $ gds_alt             (fctr) dods-alt/fnl, dods-alt/gfs, dods-alt/gfs_...

head(nom)

##                             data_set
## 1                                FNL
## 2                 GFS 1.0x1.0 Degree
## 3                 GFS 0.5x0.5 Degree
## 4                 GFS 2.5x2.5 Degree
## 5       GFS Ensemble high resolution
## 6 GFS Ensemble Precip Bias-Corrected
##
##                                             data_set_descr_link     freq
## 1                                txt_descriptions/fnl_doc.shtml  6 hours
## 2                txt_descriptions/GFS_high_resolution_doc.shtml  6 hours
## 3                    txt_descriptions/GFS_half_degree_doc.shtml  6 hours
## 4                 txt_descriptions/GFS_Low_Resolution_doc.shtml 12 hours
## 5       txt_descriptions/GFS_Ensemble_high_resolution_doc.shtml  6 hours
## 6 txt_descriptions/GFS_Ensemble_precip_bias_corrected_doc.shtml    daily
##
##                       grib_filter          gds_alt
## 1           cgi-bin/filter_fnl.pl     dods-alt/fnl
## 2           cgi-bin/filter_gfs.pl     dods-alt/gfs
## 3        cgi-bin/filter_gfs_hd.pl  dods-alt/gfs_hd
## 4       cgi-bin/filter_gfs_2p5.pl dods-alt/gfs_2p5
## 5          cgi-bin/filter_gens.pl    dods-alt/gens
## 6 cgi-bin/filter_gensbc_precip.pl dods-alt/gens_bc

Please make sure the columns match. I eyeballed it, but a verification would be awesome. NOTE: there may be a better way to do the sapplys (anyone should feel free to edit that in, too, crediting yourself).
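Since each column is pulled with an independent XPath query, one quick sanity check (a sketch, not part of the original answer) is to confirm the extracted vectors all have the same length before assembling the data frame; otherwise data.frame() would recycle values and silently misalign rows:

```r
# All per-column vectors extracted above must have the same length,
# or rows from different models would end up paired together.
cols <- list(data_set = data_set,
             data_set_descr_link = data_set_descr_link,
             freq = freq,
             grib_filter = grib_filter,
             gds_alt = gds_alt)
stopifnot(length(unique(lengths(cols))) == 1)
```

Beyond that, spot-checking a few rows of `nom` against the rendered page by eye is probably still needed.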

It's really fragile code, i.e. if the format changes it'll croak (but that's kinda true for all scraping). It should withstand them actually producing valid HTML (this is wretched HTML, btw), but most of the other column extractions key off the http column, so the code relies on that column remaining valid. Your missing model is there as well. If any of the XPath is confusing, drop a comment and I'll try to 'splain.

Sometimes you just have to fix bad HTML, so you could add the missing tr tags to the start of those rows.

url <- "http://nomads.ncep.noaa.gov/"
x <- readLines(url, encoding="UTF-8")
doc <- htmlParse(x)

# check nodes after subheaders - only 2 of 5 rows missing tr (2nd and 3rd element)
getNodeSet(doc, "//td[@colspan='7']/../following-sibling::*[1]")
# fix text - probably some way to fix XML doc too?
n <- grep(">AQM Daily Maximum<", x)
x[n] <- paste0("<tr>", x[n])
n <- grep(">RTOFS Atlantic<", x)
x[n] <- paste0("<tr>", x[n])

doc <- htmlParse(x)
## ok..
getNodeSet(doc, "//td[@colspan='7']/../following-sibling::*[1]")
readHTMLTable(doc, which=9, header=TRUE)

                                      Data Set     freq grib filter http     gds-alt
1                                 Global Models     <NA>        <NA> <NA>        <NA>
2                                           FNL  6 hours grib filter http OpenDAP-alt
3                            GFS 1.0x1.0 Degree  6 hours grib filter http OpenDAP-alt
...
16 Climate Forecast System 3D Pressure Products  6 hours grib filter http           -
17                              Regional Models     <NA>        <NA> <NA>        <NA>
18                            AQM Daily Maximum 06Z, 12Z grib filter http OpenDAP-alt
19                     AQM Hourly Surface Ozone 06Z, 12Z grib filter http OpenDAP-alt
20                                 HIRES Alaska    daily grib filter http OpenDAP-alt
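To avoid hard-coding each model name, the same repair could be sketched generically. This assumes (which may not hold for this page's actual source) that each malformed row is a source line beginning with a `<td>` cell immediately after a line that closed the previous row:

```r
# Generic repair sketch: prepend <tr> to source lines that start with a <td>
# cell even though the previous line already closed its row.
n <- grep("^\\s*<td", x)
n <- n[n > 1]                                  # guard against x[0]
needs_tr <- n[grepl("</tr>\\s*$", x[n - 1])]   # previous line closed a row
x[needs_tr] <- paste0("<tr>", x[needs_tr])
```

The grep-by-name version above is safer when you know exactly which rows are broken; this variant just saves re-editing the script if NOAA adds another malformed row.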
