rvest html scraping text from span

I'm trying to get just the coordinates from this page, http://hol.osu.edu/spmInfo.html?id=CMNHENT0042647 . When I try to get the text, all I get back is " ".

library(rvest)

ID<-"CMNHENT0042647"

HOLWebSite<-read_html(paste0("http://hol.osu.edu/spmInfo.html?id=",ID))

Coords<-HOLWebSite%>%
  html_nodes("span#hymSpmCoordsID.boldedText")%>%
  html_text()

Is it because it is in a span?

What's actually in the span in the scraped page is <span class="boldedText" id="hymSpmCoordsID">\n <!-- To Be DB Generated //-->\n</span> . There are no coordinates in the HTML: the span holds only whitespace and a placeholder comment, which is why html_text() returns " ".

You can verify this by going to the page and viewing the source.
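The same check can be reproduced offline. The sketch below is only an illustration: it parses the span markup quoted above from a literal string (the variable names `html`, `page` and `txt` are made up for this example) and shows that the selector matches one node whose text is nothing but whitespace.

```r
library(rvest)

# The span exactly as it appears in the scraped page source (quoted above)
html <- '<span class="boldedText" id="hymSpmCoordsID">\n <!-- To Be DB Generated //-->\n</span>'

# read_html() accepts a literal HTML string as well as a URL
page <- read_html(html)

txt <- page %>%
  html_nodes("span#hymSpmCoordsID.boldedText") %>%
  html_text()

# One node is matched, but its text is whitespace only:
# the comment is not part of the text content
length(txt)
nzchar(trimws(txt))
```

So the selector itself is fine; there is simply no coordinate text in the static HTML to extract.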

You can grab it this way:

library(httr)
library(jsonlite)

get_specimen_info <- function(specimen) {

  # Call the endpoint the page itself queries for the specimen data
  res <- GET(
    url = "http://hol.osu.edu/hymDB/OJ_Break.getSpmInfo",
    query = list(
      cuid = specimen,
      callback = "",
      # cache-busting timestamp in milliseconds
      noCacheIE = round(as.numeric(Sys.time()) * 1000)
    ),
    add_headers(Referer = sprintf("http://hol.osu.edu/spmInfo.html?id=%s", specimen)),
    set_cookies(hymShowInfo = "Y")
  )

  stop_for_status(res)

  # Strip the leading "(" and trailing ");" wrapper, then parse the JSON
  res <- trimws(content(res, as="text"))
  res <- gsub("^\\(|\\);$", "", res)
  jsonlite::fromJSON(res)

}

The page retrieves the data dynamically, and that function (which takes a specimen ID as a parameter) mimics the call the page makes.
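The wrapper-stripping step can be tried without touching the network. This is a rough sketch: the `raw` string is a hand-written, heavily abbreviated stand-in for the server's reply, and its "(" ... ");" shape is an assumption inferred from the regular expression in the function, not a captured response.

```r
library(jsonlite)

# Hypothetical, abbreviated stand-in for the response body (not real server output)
raw <- '({"spmInfo":{"cuid":"CMNHENT0042647","lat":41.3,"lng":-84.4}});'

# Remove the leading "(" and the trailing ");" so only the JSON remains
json <- gsub("^\\(|\\);$", "", trimws(raw))

# Parse the JSON into a nested R list
parsed <- fromJSON(json)

parsed$spmInfo$cuid
parsed$spmInfo$lat
```

Without the stripping step, `fromJSON()` would be handed text that starts with a parenthesis and would fail to parse it.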

Now to use it:

spec <- get_specimen_info("CMNHENT0042647")

str(spec)
## List of 1
##  $ spmInfo:List of 47
##   ..$ cuid              : chr "CMNHENT0042647"
##   ..$ alt_ids           : list()
##   ..$ loc_id            : int 9661
##   ..$ loc_name          : chr "Defiance Township, Defiance Co., OH"
##   ..$ lat               : num 41.3
##   ..$ lng               : num -84.4
##   ..$ elev              : chr ""
##   ..$ max_elev          : chr ""
##   ..$ prec_type         : chr "POINT"
##   ..$ loc_comments      : chr ""
##   ..$ coord_source      : chr "USGS-GNIS"
##   ..$ hier              :List of 7
##   .. ..$ place:List of 3
##   .. .. ..$ id  : chr "202"
##   .. .. ..$ name: chr "Defiance"
##   .. .. ..$ type: chr "County"
##   .. ..$ pol2 :List of 3
##   .. .. ..$ id  : chr "202"
##   .. .. ..$ name: chr "Defiance"
##   .. .. ..$ type: chr "County"
##   .. ..$ pol1 :List of 3
##   .. .. ..$ id  : chr "82"
##   .. .. ..$ name: chr "Ohio"
##   .. .. ..$ type: chr "State"
##   .. ..$ pol0 :List of 3
##   .. .. ..$ id  : chr "81"
##   .. .. ..$ name: chr "United States"
##   .. .. ..$ type: chr "Country"
##   .. ..$ pol-1:List of 3
##   .. .. ..$ id  : chr "23"
##   .. .. ..$ name: chr "North America"
##   .. .. ..$ type: chr "Continent"
##   .. ..$ pol-3:List of 3
##   .. .. ..$ id  : chr "5621"
##   .. .. ..$ name: chr "Western Hemisphere"
##   .. .. ..$ type: chr "Hemisphere"
##   .. ..$ pol-4:List of 3
##   .. .. ..$ id  : chr "0"
##   .. .. ..$ name: chr "Earth"
##   .. .. ..$ type: chr ""
##   ..$ coll_event_id     : chr "343832"
##   ..$ coll_method       : chr "none specified"
##   ..$ coll_date         : chr "18 August 1981"
##   ..$ coll_date_alt     : chr "18.VIII.1981"
##   ..$ coll_time         :List of 2
##   .. ..$ start: chr ""
##   .. ..$ end  : chr ""
##   ..$ date_type         : chr "CLOCK_TIME"
##   ..$ field_code        : chr ""
##   ..$ collector         : chr "Perry, T. E."
##   ..$ collector_alt     : chr "T. E. Perry"
##   ..$ collector_extended:'data.frame':  1 obs. of  6 variables:
##   .. ..$ last_name   : chr "Perry"
##   .. ..$ first_name  : chr ""
##   .. ..$ initials    : chr "T. E."
##   .. ..$ generation  : chr ""
##   .. ..$ name_order  : chr "W"
##   .. ..$ collector_id: int 33377
##   ..$ determinations    :'data.frame':  1 obs. of  17 variables:
##   .. ..$ det_id       : int 2217760
##   .. ..$ tnuid        : int 355808
##   .. ..$ id           : int 355808
##   .. ..$ taxon        : chr "Macromia taeniolata"
##   .. ..$ author       : chr "Rambur"
##   .. ..$ det_date     : chr "2016"
##   .. ..$ status       : chr "Original name/combination"
##   .. ..$ det_status   : chr "CURRENT"
##   .. ..$ type_status  : chr ""
##   .. ..$ determiner_id: int 0
##   .. ..$ cu_coll_id   : chr ""
##   .. ..$ coll_id      : chr ""
##   .. ..$ rank         : chr "Species"
##   .. ..$ valid        : chr "Valid"
##   .. ..$ homonym      : chr "N"
##   .. ..$ common_names :List of 1
##   .. .. ..$ : chr [1:2] "Wabash River Cruiser" "Royal River Cruiser"
##   .. ..$ determiner   : chr ""
##   ..$ Class             : chr "Hexapoda"
##   ..$ Genus             : chr "Macromia"
##   ..$ Species           : chr "Macromia taeniolata"
##   ..$ Family            : chr "Corduliidae"
##   ..$ Phylum            : chr "Arthropoda"
##   ..$ Kingdom           : chr "Animalia"
##   ..$ Order             : chr "Odonata"
##   ..$ habitat           : chr ""
##   ..$ associations      : list()
##   ..$ spm_sex           : chr "M"
##   ..$ spm_num           : chr "1"
##   ..$ life_status       : chr "adult"
##   ..$ inst_id           : chr "195"
##   ..$ inst_name         : chr "Cleveland Museum of Natural History, OH"
##   ..$ inst_code         : chr "CLEV"
##   ..$ vouchered         : logi TRUE
##   ..$ comments          : chr "[OH, Defiance Co., Defiance, 18-AUG-1981, T. E. Perry, coll.] [ADP - CMNH 13395]"
##   ..$ enterer           : chr "Roberta DeSalvo"
##   ..$ updater           : chr "hmajewski"
##   ..$ date_recorded     : chr "2-AUG-2016"
##   ..$ preparations      :'data.frame':  1 obs. of  3 variables:
##   .. ..$ prep_type    : chr "pin"
##   .. ..$ prep_contents: chr ""
##   .. ..$ num_preps    : int 1
##   ..$ images            : list()
##   ..$ sequences         : list()
##   ..$ last_update       : chr "2016-08-15T12:32:24Z"
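To get just the coordinates the question asks for, pull `lat` and `lng` out of the `spmInfo` element. A minimal sketch, using a hand-built stand-in for `spec` so it runs offline; the stand-in carries only three of the 47 fields, and its values are the rounded ones printed by `str()` above, so the real result may have more decimal places.

```r
# Stand-in for the list returned by get_specimen_info() (assumed subset of fields)
spec <- list(
  spmInfo = list(
    cuid = "CMNHENT0042647",
    lat  = 41.3,
    lng  = -84.4
  )
)

# Collect latitude and longitude into a named numeric vector
coords <- c(lat = spec$spmInfo$lat, lng = spec$spmInfo$lng)

coords
```

With the real `spec` from the call above, only the last two statements are needed.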
