Scraping html table with images using XML R package

I want to scrape HTML tables using the XML package for R, in a similar way to that discussed in this thread:

Scraping html tables into R data frames using the XML package

The main difference is that I also want the text associated with an image in the HTML table. For example, the table at http://www.theplantlist.org/tpl/record/kew-422570 contains a "Confidence" column with an image showing one to three stars. If I use:

readHTMLTable("http://www.theplantlist.org/tpl/record/kew-422570")

then the output column for "Confidence" is blank apart from the header. Is there any way to get some form of data in this column, for example the HTML code linking to the appropriate image?

Any suggestions of how to go about this would be much appreciated!

I was able to find the XPath query for the image name using SelectorGadget:

library(XML)
library(RCurl)
d = htmlParse(getURL("http://www.theplantlist.org/tpl/record/kew-422570"))
path = '//*[contains(concat( " ", @class, " " ), concat( " ", "synonyms", " " ))]//img'

xpathSApply(d, path, xmlAttrs)["src",]

[1] "/img/H.png" "/img/L.png" "/img/H.png" "/img/H.png" "/img/H.png"
[6] "/img/H.png" "/img/H.png"
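If only the one-letter code is wanted rather than the full image path, the file names can be trimmed with base R. This is a minimal sketch that hard-codes the `src` values printed above; the variable names `src` and `level` are illustrative only.

```r
# Image paths as returned by the XPath query above (copied from its output).
src <- c("/img/H.png", "/img/L.png", "/img/H.png", "/img/H.png",
         "/img/H.png", "/img/H.png", "/img/H.png")

# Drop the directory part, then the ".png" extension, leaving the code.
level <- sub("\\.png$", "", basename(src))

print(level)
# [1] "H" "L" "H" "H" "H" "H" "H"
```

The resulting vector has one entry per table row, so it can be assigned to the empty "Confidence" column of the data frame returned by readHTMLTable.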

Here's an rvest solution with an even simpler CSS selector:

library(rvest)

pg <- read_html("http://www.theplantlist.org/tpl/record/kew-422570")  # html() is deprecated in newer rvest releases
pg %>% html_nodes("td > img") %>% html_attr("src")

## [1] "/img/H.png" "/img/L.png" "/img/H.png" "/img/H.png" "/img/H.png"
## [6] "/img/H.png" "/img/H.png"
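To get the values into the table itself, the same selector can be combined with `html_table()`. The sketch below runs offline on a small hand-written HTML snippet that only imitates the structure of the real page (the snippet, its two rows, and the "Confidence" header are assumptions), and it assumes a recent rvest in which `read_html()` accepts literal HTML. It reads the `alt` attribute instead of `src`.

```r
library(rvest)

# Hypothetical stand-in for the real page: the last column holds only an image.
snippet <- '
<table class="synonyms">
  <tr><th>Name</th><th>Confidence</th></tr>
  <tr><td>Elymus arenarius L.</td><td><img src="/img/H.png" alt="H"></td></tr>
  <tr><td>Elymus geniculatus Curtis</td><td><img src="/img/L.png" alt="L"></td></tr>
</table>'

pg <- read_html(snippet)

# html_table() returns one table per <table> node; image-only cells come back empty.
tbl <- html_table(pg)[[1]]

# Overwrite the empty column with the alt text of each image, in row order.
tbl$Confidence <- pg %>% html_nodes("td > img") %>% html_attr("alt")

print(tbl)
```

This relies on every data row containing exactly one image; if some rows have none, the vector lengths will not match and the assignment will fail.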

You could also use the elFun argument of readHTMLTable to extract that attribute, following section 5.2.2.1 in the XML book. (I had to add ... to the function's arguments to avoid an unused argument error.)

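# Cell reader: return the image's alt text for cells holding an <img>,
# and the plain cell text otherwise.
getCL <- function(node, ...) {
  if (xmlName(node) == "td" && !is.null(node[["img"]]))
    xmlGetAttr(node[["img"]], "alt")
  else
    xmlValue(node)
}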

url <- "http://www.theplantlist.org/tpl/record/kew-422570"
readHTMLTable(url, which=1, elFun = getCL)

                                                Name  Status  Confidence level Source
1                                Elymus arenarius L. Synonym                 H   WCSP
2 Elymus arenarius subsp. geniculatus (Curtis) Husn. Synonym                 L    TRO
3                Elymus geniculatus Curtis [Invalid] Synonym                 H   WCSP
4              Frumentum arenarium (L.) E.H.L.Krause Synonym                 H   WCSP
5                       Hordeum arenarium (L.) Asch. Synonym                 H   WCSP
6                            Hordeum villosum Moench Synonym                 H   WCSP
7                    Triticum arenarium (L.) F.Herm. Synonym                 H   WCSP
