Parsing HTML tables using the XML / RCurl R packages, without using the readHTMLTable function

Question

I am trying to scrape data from the single HTML table on http://www.theplantlist.org/tpl/record/kew-419248 and a number of very similar pages. I initially tried the following function to read the table, but it wasn't ideal because I want to separate each species name into its component parts (genus, species, infraspecies, author, etc.).

library(XML)
readHTMLTable("http://www.theplantlist.org/tpl/record/kew-419248")

I used SelectorGadget to identify an XPath expression for each table element that I want to extract (not necessarily the shortest):

For genus names: //*[contains(concat( " ", @class, " " ), concat( " ", "Synonym", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "genus", " " ))]

For species names: //*[contains(concat( " ", @class, " " ), concat( " ", "Synonym", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "species", " " ))]

For infraspecies ranks: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]

For infraspecies names: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspe", " " ))]

For confidence levels (image): //*[contains(concat( " ", @class, " " ), concat( " ", "synonyms", " " ))]//img

For sources: //*[contains(concat( " ", @class, " " ), concat( " ", "source", " " ))]//a
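The contains(concat(" ", @class, " "), ...) pattern in these expressions matches a single class token inside a class attribute that may hold several. Here is a minimal sketch of the difference, run on a made-up fragment rather than the real page (the cell contents and the class "Synonyms" are assumptions for illustration only):

```r
library(XML)

# Made-up fragment: one cell with two classes, one with exactly "Synonym",
# and one whose class merely contains the substring "Synonym".
html <- '<table><tr><td class="name Synonym">a</td><td class="Synonym">b</td><td class="Synonyms">c</td></tr></table>'
doc <- htmlParse(html, asText = TRUE)

# Exact comparison: only matches when the whole attribute equals "Synonym".
exact <- xpathSApply(doc, "//*[@class='Synonym']", xmlValue)

# Token match: pads the attribute with spaces and looks for " Synonym ".
token <- xpathSApply(doc,
  "//*[contains(concat(' ', @class, ' '), ' Synonym ')]", xmlValue)

# Plain substring match: also picks up the unrelated class "Synonyms".
naive <- xpathSApply(doc, "//*[contains(@class, 'Synonym')]", xmlValue)
```

The token form is longer, but it is the only one of the three that selects both cells carrying the class without also selecting the lookalike.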

I now want to extract the information into a dataframe/table.

I tried using the xpathSApply function of the XML package to extract some of this data, e.g. for infraspecies ranks:

library(XML)
library(RCurl)
infraspeciesrank = htmlParse(getURL("http://www.theplantlist.org/tpl/record/kew-419248"))
path=' //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]'
xpathSApply(infraspeciesrank, path)

However, this method is problematic because of gaps in the data: only some rows of the table have an infraspecies rank, so all I get back is a list of the three ranks that are present, with nothing to show which rows they belong to. The output is also of a class that I have had trouble attaching to a dataframe.

Does anyone know a better way to extract information from this table into a dataframe?

Any help would be much appreciated!

Tom

Answer 1

Here is another solution, which splits each species name into its component parts:

library(XML)
library(plyr)

# read url into html tree
url = "http://www.theplantlist.org/tpl/record/kew-419248"
doc = htmlTreeParse(url, useInternalNodes = TRUE)

# extract nodes containing desired information
xp_expr = "//table[@class= 'names synonyms']/tbody/tr"
nodes = getNodeSet(doc, xp_expr)

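# function to extract desired fields from a given node
fields = c('genus', 'species', 'infraspe', 'authorship')
read_node = function(node){

    # look up each field by class within this row; a missing field gives a length-0 result
    dl = lapply(fields, function(x) xpathSApply(node, 
       paste(".//*[@class = ", "'", x, "'", "]", sep = ""), xmlValue))
    # start from blanks so every row has the same number of columns,
    # then fill in only the fields that were found
    tmp = rep(' ', length(dl))
    tmp[sapply(dl, length) == 1] = unlist(dl)
    # read the confidence level from the alt attribute of the row's image
    confidence = xpathSApply(node, './/img', xmlGetAttr, 'alt')
    return(c(tmp, confidence))
}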

# apply function to all nodes and return data frame
df = ldply(nodes, read_node)
names(df) = c(fields, 'confidence')

It produces the following output:

      genus      species     infraspe               authorship confidence
1 Critesion     chilense              (Roem. & Schult.) Á.Löve          H
2   Hordeum     chilense     chilense                                   L
3   Hordeum  cylindricum                                Steud.          H
4   Hordeum depauperatum                                Steud.          H
5   Hordeum     pratense brongniartii                Macloskie          L
6   Hordeum    secalinum     chilense                  É.Desv.          L
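The key idea is to query each row separately, so that a missing field leaves a gap in that row instead of silently shortening the result. As a rough sketch of the same approach without plyr, and with NA in place of blanks, something like the following might work. The HTML below is a made-up stand-in for the real page, first_or_na is a hypothetical helper, and adding "infraspr" to the field list is an assumption based on the class names in the question:

```r
library(XML)

# Made-up rows imitating the synonym table; the real page's markup may differ.
html <- '<table class="names synonyms"><tbody>
<tr><td><i class="genus">Hordeum</i> <i class="species">chilense</i>
  <span class="infraspr">var.</span> <i class="infraspe">chilense</i></td></tr>
<tr><td><i class="genus">Hordeum</i> <i class="species">cylindricum</i>
  <span class="authorship">Steud.</span></td></tr>
</tbody></table>'

doc   <- htmlParse(html, asText = TRUE)
nodes <- getNodeSet(doc, "//table[@class='names synonyms']/tbody/tr")

fields <- c("genus", "species", "infraspr", "infraspe", "authorship")

# Hypothetical helper: text of the first descendant with the given class,
# or NA when the row has no such element.
first_or_na <- function(node, cls) {
  hit <- xpathSApply(node, sprintf(".//*[@class='%s']", cls), xmlValue)
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

# One named character vector per row, stacked into a matrix, then a data frame.
rows <- lapply(nodes, function(n) {
  vapply(fields, function(f) first_or_na(n, f), character(1))
})
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
```

Using NA rather than a blank string makes the gaps easy to find later with is.na(), at the cost of printing NA in the table.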

Answer 2

The following code parses your table into a matrix.

Caveats:

The confidence level column is blank, since this is not text but an image. If this is important, you should be able to retrieve the image location, and parse that.
There are some encoding issues (UTF-8 characters get converted into ASCII on my machine). I don't yet know how to fix this.

The code:

library(XML)
library(RCurl)

baseURL <- "http://www.theplantlist.org/tpl/record/kew-419248"
txt <- getURL(url=baseURL)

xmltext <- htmlParse(txt, asText=TRUE)
xmltable <- xpathApply(xmltext, "//table//tbody//tr")
t(sapply(xmltable, function(x)unname(xmlSApply(x, xmlValue))[c(1, 3, 5, 7)]))

The results:

     [,1]                                                [,2]      [,3] [,4]  
[1,] "Critesion chilense (Roem. & Schult.) Ã.LÃ¶ve" "Synonym" ""   "WCSP"
[2,] "Hordeum chilense var. chilense "                   "Synonym" ""   "TRO" 
[3,] "Hordeum cylindricum Steud. [Illegitimate]"         "Synonym" ""   "WCSP"
[4,] "Hordeum depauperatum Steud."                       "Synonym" ""   "WCSP"
[5,] "Hordeum pratense var. brongniartii Macloskie"      "Synonym" ""   "WCSP"
[6,] "Hordeum secalinum var. chilense Ã.Desv."        "Synonym" ""   "WCSP"
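On the first caveat, the image's attributes can be read per row with xmlGetAttr. Below is a minimal sketch on made-up rows; the image paths, the alt values and the attr_or_na helper are assumptions for illustration, not taken from the real page. On the second caveat, htmlParse accepts an encoding argument, and passing encoding = "UTF-8" is one thing to try, though whether it resolves the problem on a given machine is not verified here.

```r
library(XML)

# Made-up rows; the real table's markup and image paths may differ.
html <- '<table><tbody>
<tr><td>Hordeum cylindricum Steud.</td><td>Synonym</td>
    <td><img src="/img/H.png" alt="H"/></td><td>WCSP</td></tr>
<tr><td>Hordeum chilense var. chilense</td><td>Synonym</td>
    <td><img src="/img/L.png" alt="L"/></td><td>TRO</td></tr>
</tbody></table>'

# The encoding argument tells the parser how to decode the input.
doc  <- htmlParse(html, asText = TRUE, encoding = "UTF-8")
rows <- getNodeSet(doc, "//table//tbody//tr")

# Hypothetical helper: value of the named attribute on the row's first image,
# or NA when the row has no image.
attr_or_na <- function(row, attr_name) {
  hit <- xpathSApply(row, ".//img", xmlGetAttr, attr_name)
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

confidence <- vapply(rows, attr_or_na, character(1), attr_name = "alt")
img_src    <- vapply(rows, attr_or_na, character(1), attr_name = "src")
```

Either vector can then be bound to the matrix above as an extra column, since both have one entry per table row.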
