Extracting text from HTML page in R

I am working with the DrugBank database and need help extracting specific text from the HTML below:

<table>
<tr>
    <td>Text</td>
</tr>
<tr>
    <th>ATC Codes</th>
    <td>B01AC05
        <ul class="atc-drug-tree">
            <li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
            <li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
            <li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
            <li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
        </ul>
    </td>
</tr>
<tr>
    <td>Text</td>
</tr>
</table>

I want the following text as my output, stored in a list object:

B01AC05
B01AC — Platelet aggregation inhibitors excl. heparin
B01A — ANTITHROMBOTIC AGENTS
B01 — ANTITHROMBOTIC AGENTS
B — BLOOD AND BLOOD FORMING ORGANS

I have tried the function below, but it's not working:

library(XML)

getATC <- function(id){
    url    <- "http://www.drugbank.ca/drugs/"
    dburl  <- paste(url, id, sep ="")
    tables <- readHTMLTable(dburl, header = F)
    table  <- tables[['atc-drug-tree']]
    table
}

ids  <- c("DB00208", "DB00209")
ref  <- apply(ids, 1, getATC)

NB: The URL can be used to see the actual page I want to parse; the HTML snippet I provided was just an example.

Thanks

rvest makes web scraping pretty simple. Here's a solution using it.

library("rvest")
library("stringr")
your_html <- read_html('<table>
<tr>
          <td>Text</td>
          </tr>
          <tr>
          <th>ATC Codes</th>
          <td>B01AC05
          <ul class="atc-drug-tree">
          <li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
          <li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
          <li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
          <li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
          </ul>
          </td>
          </tr>
          <tr>
          <td>Text</td>
          </tr>
          </table>')
your_name <- 
  your_html %>% 
  html_nodes(xpath='//th[contains(text(), "ATC Codes")]/following-sibling::td') %>%
  html_text() %>%
  str_extract(".+(?=\n)")
list_elements <- 
  your_html %>%  html_nodes("li") %>% html_nodes("a") %>% html_text()
your_list <- list()
your_list[[your_name]] <- list_elements
> your_list
$B01AC05
[1] "B01AC — Platelet aggregation inhibitors excl. heparin"
[2] "B01A — ANTITHROMBOTIC AGENTS"                         
[3] "B01 — ANTITHROMBOTIC AGENTS"                          
[4] "B — BLOOD AND BLOOD FORMING ORGANS"        
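
To apply the same idea to several pages, the extraction can be wrapped in a function. What follows is only a minimal sketch, not part of the original answer: the name get_atc and the sample_html string are made up for illustration, and it assumes each page holds a single ul with class atc-drug-tree whose ATC code is the text node directly before it.

```r
library(rvest)

# Hypothetical helper (illustrative name): x is a URL or a string of HTML.
# Assumes one <ul class="atc-drug-tree"> per page.
get_atc <- function(x) {
  page <- read_html(x)
  # The ATC code is the text node immediately before the <ul>
  code <- page %>%
    html_nodes(xpath = '//ul[@class="atc-drug-tree"]/preceding-sibling::text()[1]') %>%
    html_text() %>%
    trimws()
  # Link text of every entry in the tree
  items <- page %>%
    html_nodes("ul.atc-drug-tree li a") %>%
    html_text()
  c(code, items)
}

# Shortened copy of the snippet from the question, used as offline input
sample_html <- '<table><tr><th>ATC Codes</th>
<td>B01AC05
  <ul class="atc-drug-tree">
    <li><a href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
    <li><a href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
  </ul>
</td></tr></table>'

res <- get_atc(sample_html)
print(res)

# For live pages (not run here; the site layout may have changed):
# ids <- c("DB00208", "DB00209")
# lapply(paste0("http://www.drugbank.ca/drugs/", ids), get_atc)
```

If a page contains more than one tree, this sketch would mix their entries together, so the items would need to be collected per ul instead.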

Build the URL strings and sapply over them with the getDrugs function. It parses the HTML, takes the root of the HTML tree, finds each ul node with the indicated class, and returns its parent's text (only the part before the first whitespace) followed by the text of each ./li/a grandchild:

library(XML)

getDrugs <- function(...) {
   doc <- htmlTreeParse(..., useInternalNodes = TRUE)
   xpathApply(xmlRoot(doc), "//ul[@class='atc-drug-tree']", function(node) {
     c(sub("\\s.*", "", xmlValue(xmlParent(node))), # get text before 1st whitespace
     xpathSApply(node, "./li/a", xmlValue)) # get text in each ./li/a node
   })
}


ids  <- c("DB00208", "DB00209")
urls <- paste0("http://www.drugbank.ca/drugs/", ids)
L <- sapply(urls, getDrugs)

giving the following list (one component per URL and, within each, one component for each atc-drug-tree list found at that URL):

> L
$`http://www.drugbank.ca/drugs/DB00208`
$`http://www.drugbank.ca/drugs/DB00208`[[1]]
[1] "B01AC05B01AC"                                         
[2] "B01AC — Platelet aggregation inhibitors excl. heparin"
[3] "B01A — ANTITHROMBOTIC AGENTS"                         
[4] "B01 — ANTITHROMBOTIC AGENTS"                          
[5] "B — BLOOD AND BLOOD FORMING ORGANS"                   


$`http://www.drugbank.ca/drugs/DB00209`
$`http://www.drugbank.ca/drugs/DB00209`[[1]]
[1] "A03DA06A03DA"                                                           
[2] "A03DA — Synthetic anticholinergic agents in combination with analgesics"
[3] "A03D — ANTISPASMODICS IN COMBINATION WITH ANALGESICS"                   
[4] "A03 — DRUGS FOR FUNCTIONAL GASTROINTESTINAL DISORDERS"                  
[5] "A — ALIMENTARY TRACT AND METABOLISM"                                    

$`http://www.drugbank.ca/drugs/DB00209`[[2]]
[1] "A03DA06A03DA"                                        
[2] "G04BD — Drugs for urinary frequency and incontinence"
[3] "G04B — UROLOGICALS"                                  
[4] "G04 — UROLOGICALS"                                   
[5] "G — GENITO URINARY SYSTEM AND SEX HORMONES"          

We could create a 5x3 matrix out of the above like this:

simplify2array(do.call(c, L))

And here is a test using the input in the question:

Lines <- '<table>
<tr>
    <td>Text</td>
</tr>
<tr>
    <th>ATC Codes</th>
    <td>B01AC05
        <ul class="atc-drug-tree">
            <li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
            <li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
            <li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
            <li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
        </ul>
    </td>
</tr>
<tr>
    <td>Text</td>
</tr>
</table>'

getDrugs(Lines, asText = TRUE)

giving:

[[1]]
[1] "B01AC05"                                               
[2] "B01AC — Platelet aggregation inhibitors excl. heparin"
[3] "B01A — ANTITHROMBOTIC AGENTS"                         
[4] "B01 — ANTITHROMBOTIC AGENTS"                          
[5] "B — BLOOD AND BLOOD FORMING ORGANS"    

readHTMLTable is not working because it can't read the headers in tables 3 and 4.

url <- "http://www.drugbank.ca/drugs/DB00208"
doc <- htmlParse(readLines(url))
summary(doc)
$nameCounts

      td        a       tr       li       th     span      div        p   strong      img    table ...
     745      399      342      175      159      137       66       49       46       27       27  

# these two calls raise errors
readHTMLTable(doc)
readHTMLTable(doc, which=3)   
# this works
readHTMLTable(doc, which=3, header=FALSE)

Also, the ATC codes are not inside a nearby table tag, so you have to use XPath as in the other answers here.

xpathSApply(doc, '//ul[@class="atc-drug-tree"]/*', xmlValue)
[1] "B01AC — Platelet aggregation inhibitors excl. heparin" "B01A — ANTITHROMBOTIC AGENTS"                         
[3] "B01 — ANTITHROMBOTIC AGENTS"                           "B — BLOOD AND BLOOD FORMING ORGANS"     

xpathSApply(doc, '//ul[@class="atc-drug-tree"]/../node()[1]', xmlValue)
[1] "B01AC05"
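
Separately from readHTMLTable, the call apply(ids, 1, getATC) in the question cannot work either: apply expects an array or matrix, and a plain character vector has no dim attribute. A small sketch of the difference, using nchar as a stand-in function so that nothing is downloaded:

```r
ids <- c("DB00208", "DB00209")

# A plain character vector has no dimensions
print(is.null(dim(ids)))

# apply() therefore stops with an error; capture the message instead of aborting
msg <- tryCatch(apply(ids, 1, nchar), error = function(e) conditionMessage(e))
print(msg)

# sapply()/lapply() iterate over the elements of a vector directly
urls <- sapply(ids, function(id) paste0("http://www.drugbank.ca/drugs/", id))
print(urls)
```

Swapping apply for sapply or lapply, as the second answer does, fixes the iteration; the extraction inside the function still needs the XPath approach.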
