Extracting text from HTML page in R

I am working with the DrugBank database and need help extracting specific text from the HTML code below:

<table>
<tr>
    <td>Text</td>
</tr>
<tr>
    <th>ATC Codes</th>
    <td>B01AC05
        <ul class="atc-drug-tree">
            <li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
            <li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
            <li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
            <li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
        </ul>
    </td>
</tr>
<tr>
    <td>Text</td>
</tr>
</table>

I want the following text as my output, as a list object:

B01AC05
B01AC — Platelet aggregation inhibitors excl. heparin
B01A — ANTITHROMBOTIC AGENTS
B01 — ANTITHROMBOTIC AGENTS
B — BLOOD AND BLOOD FORMING ORGANS

I have tried the function below, but it is not working:

library(XML)

getATC <- function(id){
    url    <- "http://www.drugbank.ca/drugs/"
    dburl  <- paste(url, id, sep ="")
    tables <- readHTMLTable(dburl, header = F)
    table  <- tables[['atc-drug-tree']]
    table
}

ids  <- c("DB00208", "DB00209")
ref  <- apply(ids, 1, getATC)

NB: The URL can be used to see the actual page I want to parse; the HTML snippet I provided is just an example.

Thanks

rvest makes web scraping pretty simple. Here's a solution using it.

library("rvest")
library("stringr")
your_html <- read_html('<table>
<tr>
          <td>Text</td>
          </tr>
          <tr>
          <th>ATC Codes</th>
          <td>B01AC05
          <ul class="atc-drug-tree">
          <li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
          <li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
          <li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
          <li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
          </ul>
          </td>
          </tr>
          <tr>
          <td>Text</td>
          </tr>
          </table>')
your_name <- 
  your_html %>% 
  html_nodes(xpath='//th[contains(text(), "ATC Codes")]/following-sibling::td') %>%
  html_text() %>%
  str_extract(".+(?=\n)")
list_elements <- 
  your_html %>%  html_nodes("li") %>% html_nodes("a") %>% html_text()
your_list <- list()
your_list[[your_name]] <- list_elements
> your_list
$B01AC05
[1] "B01AC — Platelet aggregation inhibitors excl. heparin"
[2] "B01A — ANTITHROMBOTIC AGENTS"                         
[3] "B01 — ANTITHROMBOTIC AGENTS"                          
[4] "B — BLOOD AND BLOOD FORMING ORGANS"        
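
The snippet above assumes one ATC block per page and a newline right after the code. A minimal sketch of a variant that returns one vector per `ul.atc-drug-tree` and reads the code from the first text node of the parent cell is shown below. The helper name `extract_atc` is hypothetical, and the HTML is a shortened copy of the snippet from the question, so behaviour on the live page is an assumption.

```r
library(rvest)

# Hypothetical helper (not part of the answer above): returns a list with
# one character vector per <ul class="atc-drug-tree"> found in the page.
extract_atc <- function(page) {
  trees <- html_nodes(page, "ul.atc-drug-tree")
  lapply(trees, function(ul) {
    # The ATC code is taken from the first text node of the parent <td>.
    code   <- html_text(html_node(ul, xpath = "../text()[1]"))
    # The hierarchy levels are the link texts inside the list.
    levels <- html_text(html_nodes(ul, "li a"))
    c(trimws(code), levels)
  })
}

# Shortened version of the HTML from the question.
page <- read_html('<table><tr><th>ATC Codes</th><td>B01AC05
  <ul class="atc-drug-tree">
    <li><a href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
    <li><a href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
  </ul></td></tr></table>')

res <- extract_atc(page)
print(res)
```

Reading the text node directly avoids depending on whitespace between the code and the list, which may not be present on every page.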

Create the URL strings and sapply over them using the getDrugs function. It parses the HTML, extracts the root of the HTML tree, finds each ul node with the indicated class, and returns its parent's text (but only up to the first whitespace) followed by the text of each ./li/a grandchild:

library(XML)

getDrugs <- function(...) {
   doc <- htmlTreeParse(..., useInternalNodes = TRUE)
   xpathApply(xmlRoot(doc), "//ul[@class='atc-drug-tree']", function(node) {
     c(sub("\\s.*", "", xmlValue(xmlParent(node))), # get text before 1st whitespace
     xpathSApply(node, "./li/a", xmlValue)) # get text in each ./li/a node
   })
}


ids  <- c("DB00208", "DB00209")
urls <- paste0("http://www.drugbank.ca/drugs/", ids)
L <- sapply(urls, getDrugs)

giving the following list (one component per URL and, within each, one component for each ATC tree found at that URL):

> L
$`http://www.drugbank.ca/drugs/DB00208`
$`http://www.drugbank.ca/drugs/DB00208`[[1]]
[1] "B01AC05B01AC"                                         
[2] "B01AC — Platelet aggregation inhibitors excl. heparin"
[3] "B01A — ANTITHROMBOTIC AGENTS"                         
[4] "B01 — ANTITHROMBOTIC AGENTS"                          
[5] "B — BLOOD AND BLOOD FORMING ORGANS"                   


$`http://www.drugbank.ca/drugs/DB00209`
$`http://www.drugbank.ca/drugs/DB00209`[[1]]
[1] "A03DA06A03DA"                                                           
[2] "A03DA — Synthetic anticholinergic agents in combination with analgesics"
[3] "A03D — ANTISPASMODICS IN COMBINATION WITH ANALGESICS"                   
[4] "A03 — DRUGS FOR FUNCTIONAL GASTROINTESTINAL DISORDERS"                  
[5] "A — ALIMENTARY TRACT AND METABOLISM"                                    

$`http://www.drugbank.ca/drugs/DB00209`[[2]]
[1] "A03DA06A03DA"                                        
[2] "G04BD — Drugs for urinary frequency and incontinence"
[3] "G04B — UROLOGICALS"                                  
[4] "G04 — UROLOGICALS"                                   
[5] "G — GENITO URINARY SYSTEM AND SEX HORMONES"          

We could create a 5x3 matrix out of the above like this:

simplify2array(do.call(c, L))
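
To see what that one-liner does without fetching anything, here is a small sketch on a made-up stand-in for L. The names `L_demo`, `url_a`, `url_b` and the placeholder strings are assumptions for illustration only; the shape (one vector for the first URL, two for the second, each of length 5) mirrors the list printed above.

```r
# Made-up stand-in for L: two "URLs", the second with two trees.
L_demo <- list(
  url_a = list(c("a1", "a2", "a3", "a4", "a5")),
  url_b = list(c("b1", "b2", "b3", "b4", "b5"),
               c("c1", "c2", "c3", "c4", "c5"))
)

# do.call(c, ...) concatenates the inner lists, removing one nesting level,
# so we end up with a flat list of three character vectors.
flat <- do.call(c, L_demo)

# simplify2array() binds equal-length vectors as the columns of a matrix.
m <- simplify2array(flat)
print(dim(m))
```

This only works when every vector has the same length; with ragged vectors `simplify2array` leaves the list unchanged.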

And here is a test using the input in the question:

Lines <- '<table>
<tr>
    <td>Text</td>
</tr>
<tr>
    <th>ATC Codes</th>
    <td>B01AC05
        <ul class="atc-drug-tree">
            <li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
            <li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
            <li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
            <li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
        </ul>
    </td>
</tr>
<tr>
    <td>Text</td>
</tr>
</table>'

getDrugs(Lines, asText = TRUE)

giving:

[[1]]
[1] "B01AC05"                                               
[2] "B01AC — Platelet aggregation inhibitors excl. heparin"
[3] "B01A — ANTITHROMBOTIC AGENTS"                         
[4] "B01 — ANTITHROMBOTIC AGENTS"                          
[5] "B — BLOOD AND BLOOD FORMING ORGANS"    

readHTMLTable is not working because it can't read the headers in tables 3 and 4.

url <- "http://www.drugbank.ca/drugs/DB00208"
doc <- htmlParse(readLines(url))
summary(doc)
$nameCounts

      td        a       tr       li       th     span      div        p   strong      img    table ...
     745      399      342      175      159      137       66       49       46       27       27  

#errors
readHTMLTable(doc)
readHTMLTable(doc, which=3)   
# this works
readHTMLTable(doc, which=3, header=FALSE)

Also, the ATC codes are not within a nearby table tag, so you have to use XPath like the other answers here.

xpathSApply(doc, '//ul[@class="atc-drug-tree"]/*', xmlValue)
[1] "B01AC — Platelet aggregation inhibitors excl. heparin" "B01A — ANTITHROMBOTIC AGENTS"                         
[3] "B01 — ANTITHROMBOTIC AGENTS"                           "B — BLOOD AND BLOOD FORMING ORGANS"     

xpathSApply(doc, '//ul[@class="atc-drug-tree"]/../node()[1]', xmlValue)
[1] "B01AC05"
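
The two queries can be combined into one function and run over several documents. The sketch below is one possible way to do that, tested only on a shortened copy of the snippet from the question; the helper name `atc_from_doc` is hypothetical. Note also that `apply(ids, 1, getATC)` in the question cannot work on a plain character vector, because `apply` needs an object with a `dim` attribute; `lapply` or `sapply` is the usual choice for vectors.

```r
library(XML)

# Hypothetical helper (not from the answers above): returns a list with one
# character vector per <ul class="atc-drug-tree"> in a parsed document.
atc_from_doc <- function(doc) {
  trees <- getNodeSet(doc, '//ul[@class="atc-drug-tree"]')
  lapply(trees, function(ul) {
    # First child node of the parent <td> holds the ATC code.
    code   <- xpathSApply(ul, "../node()[1]", xmlValue)
    # Each ./li/a holds one level of the ATC hierarchy.
    levels <- xpathSApply(ul, "./li/a", xmlValue)
    c(trimws(code), levels)
  })
}

# Shortened version of the HTML from the question.
snippet <- '<table><tr><th>ATC Codes</th><td>B01AC05
  <ul class="atc-drug-tree">
    <li><a href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
    <li><a href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
  </ul></td></tr></table>'

res <- atc_from_doc(htmlParse(snippet, asText = TRUE))
print(res)

# For real pages one would loop over the ids with lapply, for example:
# docs <- lapply(paste0("http://www.drugbank.ca/drugs/", ids), htmlParse)
# refs <- lapply(docs, atc_from_doc)
```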
