How do I pair XML node values from a common parent node in R?

Question

I have the following sample XML:

<body>
  <div class="row">
    <div class="column">
      <span class="title">Color</span>
    </div>
    <div class="column property">Blue</div>
  </div> 
  <div class="row">
    <div class="column">
      <span class="title">Shape</span>
    </div>
    <div class="column property">Square</div>
  </div> 
</body>

How can I use R to pair each title with its property and output:

Color = Blue
Shape = Square

I tried the script below, but the titles come back wrapped in XML tags and the properties are missing:

library(XML)

getDetails <- function(id) {
  html <- htmlTreeParse( "exampleXML.html" ,useInternal = TRUE)
  xpathSApply( html , "//div[@class='row']" , function(row) { 
    print( xmlElementsByTagName(row, "span", recursive = TRUE) )
  })
}

getDetails()

I had no luck with this either:

library(XML)      #to install use: install.packages("XML")
library(xml2)     #to install use: install.packages("xml2")
library(magrittr) #to install use: install.packages("magrittr")

extract_info <- function(x){
   title <- x %>% xml_find_first(".//span[@class='title']") %>% xml_text
   property <- x %>% xml_find_first(".//div[@class='column property']") %>% xml_text
   setNames(property, title)
 }

html <- htmlTreeParse( "exampleXML.html" ,useInternal = TRUE)
html %>% xml_find_all("//div[@class='row']") %>% extract_info

Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "c('HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument')"

Answer 1

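With xml2 you can do the following. Note that the document is parsed with xml2's own read_xml() rather than XML's htmlTreeParse(), so that xml_find_all() receives an object it has a method for: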

library(xml2)     #to install use: install.packages("xml2")
library(magrittr) #to install use: install.packages("magrittr")

extract_info <- function(x){
  title <- x %>% xml_find_first(".//span[@class='title']") %>% xml_text
  property <- x %>% xml_find_first(".//div[@class='column property']") %>% xml_text
  setNames(property, title)
}

html <- read_xml( "exampleXML.html" )
html %>% xml_find_all("//div[@class='row']") %>% extract_info

This gives you the following named vector:

   Color    Shape 
  "Blue" "Square"
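
The question asks for "title = property" lines rather than a named vector. A minimal sketch of that last step follows, assuming the sample markup is held in a string (the name txt is only for illustration) instead of being read from exampleXML.html:

```r
library(xml2)

# Assumption: the sample markup from the question, held in a string
txt <- '<body>
  <div class="row">
    <div class="column"><span class="title">Color</span></div>
    <div class="column property">Blue</div>
  </div>
  <div class="row">
    <div class="column"><span class="title">Shape</span></div>
    <div class="column property">Square</div>
  </div>
</body>'

rows <- xml_find_all(read_xml(txt), "//div[@class='row']")

# One title and one property per row, paired by position within the row set
title <- xml_text(xml_find_first(rows, ".//span[@class='title']"))
property <- xml_text(xml_find_first(rows, ".//div[@class='column property']"))
pairs <- setNames(property, title)

# Turn the named vector into "title = property" lines
lines_out <- paste(names(pairs), pairs, sep = " = ")
writeLines(lines_out)
```

paste() is vectorised over the names and the values, so no explicit loop is needed.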

Answer 2

If your XML is well formed (i.e. the elements always appear in the same order, one title followed by its property), then you can:

library(xml2)
library(purrr)

# txt is assumed to hold the sample XML from the question as a single string
doc <- read_xml(txt)

vals <- xml_text(xml_find_all(doc, ".//*[@class='title' or @class='column property']"))
map_chr(seq(1, length(vals), by=2), ~sprintf("%s = %s", vals[.], vals[.+1])) %>% 
  cat(sep="\n")

This prints the requested "Color = Blue" and "Shape = Square" lines as well.
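
Pairing values by position in one flat vector depends on every title being followed by exactly one property. As a hedged illustration of what happens when that assumption fails, the sketch below uses a hypothetical variant of the sample in which the second row has no property element, and pairs row by row instead. It relies on xml_find_first() returning one result per input node, with a missing node where nothing matches:

```r
library(xml2)

# Hypothetical variant of the sample: the second row has lost its property <div>
txt <- '<body>
  <div class="row">
    <div class="column"><span class="title">Color</span></div>
    <div class="column property">Blue</div>
  </div>
  <div class="row">
    <div class="column"><span class="title">Shape</span></div>
  </div>
</body>'

rows <- xml_find_all(read_xml(txt), "//div[@class='row']")

# One result per row; a row without a match yields NA after xml_text()
title <- xml_text(xml_find_first(rows, ".//span[@class='title']"))
property <- xml_text(xml_find_first(rows, ".//div[@class='column property']"))

pairs <- setNames(property, title)
pairs
```

Here the Shape entry comes back as NA instead of shifting the remaining values out of alignment, which is what the flat, position-based pairing would do.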

Answer 3

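Consider a nested xpathSApply(), where the outer loop iterates over the rows and the inner calls extract the title and the corresponding property value of each row: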

library(XML)

example_html <- paste0('<body>',
                   '  <div class="row">',
                   '    <div class="column">',
                   '       <span class="title">Color</span>',
                   '    </div>',
                   '    <div class="column property">Blue</div>',
                   '  </div>',
                   '  <div class="row">',
                   '    <div class="column">',
                   '       <span class="title">Shape</span>',
                   '    </div>',
                   '    <div class="column property">Square</div>',
                   '  </div>', 
                   '</body>')

# asText = TRUE makes it explicit that example_html is markup, not a file name
doc <- htmlTreeParse(example_html, useInternal = TRUE, asText = TRUE)

columns <- xpathSApply(doc, "//div[@class='row']", function(row){
   title <- xpathSApply(row, "div[@class='column']/span", xmlValue)
   property <- xpathSApply(row, "div[@class='column property']", xmlValue)
   setNames(gsub(" ", "", property), gsub(" ", "", title))    # GSUB TO STRIP WHITESPACE
})

# title and property exist only inside the function above; columns is already named
columns
#  Color    Shape 
#  "Blue" "Square"

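Alternatively, assuming strict consistency across rows (no missing child elements, and no multiple same-named elements for the title or property values), consider a couple of document-wide xpathSApply() calls: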

title <- xpathSApply(doc, "//div[@class='column']/span", xmlValue)
property <- xpathSApply(doc, "//div[@class='column property']", xmlValue)

columns <- setNames(property, title)
columns
#   Color    Shape 
#  "Blue" "Square"
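
Because the two document-wide queries are only paired by position, a cheap guard is to check that they return the same number of values before combining them. The following is a sketch under that assumption; the variable names txt and result are illustrative, and htmlParse() from the XML package is used here to parse the string:

```r
library(XML)

# Assumption: the sample markup from the question, held in a string
txt <- '<body>
  <div class="row">
    <div class="column"><span class="title">Color</span></div>
    <div class="column property">Blue</div>
  </div>
  <div class="row">
    <div class="column"><span class="title">Shape</span></div>
    <div class="column property">Square</div>
  </div>
</body>'

doc <- htmlParse(txt, asText = TRUE)

title <- xpathSApply(doc, "//div[@class='column']/span", xmlValue)
property <- xpathSApply(doc, "//div[@class='column property']", xmlValue)

# Stop early if a row is missing a title or a property, since the
# position-based pairing below would otherwise be silently misaligned
stopifnot(length(title) == length(property))

result <- data.frame(title = trimws(title),
                     property = trimws(property),
                     stringsAsFactors = FALSE)
result
```

The length check does not detect a row that has two properties and another that has none, so the row-wise approach remains the safer choice for irregular markup.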


Answer 1
1 2016-09-11 07:27:59

Answer 2
1 2016-09-11 11:40:29

Answer 3
1 accepted 2016-09-11 15:06:52
