In R how do I pair XML node values from common parent nodes?
I have the following sample XML:
<body>
<div class="row">
<div class="column">
<span class="title">Color</span>
</div>
<div class="column property">Blue</div>
</div>
<div class="row">
<div class="column">
<span class="title">Shape</span>
</div>
<div class="column property">Square</div>
</div>
</body>
How can I use R to pair each title with its property and output:
Color = Blue
Shape = Square
I tried the script below, but the titles come back wrapped in XML tags and the properties are missing:
library(XML)
getDetails <- function(id) {
html <- htmlTreeParse( "exampleXML.html" ,useInternal = TRUE)
xpathSApply( html , "//div[@class='row']" , function(row) {
print( xmlElementsByTagName(row, "span", recursive = TRUE) )
})
}
getDetails()
I had no luck with this either:
library(XML) #to install use: install.packages("XML")
library(xml2) #to install use: install.packages("xml2")
library(magrittr) #to install use: install.packages("magrittr")
extract_info <- function(x){
title <- x %>% xml_find_first(".//span[@class='title']") %>% xml_text
property <- x %>% xml_find_first(".//div[@class='column property']") %>% xml_text
setNames(property, title)
}
html <- htmlTreeParse( "exampleXML.html" ,useInternal = TRUE)
html %>% xml_find_all("//div[@class='row']") %>% extract_info
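Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "c('HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument')"
Using xml2, you can do the following. Note that the file is parsed with xml2's read_xml() rather than XML's htmlTreeParse(): xml_find_all() only has methods for xml2 objects, which is why the attempt above fails.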
library(xml2) #to install use: install.packages("xml2")
library(magrittr) #to install use: install.packages("magrittr")
extract_info <- function(x){
title <- x %>% xml_find_first(".//span[@class='title']") %>% xml_text
property <- x %>% xml_find_first(".//div[@class='column property']") %>% xml_text
setNames(property, title)
}
html <- read_xml( "exampleXML.html" )
html %>% xml_find_all("//div[@class='row']") %>% extract_info
This gives you the following named vector:
Color Shape
"Blue" "Square"
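If your XML is regular (i.e. the order of the elements never varies, so every title is followed by its property), then you can:
library(xml2)
library(purrr) # also provides %>%
# txt is assumed to hold the sample XML as a character string
doc <- read_xml(txt)
# Values come back in document order: title, property, title, property, ...
vals <- xml_text(xml_find_all(doc, ".//*[@class='title' or @class='column property']"))
map_chr(seq(1, length(vals), by=2), ~sprintf("%s = %s", vals[.], vals[.+1])) %>%
  cat(sep="\n")
This gives the same pairing.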
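The odd/even indexing above can also be written in base R by reshaping the values into a two-column matrix. This is only a sketch: `vals` here is a hypothetical stand-in for the alternating title/property vector, and it relies on the same assumption that every title is immediately followed by its property.

```r
# Hypothetical stand-in for the alternating title/property values
vals <- c("Color", "Blue", "Shape", "Square")

# Fill row by row so each row holds one (title, property) pair
pairs <- matrix(vals, ncol = 2, byrow = TRUE)

# Join the two columns with " = " and print one pair per line
out <- paste(pairs[, 1], pairs[, 2], sep = " = ")
cat(out, sep = "\n")
```

If the number of values is odd, `matrix()` warns and recycles, which is a useful hint that a title or property is missing somewhere.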
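Consider using nested xpathSApply() calls, where the outer call iterates over the rows and the inner calls extract the title and property values of each row: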
library(XML)
example_html <- paste0('<body>',
' <div class="row">',
' <div class="column">',
' <span class="title">Color</span>',
' </div>',
' <div class="column property">Blue</div>',
' </div>',
' <div class="row">',
' <div class="column">',
' <span class="title">Shape</span>',
' </div>',
' <div class="column property">Square</div>',
' </div>',
'</body>')
doc <- htmlTreeParse(example_html, asText = TRUE, useInternalNodes = TRUE)
columns <- xpathSApply(doc, "//div[@class='row']", function(row){
  # Paths are relative to the current row node
  title <- xpathSApply(row, "div[@class='column']/span", xmlValue)
  property <- xpathSApply(row, "div[@class='column property']", xmlValue)
  setNames(gsub(" ", "", property), gsub(" ", "", title)) # gsub() strips whitespace
})
columns
# Color Shape
# "Blue" "Square"
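To turn a named vector like this into the `Color = Blue` lines requested in the question, base R's `paste()` is enough. A minimal sketch, where `columns` is recreated by hand as a stand-in for the result above:

```r
# Hypothetical stand-in for the named vector produced above
columns <- c(Color = "Blue", Shape = "Square")

# Paste each name to its value and print one pair per line
pairs <- paste(names(columns), columns, sep = " = ")
cat(pairs, sep = "\n")
```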
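Alternatively, if the rows are strictly consistent (no row is missing a child element, and no row has more than one title or property element), consider a pair of document-wide xpathSApply() calls: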
title <- xpathSApply(doc, "//div[@class='column']/span", xmlValue)
property <- xpathSApply(doc, "//div[@class='column property']", xmlValue)
columns <- setNames(property, title)
columns
# Color Shape
# "Blue" "Square"
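The consistency assumption matters: if one row lacks a property, the two document-wide queries return vectors of different lengths and the pairing breaks. The sketch below, using xml2 on a hypothetical variant of the sample in which the second row has no property, shows how querying per row keeps titles and properties aligned, since `xml_find_first()` returns one result per input node and a missing match should come back as `NA` from `xml_text()`.

```r
library(xml2)

# Hypothetical variant of the sample: the second row has no property
txt <- '<body>
  <div class="row">
    <div class="column"><span class="title">Color</span></div>
    <div class="column property">Blue</div>
  </div>
  <div class="row">
    <div class="column"><span class="title">Shape</span></div>
  </div>
</body>'

doc <- read_xml(txt)
rows <- xml_find_all(doc, "//div[@class='row']")

# One lookup per row, so the outputs always have length(rows) elements
title <- xml_text(xml_find_first(rows, ".//span[@class='title']"))
property <- xml_text(xml_find_first(rows, ".//div[@class='column property']"))

paired <- setNames(property, title)
print(paired)
```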