R complex xml to data frame
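I am looking for a way to convert a highly complex XML file (too long to show here, so it is at the bottom) into a table. The file was obtained from the official Property Registry and stores data for about 20,000 buildings.
For each "consulta_dnp" (one per building), the result must be a single row, with these fields as columns: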
<pc1><pc2><car><cc1><cc2><np><nm><luso><sfc><cpt><ant>
Another problem is the error entry that is returned when the data could not be retrieved. It is stored this way:
<consulta_dnp>
<control>
<cuerr>1</cuerr>
</control>
<lerr>
<err>
<cod>4</cod>
<des>error description</des>
</err>
</lerr>
</consulta_dnp>
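I am not interested in the error code; I just want a blank row, "error", or something similar.
I have been looking for answers to similar questions, but have had no luck so far.
This is the code I am using: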
doc <- xmlParse("resultado_JA-.txt")
xml_len <- length(getNodeSet(doc,"//consulta_dnp"))
dflist <- lapply(seq(xml_len), function(i){
# PARENT NODES
d1 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/"))), key=1)
# CHILD NODES
d2 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/ibdi/rc/pc1"))), key=1)
d3 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/ibdi/rc/pc2"))), key=1)
d4 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/ibdi/rc/pc1"))), key=1)
d5 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/ibdi/rc/car"))), key=1)
d6 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/ibdi/rc/cc1"))), key=1)
d7 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/ibdi/rc/cc2"))), key=1)
d8 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/dt/np"))), key=1)
d9 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/dt/nm"))), key=1)
d10 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/ldt"))), key=1)
d11 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/debi/luso"))), key=1)
d12 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/debi/sfc"))), key=1)
d13 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/debi/cpt"))), key=1)
d14 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/debi/ant"))), key=1)
# MERGE ON KEY, THEN DROP KEY
merge(d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d14, by="key")[-1]
})
xmldf_JA <- do.call(rbind, dflist)
This code counts the correct number of "consulta_dnp" occurrences, but it always stops with this error:
XPath error : Invalid expression
XPath error : Invalid expression
Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces, :
error evaluating xpath expression //consulta_dnp[1]/
Any help would be greatly appreciated.
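A note on the error message itself: the expression it reports, //consulta_dnp[1]/, ends in a slash, and an XPath location path cannot end that way. The paths in the code also spell the element as ibdi, while the XML uses idbi. The sketch below shows two valid alternatives with the XML package; the one-record document and the variable names are made up for illustration.

```r
library(XML)

# Tiny stand-in document (hypothetical), same nesting as the real file
doc <- xmlParse(
  "<Doc><consulta_dnp><bico><bi><idbi><rc><pc1>0499418</pc1></rc></idbi></bi></bico></consulta_dnp></Doc>",
  asText = TRUE
)

i <- 1

# All direct children of the i-th record: end the path with a node test
children <- getNodeSet(doc, paste0("//consulta_dnp[", i, "]/*"))

# A field anywhere below the i-th record: no need to spell out every level
pc1 <- xpathSApply(doc, paste0("//consulta_dnp[", i, "]//pc1"), xmlValue)
```

Note also that merge() combines two data frames per call, so joining more than two needs repeated calls, for example via Reduce().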
Here is the XML (not the real data, but the real structure):
<Doc>
<consulta_dnp>
<control>
<cudnp>1</cudnp>
<cucons>1</cucons>
<cucul>0</cucul>
</control>
<bico>
<bi>
<idbi>
<cn>UR</cn>
<rc>
<pc1>0499418</pc1>
<pc2>VG3709N</pc2>
<car>0008</car>
<cc1>R</cc1>
<cc2>E</cc2>
</rc>
</idbi>
<dt>
<loine>
<cp>23</cp>
<cm>50</cm>
</loine>
<cmc>900</cmc>
<np>VILLACONEJOS DE ARRIBA</np>
<nm>MALAGA</nm>
<locs>
<lous>
<lourb>
<dir>
<cv>799</cv>
<tv>CL</tv>
<nv>calle</nv>
<pnp>2</pnp>
<snp>0</snp>
</dir>
<loint>
<es>1</es>
<pt>01</pt>
<pu>B</pu>
</loint>
<dp>29005</dp>
<dm>1</dm>
</lourb>
</lous>
</locs>
</dt>
<ldt>CL calle 2 Es:1 Pl:01 Pt:B 29005 Madrid (Madrid)</ldt>
<debi>
<luso>Residencial</luso>
<sfc>72</sfc>
<cpt>3,430000</cpt>
<ant>1979</ant>
</debi>
</bi>
<lcons>
<cons>
<lcd>VIVIENDA</lcd>
<dt>
<lourb>
<loint>
<es>1</es>
<pt>01</pt>
<pu>B</pu>
</loint>
</lourb>
</dt>
<dfcons>
<stl>72</stl>
</dfcons>
</cons>
</lcons>
</bico>
</consulta_dnp>
</Doc>
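I would approach it as follows: read the data with xml2, create expressions that extract the elements of interest, then map over those elements and combine the results into a data.frame.
library(xml2)
library(tidyverse)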
# the structure of the document (code for data see below)
# I copied the code, so we have one entry, one error, and the first entry repeated
xml
#> {xml_document}
#> <Doc>
#> [1] <consulta_dnp>\n <control>\n <cudnp>1</cudnp>\n <cucons>1</cu ...
#> [2] <consulta_dnp>\n <control>\n <cuerr>1</cuerr>\n </control>\n < ...
#> [3] <consulta_dnp>\n <control>\n <cudnp>1</cudnp>\n <cucons>1</cu ...
# small helper for extracting the content
extract_child <- function(x, xpath) {
  xml_find_all(x, xpath) %>%
    xml_text()
}
# our fields of interest
xpath_expressions <- c("pc1", "pc2", "car", "cc1", "cc2", "np", "nm", "luso", "sfc",
                       "cpt", "ant")
xpath_expressions %>%
  paste0(".//", .) %>% # search for the expressions from root
  map(~extract_child(xml, .x)) %>%
  set_names(xpath_expressions) %>%
  dplyr::bind_rows() %>%
  type_convert(locale = locale(decimal_mark = ","))
#> # A tibble: 2 x 11
#> pc1 pc2 car cc1 cc2 np nm luso sfc cpt ant
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int> <dbl> <int>
#> 1 0499418 VG3709N 0008 R E VILLACO… MALA… Resi… 72 3.43 1979
#> 2 0499418 VG3709N 0008 R E VILLACO… MALA… Resi… 72 3.43 1979
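The cpt values use a decimal comma ("3,430000"), which is why the locale is passed to type_convert(). A minimal sketch of that step in isolation, assuming readr is installed; the one-row data frame is made up for illustration:

```r
library(readr)

# Locale in which "," is the decimal mark
comma_locale <- locale(decimal_mark = ",")

# Parsing a single value
value <- parse_double("3,430000", locale = comma_locale)

# The same conversion applied to character columns of a data frame
raw <- data.frame(cpt = "3,430000", luso = "Residencial", stringsAsFactors = FALSE)
converted <- type_convert(raw, locale = comma_locale)
```

Without the locale, the comma would not be read as a decimal mark.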
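This approach works almost "magically", and the error entries cause no problem: only the parts we are interested in are extracted, and the entries with errors share none of those fields with the entries without errors. If you have entries where some fields are missing while others are present, the code needs to be adjusted. In detail: when an entire tag is missing, this approach breaks. When all tags are present but contain nothing (for example <ant></ant>), the result is a proper NA.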
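The difference between a missing tag and an empty tag can be seen in a small sketch. The three-record document below is made up for illustration. xml_find_all() on the whole document returns only the matches that exist, so the extracted fields can end up with different lengths, while xml_find_first() on a set of records returns one result per record, with NA where the tag is absent.

```r
library(xml2)

# Hypothetical mini document: a full record, an empty <ant>, and no <ant> at all
doc <- read_xml("<Doc>
  <consulta_dnp><ant>1979</ant><sfc>72</sfc></consulta_dnp>
  <consulta_dnp><ant></ant><sfc>80</sfc></consulta_dnp>
  <consulta_dnp><sfc>95</sfc></consulta_dnp>
</Doc>")

# Searching the whole document: lengths differ between fields
all_ant <- xml_text(xml_find_all(doc, ".//ant"))
all_sfc <- xml_text(xml_find_all(doc, ".//sfc"))

# Searching per record: always one value per record
records <- xml_children(doc)
first_ant <- xml_text(xml_find_first(records, ".//ant"))
```

Here all_ant has two elements and all_sfc has three, so they cannot be bound into columns of one table, whereas first_ant has three elements: "1979", an empty string, and NA.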
The following code works even when elements are missing, and it should run on your data.
extract_child <- function(x, xpath) {
  out <- xml_find_all(x, xpath) %>%
    xml_text()
  # a missing tag yields an empty result; replace it with NA
  if (is_empty(out)) out <- NA_character_
  out
}
# our fields of interest
xpath_expressions <- c("pc1", "pc2", "car", "cc1", "cc2", "np", "nm", "luso", "sfc",
                       "cpt", "ant")
extract_part <- function(part) {
  xpath_expressions %>%
    paste0(".//", .) %>% # search anywhere below the current node
    map(~extract_child(part, .x)) %>%
    set_names(xpath_expressions) %>%
    keep(~any(!is.na(.))) %>%
    dplyr::bind_rows() %>%
    type_convert(locale = locale(decimal_mark = ","))
}
xml %>%
  xml_children() %>%
  map_df(extract_part)
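The question asked for a blank row where a lookup failed. One possible way to get that, sketched here under assumptions, is to call xml_find_first() on the set of consulta_dnp nodes, which returns one value per record and NA where a field is absent. The shortened document, the reduced field list, and the names doc, fields, records, and result are made up for illustration; this is an alternative sketch, not the code above.

```r
library(xml2)

# Hypothetical shortened document: one normal record and one error record
doc <- read_xml("<Doc>
  <consulta_dnp>
    <control><cudnp>1</cudnp></control>
    <bico><bi>
      <idbi><rc><pc1>0499418</pc1><pc2>VG3709N</pc2></rc></idbi>
      <debi><luso>Residencial</luso><ant>1979</ant></debi>
    </bi></bico>
  </consulta_dnp>
  <consulta_dnp>
    <control><cuerr>1</cuerr></control>
    <lerr><err><cod>4</cod><des>error description</des></err></lerr>
  </consulta_dnp>
</Doc>")

fields <- c("pc1", "pc2", "luso", "ant")
records <- xml_find_all(doc, ".//consulta_dnp")

# One character vector per field, each with one element per record
columns <- lapply(fields, function(field) {
  xml_text(xml_find_first(records, paste0(".//", field)))
})
names(columns) <- fields
result <- as.data.frame(columns, stringsAsFactors = FALSE)

# Flag the records that carry a <cuerr> element in their <control> block
result$error <- !is.na(xml_text(xml_find_first(records, "./control/cuerr")))
```

The result has one row per consulta_dnp; the error record keeps its position, with NA in every field and error set to TRUE. Type conversion, such as the decimal comma in cpt, would still have to be applied afterwards.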
xml <- read_xml("<Doc>
<consulta_dnp>
<control>
<cudnp>1</cudnp>
<cucons>1</cucons>
<cucul>0</cucul>
</control>
<bico>
<bi>
<idbi>
<cn>UR</cn>
<rc>
<pc1>0499418</pc1>
<pc2>VG3709N</pc2>
<car>0008</car>
<cc1>R</cc1>
<cc2>E</cc2>
</rc>
</idbi>
<dt>
<loine>
<cp>23</cp>
<cm>50</cm>
</loine>
<cmc>900</cmc>
<np>VILLACONEJOS DE ARRIBA</np>
<nm>MALAGA</nm>
<locs>
<lous>
<lourb>
<dir>
<cv>799</cv>
<tv>CL</tv>
<nv>calle</nv>
<pnp>2</pnp>
<snp>0</snp>
</dir>
<loint>
<es>1</es>
<pt>01</pt>
<pu>B</pu>
</loint>
<dp>29005</dp>
<dm>1</dm>
</lourb>
</lous>
</locs>
</dt>
<ldt>CL calle 2 Es:1 Pl:01 Pt:B 29005 Madrid (Madrid)</ldt>
<debi>
<luso>Residencial</luso>
<sfc>72</sfc>
<cpt>3,430000</cpt>
<ant>1979</ant>
</debi>
</bi>
<lcons>
<cons>
<lcd>VIVIENDA</lcd>
<dt>
<lourb>
<loint>
<es>1</es>
<pt>01</pt>
<pu>B</pu>
</loint>
</lourb>
</dt>
<dfcons>
<stl>72</stl>
</dfcons>
</cons>
</lcons>
</bico>
</consulta_dnp>
<consulta_dnp>
<control>
<cuerr>1</cuerr>
</control>
<lerr>
<err>
<cod>4</cod>
<des>error description</des>
</err>
</lerr>
</consulta_dnp>
<consulta_dnp>
<control>
<cudnp>1</cudnp>
<cucons>1</cucons>
<cucul>0</cucul>
</control>
<bico>
<bi>
<idbi>
<cn>UR</cn>
<rc>
<pc1>0499418</pc1>
<pc2>VG3709N</pc2>
<car>0008</car>
<cc1>R</cc1>
<cc2>E</cc2>
</rc>
</idbi>
<dt>
<loine>
<cp>23</cp>
<cm>50</cm>
</loine>
<cmc>900</cmc>
<np>VILLACONEJOS DE ARRIBA</np>
<nm>MALAGA</nm>
<locs>
<lous>
<lourb>
<dir>
<cv>799</cv>
<tv>CL</tv>
<nv>calle</nv>
<pnp>2</pnp>
<snp>0</snp>
</dir>
<loint>
<es>1</es>
<pt>01</pt>
<pu>B</pu>
</loint>
<dp>29005</dp>
<dm>1</dm>
</lourb>
</lous>
</locs>
</dt>
<ldt>CL calle 2 Es:1 Pl:01 Pt:B 29005 Madrid (Madrid)</ldt>
<debi>
<luso>Residencial</luso>
<sfc>72</sfc>
<cpt>3,430000</cpt>
<ant>1979</ant>
</debi>
</bi>
<lcons>
<cons>
<lcd>VIVIENDA</lcd>
<dt>
<lourb>
<loint>
<es>1</es>
<pt>01</pt>
<pu>B</pu>
</loint>
</lourb>
</dt>
<dfcons>
<stl>72</stl>
</dfcons>
</cons>
</lcons>
</bico>
</consulta_dnp>
</Doc>")