R complex xml to data frame

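I'm looking for a way to convert a highly complex XML file (too long to show here, so it is at the bottom) into a table. The file was obtained from the official Property Registry and stores data for about 20,000 buildings.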

For each "consulta_dnp" (each building), the result must be a single row, with these fields as columns:

<pc1><pc2><car><cc1><cc2><np><nm><luso><sfc><cpt><ant>

Another problem is the error entries, where the data could not be retrieved. They are stored like this:

<consulta_dnp>
  <control>
    <cuerr>1</cuerr>
  </control>
  <lerr>
    <err>
      <cod>4</cod>
      <des>error description</des>
    </err>
  </lerr>
</consulta_dnp>

I'm not interested in the error code; I just want a blank row, "error", or something similar.

I have been looking for answers to similar problems, but I haven't had any luck.

This is the code I'm using:

library(XML)

doc <- xmlParse("resultado_JA-.txt")

xml_len <- length(getNodeSet(doc,"//consulta_dnp"))

dflist <- lapply(seq(xml_len), function(i){   
  # PARENT NODES   
  d1 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/"))), key=1)
  # CHILD NODES
  d2 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/ibdi/rc/pc1"))), key=1) 
  d3 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/ibdi/rc/pc2"))), key=1) 
  d4 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/ibdi/rc/pc1"))), key=1) 
  d5 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/ibdi/rc/car"))), key=1) 
  d6 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/ibdi/rc/cc1"))), key=1) 
  d7 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/ibdi/rc/cc2"))), key=1) 
  d8 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/dt/np"))), key=1) 
  d9 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/dt/nm"))), key=1) 
  d10 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/ldt"))), key=1) 
  d11 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/debi/luso"))), key=1) 
  d12 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/debi/sfc"))), key=1) 
  d13 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/debi/cpt"))), key=1) 
  d14 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//consulta_dnp[",i,"]/bico/bi/debi/ant"))), key=1) 

  # MERGE ON KEY, THEN DROP KEY      
  merge(d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d14, by="key")[-1]    
})

xmldf_JA <- do.call(rbind, dflist)

This code counts the correct number of "consulta_dnp" occurrences, but it always gets stuck with this error:

XPath error : Invalid expression
XPath error : Invalid expression
 Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces,  : 
  error evaluating xpath expression //consulta_dnp[1]/ 

Any help would be greatly appreciated.

This is the XML (not real data, but the real structure):

<Doc>
 <consulta_dnp>
  <control>
    <cudnp>1</cudnp>
    <cucons>1</cucons>
    <cucul>0</cucul>
  </control>
  <bico>
    <bi>
      <idbi>
        <cn>UR</cn>
        <rc>
          <pc1>0499418</pc1>
          <pc2>VG3709N</pc2>
          <car>0008</car>
          <cc1>R</cc1>
          <cc2>E</cc2>
        </rc>
      </idbi>
      <dt>
        <loine>
          <cp>23</cp>
          <cm>50</cm>
        </loine>
        <cmc>900</cmc>
        <np>VILLACONEJOS DE ARRIBA</np>
        <nm>MALAGA</nm>
        <locs>
          <lous>
            <lourb>
              <dir>
                <cv>799</cv>
                <tv>CL</tv>
                <nv>calle</nv>
                <pnp>2</pnp>
                <snp>0</snp>
              </dir>
              <loint>
                <es>1</es>
                <pt>01</pt>
                <pu>B</pu>
              </loint>
              <dp>29005</dp>
              <dm>1</dm>
            </lourb>
          </lous>
        </locs>
      </dt>
      <ldt>CL calle 2 Es:1 Pl:01 Pt:B 29005 Madrid (Madrid)</ldt>
      <debi>
        <luso>Residencial</luso>
        <sfc>72</sfc>
        <cpt>3,430000</cpt>
        <ant>1979</ant>
      </debi>
    </bi>
    <lcons>
      <cons>
        <lcd>VIVIENDA</lcd>
        <dt>
          <lourb>
            <loint>
              <es>1</es>
              <pt>01</pt>
              <pu>B</pu>
            </loint>
          </lourb>
        </dt>
        <dfcons>
          <stl>72</stl>
        </dfcons>
      </cons>
    </lcons>
  </bico>
</consulta_dnp>
</Doc>
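
The error quoted in the question appears to come from the trailing slash in `//consulta_dnp[1]/`: a path that ends in `/` is not a complete XPath expression, so it is rejected before any node is looked up. The sketch below reproduces this with xml2 on a tiny made-up document (the data and the variable names are assumptions for illustration). The XML package used in the question is also built on libxml2, so `getNodeSet()` should follow the same rule. Two other things in the question's code look worth checking: base R's `merge()` joins only two data frames per call, and the paths use `ibdi` where the sample XML has `idbi`.

```r
library(xml2)

# Tiny stand-in document (hypothetical data, same nesting idea as the question)
doc <- read_xml("<Doc>
  <consulta_dnp><bico><bi><debi><ant>1979</ant></debi></bi></bico></consulta_dnp>
  <consulta_dnp><lerr><err><cod>4</cod></err></lerr></consulta_dnp>
</Doc>")

# A path that ends in "/" is not a complete XPath expression, so evaluation fails
bad <- tryCatch(
  suppressWarnings(xml_find_all(doc, "//consulta_dnp[1]/")),
  error = function(e) "invalid XPath"
)

# Dropping the trailing slash selects the record itself ...
record <- xml_find_all(doc, "//consulta_dnp[1]")
# ... and "/*" selects its direct children
children <- xml_find_all(doc, "//consulta_dnp[1]/*")

xml_name(children)
```

Depending on what is needed, either form without the bare trailing slash should parse: the first returns the `consulta_dnp` node, the second its child elements.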
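
I would approach it as follows: read the data with xml2, create the expressions that extract the elements of interest, then map over those expressions and combine the results into a data.frame.

library(xml2)
library(tidyverse)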

# the structure of the document (code for data see below)
# I copied the code, so we have one entry, one error, and the first entry repeated
xml
#> {xml_document}
#> <Doc>
#> [1] <consulta_dnp>\n  <control>\n    <cudnp>1</cudnp>\n    <cucons>1</cu ...
#> [2] <consulta_dnp>\n  <control>\n    <cuerr>1</cuerr>\n  </control>\n  < ...
#> [3] <consulta_dnp>\n  <control>\n    <cudnp>1</cudnp>\n    <cucons>1</cu ...

# small helper for extracting the content
extract_child <- function(x, xpath) {
  xml_find_all(x, xpath) %>% 
    xml_text()
}

# our fields of interest
xpath_expressions <- c("pc1", "pc2", "car", "cc1", "cc2", "np", "nm", "luso", "sfc", 
                       "cpt", "ant")


xpath_expressions %>% 
  paste0(".//", .) %>% # search for the expressions from root
  map(~extract_child(xml, .x)) %>% 
  set_names(xpath_expressions) %>% 
  dplyr::bind_rows() %>% 
  type_convert(locale = locale(decimal_mark = ",")) 
#> # A tibble: 2 x 11
#>   pc1     pc2     car   cc1   cc2   np       nm    luso    sfc   cpt   ant
#>   <chr>   <chr>   <chr> <chr> <chr> <chr>    <chr> <chr> <int> <dbl> <int>
#> 1 0499418 VG3709N 0008  R     E     VILLACO… MALA… Resi…    72  3.43  1979
#> 2 0499418 VG3709N 0008  R     E     VILLACO… MALA… Resi…    72  3.43  1979

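This approach works "magically", and the errors cause no problem, because only the parts we are interested in are extracted and there is no overlap between cases with errors and cases without. If you have entries where some fields are missing while others are present, the code needs to be adjusted. To elaborate: this approach breaks when an entire tag is missing. When all tags are present but contain nothing (e.g. <ant></ant>), the result is a proper NA.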
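
To illustrate the failure mode described above, here is a minimal sketch on a made-up two-record document (the data is an assumption, not taken from the real file). Because `xml_find_all()` is run on the whole document, a missing tag simply shortens that column's vector, and the values can no longer be lined up with their records.

```r
library(xml2)

# Hypothetical two-record document: the second record has no <ant> tag
doc <- read_xml("<Doc>
  <consulta_dnp><debi><sfc>72</sfc><ant>1979</ant></debi></consulta_dnp>
  <consulta_dnp><debi><sfc>85</sfc></debi></consulta_dnp>
</Doc>")

sfc <- xml_text(xml_find_all(doc, ".//sfc"))
ant <- xml_text(xml_find_all(doc, ".//ant"))

length(sfc)  # one value per record
length(ant)  # one value fewer, and nothing says which record it belongs to

# An empty tag, by contrast, still yields a value (an empty string)
empty <- read_xml("<debi><ant></ant></debi>")
xml_text(xml_find_all(empty, ".//ant"))
```

The empty string is what `type_convert()` later turns into NA, since readr treats "" as missing by default.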

Update

The following code works even when elements are missing, and should run for your case.

extract_child <- function(x, xpath) {
  out <- xml_find_all(x, xpath) %>% 
    xml_text()

  if (is_empty(out)) out <- NA_character_

  out
}

# our fields of interest
xpath_expressions <- c("pc1", "pc2", "car", "cc1", "cc2", "np", "nm", "luso", "sfc", 
                       "cpt", "ant")



extract_part <- function(part) {
  xpath_expressions %>% 
    paste0(".//", .) %>% # search anywhere below the current node
    map(~extract_child(part, .x)) %>% 
    set_names(xpath_expressions) %>% 
    keep(~any(!is.na(.))) %>% 
    dplyr::bind_rows() %>% 
    type_convert(locale = locale(decimal_mark = ",")) 
}


xml %>% 
  xml_children() %>% 
  map_df(extract_part)
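
The question asked for one row per `consulta_dnp`, including a blank row for the error entries. As a minimal sketch of one way to get that (a different technique from the code above, not the method used there), `xml_find_first()` can be applied to the whole set of records: it returns exactly one result per record, and a tag that is absent comes back as NA. The miniature document and the shortened `fields` vector are assumptions for illustration.

```r
library(xml2)

# Hypothetical miniature of the data: a valid record, an error record, a valid record
doc <- read_xml("<Doc>
  <consulta_dnp><bico><bi>
    <idbi><rc><pc1>0499418</pc1></rc></idbi>
    <debi><sfc>72</sfc><ant>1979</ant></debi>
  </bi></bico></consulta_dnp>
  <consulta_dnp><lerr><err><cod>4</cod></err></lerr></consulta_dnp>
  <consulta_dnp><bico><bi>
    <idbi><rc><pc1>0499419</pc1></rc></idbi>
    <debi><sfc>85</sfc></debi>
  </bi></bico></consulta_dnp>
</Doc>")

fields <- c("pc1", "sfc", "ant")  # shortened field list for the example

records <- xml_find_all(doc, "//consulta_dnp")

# xml_find_first() returns one result per record; a missing tag becomes NA
columns <- lapply(fields, function(f) {
  xml_text(xml_find_first(records, paste0(".//", f)))
})
names(columns) <- fields

result <- as.data.frame(columns, stringsAsFactors = FALSE)
result
```

All columns come back as character here; the `type_convert()` call with the comma decimal mark from the code above could be applied to the combined result afterwards.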

Data

   xml <- read_xml("<Doc>
     <consulta_dnp>
    <control>
    <cudnp>1</cudnp>
    <cucons>1</cucons>
    <cucul>0</cucul>
    </control>
    <bico>
    <bi>
    <idbi>
    <cn>UR</cn>
    <rc>
    <pc1>0499418</pc1>
    <pc2>VG3709N</pc2>
    <car>0008</car>
    <cc1>R</cc1>
    <cc2>E</cc2>
    </rc>
    </idbi>
    <dt>
    <loine>
    <cp>23</cp>
    <cm>50</cm>
    </loine>
    <cmc>900</cmc>
    <np>VILLACONEJOS DE ARRIBA</np>
    <nm>MALAGA</nm>
    <locs>
    <lous>
    <lourb>
    <dir>
    <cv>799</cv>
    <tv>CL</tv>
    <nv>calle</nv>
    <pnp>2</pnp>
    <snp>0</snp>
    </dir>
    <loint>
    <es>1</es>
    <pt>01</pt>
    <pu>B</pu>
    </loint>
    <dp>29005</dp>
    <dm>1</dm>
    </lourb>
    </lous>
    </locs>
    </dt>
    <ldt>CL calle 2 Es:1 Pl:01 Pt:B 29005 Madrid (Madrid)</ldt>
    <debi>
    <luso>Residencial</luso>
    <sfc>72</sfc>
    <cpt>3,430000</cpt>
    <ant>1979</ant>
    </debi>
    </bi>
    <lcons>
    <cons>
    <lcd>VIVIENDA</lcd>
    <dt>
    <lourb>
    <loint>
    <es>1</es>
    <pt>01</pt>
    <pu>B</pu>
    </loint>
    </lourb>
    </dt>
    <dfcons>
    <stl>72</stl>
    </dfcons>
    </cons>
    </lcons>
    </bico>
    </consulta_dnp>
    <consulta_dnp>
      <control>
                    <cuerr>1</cuerr>
                    </control>
                    <lerr>
                    <err>
                    <cod>4</cod>
                    <des>error description</des>
                    </err>
                    </lerr>
                    </consulta_dnp>
     <consulta_dnp>
    <control>
                    <cudnp>1</cudnp>
                    <cucons>1</cucons>
                    <cucul>0</cucul>
                    </control>
                    <bico>
                    <bi>
                    <idbi>
                    <cn>UR</cn>
                    <rc>
                    <pc1>0499418</pc1>
                    <pc2>VG3709N</pc2>
                    <car>0008</car>
                    <cc1>R</cc1>
                    <cc2>E</cc2>
                    </rc>
                    </idbi>
                    <dt>
                    <loine>
                    <cp>23</cp>
                    <cm>50</cm>
                    </loine>
                    <cmc>900</cmc>
                    <np>VILLACONEJOS DE ARRIBA</np>
                    <nm>MALAGA</nm>
                    <locs>
                    <lous>
                    <lourb>
                    <dir>
                    <cv>799</cv>
                    <tv>CL</tv>
                    <nv>calle</nv>
                    <pnp>2</pnp>
                    <snp>0</snp>
                    </dir>
                    <loint>
                    <es>1</es>
                    <pt>01</pt>
                    <pu>B</pu>
                    </loint>
                    <dp>29005</dp>
                    <dm>1</dm>
                    </lourb>
                    </lous>
                    </locs>
                    </dt>
                    <ldt>CL calle 2 Es:1 Pl:01 Pt:B 29005 Madrid (Madrid)</ldt>
                    <debi>
                    <luso>Residencial</luso>
                    <sfc>72</sfc>
                    <cpt>3,430000</cpt>
                    <ant>1979</ant>
                    </debi>
                    </bi>
                    <lcons>
                    <cons>
                    <lcd>VIVIENDA</lcd>
                    <dt>
                    <lourb>
                    <loint>
                    <es>1</es>
                    <pt>01</pt>
                    <pu>B</pu>
                    </loint>
                    </lourb>
                    </dt>
                    <dfcons>
                    <stl>72</stl>
                    </dfcons>
                    </cons>
                    </lcons>
                    </bico>
                    </consulta_dnp>
    </Doc>")
