How to subset all nested lists with the same name from a specific list in R

How can I extract all the SeriesTemporales elements?

The raw data is an XML file:

https://drive.google.com/file/d/1WOylIDRVDSicDjPZkDL0FyoyESfHIE_9/view?usp=sharing

library(XML)
doc <- xmlParse("./p48cierre_01-01-2019.xml")
docList <- xmlToList(doc)


mylist_SeriesTemporales <- sapply(docList, '[', 'SeriesTemporales')
$IdentificacionMensaje.NA
[1] NA
$VersionMensaje.NA
[1] NA
$TipoMensaje.NA
[1] NA
$TipoProceso.NA
[1] NA
$TipoClasificacion.NA
[1] NA

I also get all NAs in the list for SeriesTemporales; the output above shows the same result for the other elements. I want to convert all the SeriesTemporales lists into a single data frame. Please help me out.

Expected output:

> xmlDataOut
# A tibble: 30,000 x 7
   `Periodo.IntervaloTiempo.Attribute~ `Periodo.Resolucion.Attribute:~ `UnidadMedida.Attribute:~ `UPSalida.Attribute:v` `UPEntrada.Attribute:~ `TipoNegocio.Attribute~ `Periodo.Intervalo.Pos.Attribut~
   <chr>                               <chr>                           <chr>                     <lgl>                  <chr>                  <chr>                                              <dbl>
 1 2021-04-20T22:00Z/2021-04-21T22:00Z PT60M                           MWH                       NA                     ZERBI                  Z21                                                    1
 2 2021-04-20T22:00Z/2021-04-21T22:00Z PT60M                           MWH                       NA                     ZERBI                  Z21                                                   10

I do not see how SeriesTemporales can be converted into a data frame when its elements have different lengths/sizes. However, you can extract all the SeriesTemporales elements into another list, say l2, simply by doing this:

l2 <- docList[names(docList) == 'SeriesTemporales']
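To illustrate why filtering on the top-level names works where indexing inside each element returns NA, here is a minimal sketch on a toy list. The list `toy_list` and its values are hypothetical stand-ins for what `xmlToList()` returns (repeated top-level names, header entries stored as small named vectors); the real `docList` may be shaped differently.

```r
# Hypothetical stand-in for docList: several top-level elements share a name.
toy_list <- list(
  IdentificacionMensaje = c(v = "msg-1"),
  SeriesTemporales = list(IdentificacionSeriesTemporales = c(v = "STP0")),
  SeriesTemporales = list(IdentificacionSeriesTemporales = c(v = "STP1"))
)

# Looking for "SeriesTemporales" *inside* a header element finds nothing,
# which is where the NA values in the question come from.
header_lookup <- toy_list$IdentificacionMensaje["SeriesTemporales"]

# Comparing the top-level names instead keeps every matching element.
kept <- toy_list[names(toy_list) == "SeriesTemporales"]

length(kept)
names(kept)
```

Note that `toy_list[["SeriesTemporales"]]` or `toy_list$SeriesTemporales` would return only the first match, which is why the logical comparison on `names()` is used here.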

Now, if the first element of each entry in l2 is converted to a data frame, you get:

library(purrr)
map_df(l2, ~.x[1])
# A tibble: 1,256 x 1
   IdentificacionSeriesTemporales
   <chr>                         
 1 STP0                          
 2 STP1                          
 3 STP2                          
 4 STP3                          
 5 STP4                          
 6 STP5                          
 7 STP6                          
 8 STP7                          
 9 STP8                          
10 STP9                          
# ... with 1,246 more rows

But the third element gives this:

map_df(l2, ~.x[3])
# A tibble: 2,512 x 2
   UPSalida UPEntrada
   <chr>    <chr>    
 1 LUBAC01  NA       
 2 NES      NA       
 3 FUSIC01  NA       
 4 NES      NA       
 5 NA       ECEGRG   
 6 NA       NES      
 7 NA       HYGESTE  
 8 NA       NES      
 9 GNRAC01  NA       
10 NES      NA       
# ... with 2,502 more rows

This looks like a straightforward XML document to parse. The only catch is that the information is stored in the nodes' attributes, not in the nodes themselves.

Here is an xml2 solution.

See the comments in the code for an explanation.

library(xml2)
library(dplyr)

page <- read_xml("p48cierre_01-01-2019.xml")

#check for namespace
xml_ns(page)

#strip namespace
xml_ns_strip(page)

#find all SeriesTemporales nodes
seriesT <- page %>% xml_find_all(".//SeriesTemporales")

#get the requested information from each parent node:
#find the correct sub-node and read its "v" attribute
#(assumes at most one such sub-node per parent)
Intervalo <-  seriesT %>% xml_find_first(".//IntervaloTiempo") %>% xml_attr("v")
Resolution <- seriesT %>% xml_find_first(".//Resolucion") %>% xml_attr("v")
UnidadMedida <-  seriesT %>% xml_find_first(".//UnidadMedida") %>% xml_attr("v")
UPSalida <-  seriesT %>% xml_find_first(".//UPSalida") %>% xml_attr("v")
UPEntrada <-  seriesT %>% xml_find_first(".//UPEntrada") %>% xml_attr("v")
TipoNegocio <-  seriesT %>% xml_find_first(".//TipoNegocio") %>% xml_attr("v")


#combine into a final answer
head(data.frame(Intervalo, Resolution, UnidadMedida, UPSalida, UPEntrada, TipoNegocio))
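The namespace step matters because an unprefixed XPath expression does not match nodes that sit in a default namespace. The sketch below shows this on a toy document; the namespace URI `urn:example:toy` and the element layout are made up for illustration and are not taken from the real file.

```r
library(xml2)

# Hypothetical document with a default namespace, as check with xml_ns() would reveal.
toy_ns <- read_xml(
  '<Mensaje xmlns="urn:example:toy"><SeriesTemporales/><SeriesTemporales/></Mensaje>'
)

# Before stripping, the unprefixed XPath finds no nodes.
n_before <- length(xml_find_all(toy_ns, ".//SeriesTemporales"))

# xml_ns_strip() modifies the document in place.
xml_ns_strip(toy_ns)

# After stripping, the same XPath matches both nodes.
n_after <- length(xml_find_all(toy_ns, ".//SeriesTemporales"))

c(before = n_before, after = n_after)
```

An alternative that leaves the document untouched is to keep the namespace and use its prefix in the XPath (for example `d1:SeriesTemporales` with `xml_ns(page)` passed as the `ns` argument); stripping is simply the shorter route when only one namespace is involved.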

I am not sure about your request for the "Pos" node: there are 24 per parent node, so they do not fit conveniently into a single data.frame alongside the other values. If you only need the first one, follow the format above; if not, that may be better asked as a separate question.
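If one row per "Pos" value is acceptable, which is what the expected output in the question suggests, a possible approach is to build a small data frame per SeriesTemporales node and stack them. This is only a sketch on a hypothetical toy document: the nesting Periodo > Intervalo > Pos is inferred from the column names in the question, the values are invented, and the toy has no namespace, so no stripping is needed.

```r
library(xml2)

# Hypothetical miniature of the real file (two series, three Pos nodes in total).
toy <- read_xml('
<Mensaje>
  <SeriesTemporales>
    <UPEntrada v="ZERBI"/>
    <Periodo>
      <Intervalo><Pos v="1"/></Intervalo>
      <Intervalo><Pos v="2"/></Intervalo>
    </Periodo>
  </SeriesTemporales>
  <SeriesTemporales>
    <UPSalida v="LUBAC01"/>
    <Periodo>
      <Intervalo><Pos v="1"/></Intervalo>
    </Periodo>
  </SeriesTemporales>
</Mensaje>')

series <- xml_find_all(toy, ".//SeriesTemporales")

# One data frame per parent: the parent-level values are recycled across its Pos rows.
# Assumes every parent has at least one Pos node.
rows <- lapply(seq_along(series), function(i) {
  s <- series[[i]]
  data.frame(
    UPSalida  = xml_attr(xml_find_first(s, ".//UPSalida"), "v"),
    UPEntrada = xml_attr(xml_find_first(s, ".//UPEntrada"), "v"),
    Pos       = as.numeric(xml_attr(xml_find_all(s, ".//Pos"), "v")),
    stringsAsFactors = FALSE
  )
})

# Stack the per-parent frames into one long data frame.
long <- do.call(rbind, rows)
long
```

A missing sub-node (such as UPSalida in the first series) comes back as NA, which matches the NA column in the expected output. The other columns from the answer (IntervaloTiempo, Resolucion, and so on) could be added to the `data.frame()` call in the same way.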
