How to subset all nested lists with the same name from a specific list in R
The raw data is an XML file:
https://drive.google.com/file/d/1WOylIDRVDSicDjPZkDL0FyoyESfHIE_9/view?usp=sharing
library(XML)
doc <- xmlParse("./p48cierre_01-01-2019.xml")
docList <- xmlToList(doc)
mylist_SeriesTemporales <- sapply(docList, '[', 'SeriesTemporales')
$IdentificacionMensaje.NA
[1] NA
$VersionMensaje.NA
[1] NA
$TipoMensaje.NA
[1] NA
$TipoProceso.NA
[1] NA
$TipoClasificacion.NA
[1] NA
I am getting all NAs for the SeriesTemporales elements as well; the output above shows the similar result for the other elements. I want to convert all SeriesTemporales lists into a single data frame. Please help me out.
Expected output:
> xmlDataOut
# A tibble: 30,000 x 7
`Periodo.IntervaloTiempo.Attribute~ `Periodo.Resolucion.Attribute:~ `UnidadMedida.Attribute:~ `UPSalida.Attribute:v` `UPEntrada.Attribute:~ `TipoNegocio.Attribute~ `Periodo.Intervalo.Pos.Attribut~
<chr> <chr> <chr> <lgl> <chr> <chr> <dbl>
1 2021-04-20T22:00Z/2021-04-21T22:00Z PT60M MWH NA ZERBI Z21 1
2 2021-04-20T22:00Z/2021-04-21T22:00Z PT60M MWH NA ZERBI Z21 10
I cannot see how SeriesTemporales can be converted into a data frame when its elements have different lengths. However, you can extract all SeriesTemporales elements into another list, say l2, simply by doing this:
l2 <- docList[names(docList) == 'SeriesTemporales']
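To illustrate why the name-based subset works where the `sapply(docList, '[', 'SeriesTemporales')` call from the question returns NA, here is a minimal sketch on a made-up list. The `toy` object and its values are assumptions that only imitate the shape `xmlToList()` appears to produce for this file; they are not taken from the real data.

```r
# Hypothetical stand-in for docList: top-level names repeat, and leaf nodes
# are named character vectors holding the attribute "v".
toy <- list(
  IdentificacionMensaje = c(v = "msg_1"),
  SeriesTemporales = list(IdentificacionSeriesTemporales = c(v = "STP0")),
  SeriesTemporales = list(IdentificacionSeriesTemporales = c(v = "STP1"))
)

# `[` applied inside each element looks for a child called "SeriesTemporales".
# The leaf vector has no such name, so the lookup yields NA.
inner_lookup <- toy[["IdentificacionMensaje"]]["SeriesTemporales"]

# Filtering on the top-level names keeps every element with that name.
outer_subset <- toy[names(toy) == "SeriesTemporales"]
length(outer_subset)
```

The point is that `SeriesTemporales` is a name at the top level of the list, so it has to be matched against `names(docList)` rather than looked up inside each element.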
Now, if the first element of each entry in l2 is converted to a data frame, you get:
library(purrr)
map_df(l2, ~.x[1])
# A tibble: 1,256 x 1
IdentificacionSeriesTemporales
<chr>
1 STP0
2 STP1
3 STP2
4 STP3
5 STP4
6 STP5
7 STP6
8 STP7
9 STP8
10 STP9
# ... with 1,246 more rows
But the third element gives this:
map_df(l2, ~.x[3])
# A tibble: 2,512 x 2
UPSalida UPEntrada
<chr> <chr>
1 LUBAC01 NA
2 NES NA
3 FUSIC01 NA
4 NES NA
5 NA ECEGRG
6 NA NES
7 NA HYGESTE
8 NA NES
9 GNRAC01 NA
10 NES NA
# ... with 2,502 more rows
This seems like a straightforward XML document to parse. The only catch is that the information is stored in the nodes' attributes and not in the nodes themselves.
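As a small illustration of that catch, the sketch below parses a single made-up node (the element name and value mirror the expected output in the question, but the snippet itself is an assumption) and compares reading its text with reading its attribute.

```r
library(xml2)

# One self-closing node whose value lives in the attribute "v".
node <- read_xml('<UnidadMedida v="MWH"/>')

node_text <- xml_text(node)       # the node has no text content
node_attr <- xml_attr(node, "v")  # the value is held in the attribute
```

Here `node_text` is an empty string while `node_attr` is "MWH", which is why the attribute has to be read explicitly.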
Here is an xml2 solution. See the comments for an explanation.
library(xml2)
library(dplyr)
page <- read_xml("p48cierre_01-01-2019.xml")
#check for namespace
xml_ns(page)
#strip namespace
xml_ns_strip(page)
#find all SeriesTemporales nodes
seriesT <- page %>% xml_find_all(".//SeriesTemporales")
#get the requested information from each parent node:
#find the correct subnode and read its attribute,
#assuming only one such subnode per parent
Intervalo <- seriesT %>% xml_find_first(".//IntervaloTiempo") %>% xml_attr("v")
Resolution <- seriesT %>% xml_find_first(".//Resolucion") %>% xml_attr("v")
UnidadMedida <- seriesT %>% xml_find_first(".//UnidadMedida") %>% xml_attr("v")
UPSalida <- seriesT %>% xml_find_first(".//UPSalida") %>% xml_attr("v")
UPEntrada <- seriesT %>% xml_find_first(".//UPEntrada") %>% xml_attr("v")
TipoNegocio <- seriesT %>% xml_find_first(".//TipoNegocio") %>% xml_attr("v")
#combine into a final answer
head(data.frame(Intervalo, Resolution, UnidadMedida, UPSalida, UPEntrada, TipoNegocio))
I am not sure about your request for the "Pos" node: there are 24 per parent node, so they do not store conveniently in a single data.frame. If you are just looking for the first one, follow the format above; if not, it may be worth asking another question.
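If one row per "Pos" value is what is wanted, one possible approach is a long format in which the parent attributes are repeated for each "Pos" node. The sketch below is not part of the solution above and runs on a small inline document. The nesting Periodo > Intervalo > Pos is an assumption inferred from the column names in the expected output, the root element name is made up, and every parent is assumed to contain at least one "Pos" node.

```r
library(xml2)

# Made-up sample that imitates the assumed structure of the real file.
page <- read_xml('<Mensaje>
  <SeriesTemporales>
    <UnidadMedida v="MWH"/>
    <UPEntrada v="ZERBI"/>
    <Periodo>
      <Intervalo><Pos v="1"/></Intervalo>
      <Intervalo><Pos v="2"/></Intervalo>
    </Periodo>
  </SeriesTemporales>
  <SeriesTemporales>
    <UnidadMedida v="MWH"/>
    <UPSalida v="LUBAC01"/>
    <Periodo>
      <Intervalo><Pos v="1"/></Intervalo>
    </Periodo>
  </SeriesTemporales>
</Mensaje>')

seriesT <- xml_find_all(page, ".//SeriesTemporales")

# Build one small data frame per parent node; the scalar parent attributes
# are recycled to match the number of Pos values found under that parent.
rows <- lapply(seq_along(seriesT), function(i) {
  node <- seriesT[[i]]
  pos <- xml_attr(xml_find_all(node, ".//Pos"), "v")
  data.frame(
    UnidadMedida = xml_attr(xml_find_first(node, ".//UnidadMedida"), "v"),
    UPSalida = xml_attr(xml_find_first(node, ".//UPSalida"), "v"),
    UPEntrada = xml_attr(xml_find_first(node, ".//UPEntrada"), "v"),
    Pos = as.numeric(pos),
    stringsAsFactors = FALSE
  )
})

# Stack the per-parent data frames into a single long data frame.
long <- do.call(rbind, rows)
```

On the real file the namespace would still need to be stripped first, as in the solution above, and the remaining columns can be added in the same way. A subnode that is missing from a parent comes back as NA.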