
How to build a `data.frame` from columns with different numbers of rows that are related (not via `by`)

Here is a sample of the XML format in my dataset.

<info>
    <a>1990-01-02T06:58:12+08:00</a>
    <b>120.980</b>
    <c>23.786</c>
    <d>18.7</d>
    <e>2</e>
</info>
<info>
    <a>1990-02-02T06:58:12+08:00</a>
    <b>120.804</b>
    <c>23.790</c>
</info>

But some tags do not occur as often as others: for example, there are 4000 rows for tags a, b, c, and only 3950 rows for tags d, e.

Here is my code in R:

library(xml2)

# xml_data is the document already parsed with read_xml()
df <- data.frame(Time = xml_text(xml_find_all(xml_data, ".//a")),
                 Num = xml_text(xml_find_all(xml_data, ".//b")),
                 Dist = xml_text(xml_find_all(xml_data, ".//c")),
                 Gap = xml_text(xml_find_all(xml_data, ".//d")),
                 Type = xml_text(xml_find_all(xml_data, ".//e")),
                 stringsAsFactors = FALSE)

The error message is (I knew this would happen):

arguments imply differing number of rows

The output I want looks like the table below:

Time                       Num      Dist   Gap   Type
1990-01-02T06:58:12+08:00  120.980  23.786 18.7  2
1990-02-02T06:58:12+08:00  120.804  23.790 <NA>  <NA>
...
1993-03-03T08:42:15+08:00  120.412  23.523 <NA>  1

Which function or library should I try for this?
Thanks for helping me!

I have also tried other approaches, such as map_if.

Finally I found the solution!

When working with an XML file, be sure to get the parent node of each record first.

Here is how it works.

Take this XML file as an example (save it as test.xml):

<dataset>
  <dataset_info>
    <data_count>2</data_count>
    <status>Actual</status>
  </dataset_info>
  <data>
    <time>2019-06-01</time>
    <event>event1</event>
    <describe>describe for event1</describe>
  </data>
  <data>
    <time>2019-06-02</time>
    <event>event2</event>
  </data>
</dataset>

The describe tag is missing in event2, but we still want to build a data frame from this XML data. I was taught to use the function xml2::xml_find_all to get the values of the selected tag, with R code like this:

# library import
library(xml2)

# file reading
xml <- read_xml("path/where/the/file/is/test.xml")

# each column is collected over the whole document, so the lengths can differ
data.frame(Time = xml_text(xml_find_all(xml, ".//time")),
           Event = xml_text(xml_find_all(xml, ".//event")),
           Describe = xml_text(xml_find_all(xml, ".//describe"))
           )

This raises the error: arguments imply differing number of rows
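
To see where the mismatch comes from, it can help to count the matches per tag. The following is only a minimal sketch: it assumes the content of test.xml is supplied as an inline string, and the names xml_string, doc and counts are made up for illustration.

```r
library(xml2)

# Inline copy of test.xml so the snippet runs without a file on disk
xml_string <- '<dataset>
  <dataset_info>
    <data_count>2</data_count>
    <status>Actual</status>
  </dataset_info>
  <data>
    <time>2019-06-01</time>
    <event>event1</event>
    <describe>describe for event1</describe>
  </data>
  <data>
    <time>2019-06-02</time>
    <event>event2</event>
  </data>
</dataset>'
doc <- read_xml(xml_string)

# Number of nodes matched for each tag across the whole document
counts <- sapply(c("time", "event", "describe"), function(tag) {
  length(xml_find_all(doc, paste0(".//", tag)))
})
counts
```

Here time and event each match twice while describe matches only once, which is exactly the length mismatch that data.frame() complains about.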

So what we need to do is get the record nodes first, then look up each field once per record, as in the code below:

# library import
library(xml2)

# file reading
xml <- read_xml("path/where/the/file/is/test.xml")

# one node per record
record <- xml_find_all(xml, ".//data")

# xml_find_first() returns exactly one result per record;
# a tag that is missing from a record becomes NA
data.frame(Time = xml_text(xml_find_first(record, ".//time")),
           Event = xml_text(xml_find_first(record, ".//event")),
           Describe = xml_text(xml_find_first(record, ".//describe"))
           )

Note that calling xml_find_all on the record nodes would still pool all matches into one flat list, so the counts would still differ. With record <- xml_find_all(xml, ".//data") and xml_find_first applied to each record, every column has one entry per record, so the row counts match and the error goes away.
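
For a version that can be run without test.xml on disk, here is a minimal sketch. It assumes the XML is supplied inline, and xml_string, doc, records and the helper field() are names made up for illustration. It relies on xml_find_first returning one result per record, so a missing tag ends up as NA instead of shortening the column.

```r
library(xml2)

# Inline copy of test.xml so the snippet is self-contained
xml_string <- '<dataset>
  <dataset_info>
    <data_count>2</data_count>
    <status>Actual</status>
  </dataset_info>
  <data>
    <time>2019-06-01</time>
    <event>event1</event>
    <describe>describe for event1</describe>
  </data>
  <data>
    <time>2019-06-02</time>
    <event>event2</event>
  </data>
</dataset>'
doc <- read_xml(xml_string)

# One node per record
records <- xml_find_all(doc, ".//data")

# Hypothetical helper: text of the first <tag> child of each record, NA if absent
field <- function(nodes, tag) {
  xml_text(xml_find_first(nodes, paste0("./", tag)))
}

df <- data.frame(Time = field(records, "time"),
                 Event = field(records, "event"),
                 Describe = field(records, "describe"),
                 stringsAsFactors = FALSE)
df
```

The resulting data frame has two rows, and the Describe value for event2 is NA. The same pattern should carry over to the <info> records from the question by selecting ".//info" and looking up a to e per record, although that has not been tried on the full dataset.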

Hope this helps!
