List of lists to dataframe in R

I have to cope with an ugly list called ul that looks like this:

[[1]]
[[1]]$param
     name     value 
"Section"       "1" 

[[1]]$param
   name   value 
"field"     "1" 

[[1]]$param
          name          value 
"final answer"            "1" 

[[1]]$param
    name    value 
"points"   "-0.0" 


[[2]]
[[2]]$param
     name     value 
"Section"       "1" 

[[2]]$param
   name   value 
"field"     "2" 

[[2]]$param
          name          value 
"final answer"            "1" 

[[2]]$param
    name    value 
"points"    "1.0" 


[[3]]
[[3]]$param
     name     value 
"Section"       "1" 

[[3]]$param
   name   value 
"field"     "3" 

[[3]]$param
          name          value 
"final answer"        "0.611" 

[[3]]$param
    name    value 
"points"    "1.0" 

I would like to convert the list to a simple data frame, i.e.

Section    field    final answer    points
      1        1               1      -0.0
      1        2               1       1.0
      1        3           0.611       1.0

Is there any straightforward way to achieve that, or do I have to write a function that accesses each list individually and binds it to a data frame?

The data is imported from an uglier xml file, so if someone wants to play with it there is a link to the RData file. Sorry for not having reproducible code. Thank you very much.
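For readers without the RData file, a stand-in for ul can be typed in by hand. The sketch below is only a reconstruction from the printed output above, so the real imported object may differ in details; `make_record` is a hypothetical helper introduced here for brevity.

```r
# Hand-built stand-in for `ul`, reconstructed from the printed output
# (assumption: each record is a list of four character vectors named "param").
make_record <- function(section, field, final_answer, points) {
  list(
    param = c(name = "Section", value = section),
    param = c(name = "field", value = field),
    param = c(name = "final answer", value = final_answer),
    param = c(name = "points", value = points)
  )
}

ul <- list(
  make_record("1", "1", "1", "-0.0"),
  make_record("1", "2", "1", "1.0"),
  make_record("1", "3", "0.611", "1.0")
)

# Inspect the first record to compare it with the printout
str(ul[[1]])
```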

There is probably a better solution, but this should get you started. First, we load some libraries:

R> library(plyr)
R> library(reshape2)

Then handle your lists in two parts.

##lapply applies ldply to each list element in turn
ul1 = lapply(ul, ldply)

##We then do the same again
dd = ldply(ul1)[,2:3]

Next we label each row of the output according to the list element it came from:

R> dd$num = rep(1:3, each=4)
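The label above hard-codes three records of four fields each. A possible generalization, sketched in base R on a small made-up list of the same shape, derives the labels from the list itself:

```r
# Two-record stand-in for `ul` (hypothetical data, same shape as in the question)
ul <- list(
  list(param = c(name = "Section", value = "1"),
       param = c(name = "field", value = "1")),
  list(param = c(name = "Section", value = "1"),
       param = c(name = "field", value = "2"))
)

# One label per inner element: record i is repeated lengths(ul)[i] times
num <- rep(seq_along(ul), times = lengths(ul))
num
```

This also copes with records that have different numbers of fields, which `each = 4` does not.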

Then we convert from long to wide format:

R> dcast(dd, num ~ name)

  num field final answer points Section
1   1     1            1   -0.0       1
2   2     2            1    1.0       1
3   3     3        0.611    1.0       1

An answer to a similar problem was given by Marc Schwartz at this link: https://stat.ethz.ch/pipermail/r-help/2006-August/111368.html

I'm copying it in case the link is deleted.

 as.data.frame(sapply(a, rbind))

   V1 V2 V3
1  a  b  c
2  1  3  5
3  2  4  6

or:

as.data.frame(t(sapply(a, rbind)))
   V1 V2 V3
1  a  1  2
2  b  3  4
3  c  5  6
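The list `a` is not defined in the snippet above. As an assumption inferred from the printed results (not quoted from the linked post), an input of the following shape should produce them:

```r
# Assumed input, inferred from the output shown above
a <- list(c("a", "1", "2"), c("b", "3", "4"), c("c", "5", "6"))

# Each list element becomes a column
by_column <- as.data.frame(sapply(a, rbind), stringsAsFactors = FALSE)

# Transposing first makes each list element a row instead
by_row <- as.data.frame(t(sapply(a, rbind)), stringsAsFactors = FALSE)

by_column
by_row
```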

As the structure of ul is consistent, you can simply get each column individually (using only base R):

section <- vapply(ul, function(x) as.numeric(x[[1]][2]), 0)
field <- vapply(ul, function(x) as.numeric(x[[2]][2]), 0)
final_answer <- vapply(ul, function(x) as.numeric(x[[3]][2]), 0)
points <- vapply(ul, function(x) as.numeric(x[[4]][2]), 0)

(Note: I use vapply instead of sapply because it is faster and reliably returns a vector, which is needed here.)
Then you can simply put it all together:

> data.frame(section, field, final_answer, points)
  section field final_answer points
1       1     1        1.000      0
2       1     2        1.000      1
3       1     3        0.611      1

Note that I transformed everything into numeric. If you want to retain everything as characters, delete the as.numeric and exchange 0 with "" in each call to vapply.
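A minimal sketch of that character variant, run on a made-up two-record list of the same shape. `get_value` is a hypothetical helper added here to avoid repeating the call, and `[[2]]` is used instead of `[2]` so the inner name "value" is dropped:

```r
# Two-record stand-in for `ul` (hypothetical data)
ul <- list(
  list(param = c(name = "Section", value = "1"),
       param = c(name = "field", value = "1"),
       param = c(name = "final answer", value = "1"),
       param = c(name = "points", value = "-0.0")),
  list(param = c(name = "Section", value = "1"),
       param = c(name = "field", value = "2"),
       param = c(name = "final answer", value = "1"),
       param = c(name = "points", value = "1.0"))
)

# Pull the value of the i-th field from every record, kept as character
get_value <- function(i) vapply(ul, function(x) x[[i]][[2]], "")

df_chr <- data.frame(
  section = get_value(1),
  field = get_value(2),
  final_answer = get_value(3),
  points = get_value(4),
  stringsAsFactors = FALSE
)
df_chr
```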


Late update:

There is actually a nice one-liner that extracts the complete data:

do.call("rbind", lapply(ul, function(x) as.numeric(vapply(x, "[", i = 2, ""))))

which gives:

     [,1] [,2]  [,3] [,4]
[1,]    1    1 1.000    0
[2,]    1    2 1.000    1
[3,]    1    3 0.611    1

To get the column names, use:

> vapply(ul[[1]], "[", i = 1, "")
         param          param          param          param 
     "Section"        "field" "final answer"       "points" 
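The two pieces can be combined into a data frame. This is a sketch on a made-up two-record list of the same shape; it assumes every record lists its fields in the same order as the first one:

```r
# Two-record stand-in for `ul` (hypothetical data)
ul <- list(
  list(param = c(name = "Section", value = "1"),
       param = c(name = "field", value = "1"),
       param = c(name = "final answer", value = "1"),
       param = c(name = "points", value = "-0.0")),
  list(param = c(name = "Section", value = "1"),
       param = c(name = "field", value = "2"),
       param = c(name = "final answer", value = "1"),
       param = c(name = "points", value = "1.0"))
)

# Numeric matrix with one row per record
m <- do.call("rbind", lapply(ul, function(x) as.numeric(vapply(x, "[", i = 2, ""))))

# Column names taken from the first record only (assumes a consistent field order)
colnames(m) <- unname(vapply(ul[[1]], "[", i = 1, ""))

df <- as.data.frame(m)
df
```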

I'm not sure what you mean by "a function accessing each list individually", but this is pretty straightforward using "lapply" and "do.call('rbind',...)":

I couldn't load your .RData file, so this code works for the following list:

ul <- list(param = list(
             c(name = "Section", value = "1"),
             c(name = "field", value = "1"),
             c(name = "final answer", value = "1"),
             c(name = "points", value = "-0.0")),
           param = list(
             c(name = "Section", value = "1"),
             c(name = "field", value = "2"),
             c(name = "final answer", value = "1"),
             c(name = "points", value = "1.0")))

You may have to tweak the details if your list is different; the general principle will remain the same. Just to keep the code clean, let's define the 'extractitem' function that's going to pull out all of the names or values for ul[[1]], ul[[2]], etc. This function is a little more general than you need.

extractitem <- function(listelement, item)
  unname(lapply(listelement, function(itemblock) itemblock[item]))

Now we'll just use lapply to walk through ul element by element; for each element, we extract the values into a data frame, then name the columns according to the 'names'.

rowlist <- lapply(ul, function(listelement) {
  d <- data.frame(extractitem(listelement, "value"), stringsAsFactors = FALSE)
  names(d) <- unlist(extractitem(listelement, "name"))
  d
})

rowlist is now a list of data frames; we can consolidate them into a single data frame with 'rbind'. The nice thing about using data frames in the previous step (as opposed to vectors or something with lower overhead) is that rbind will reorder the columns if necessary, so if the order of the fields changes from element to element, we're still all right.

finaldf <- do.call("rbind", rowlist)
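A small illustration of the column-matching behaviour mentioned above, using two made-up one-row data frames whose columns are in different orders:

```r
# Two hypothetical one-row data frames with the same columns in different order
d1 <- data.frame(Section = "1", points = "-0.0", stringsAsFactors = FALSE)
d2 <- data.frame(points = "1.0", Section = "1", stringsAsFactors = FALSE)

# rbind on data frames matches columns by name, using the order of the first one
combined <- do.call("rbind", list(d1, d2))
combined
```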

We still need to change the columns of finaldf from "character" to whatever's appropriate for your application, e.g.

finaldf$points <- as.numeric(finaldf$points)

and so on. The last step cleans up the data frame by stripping the automatically-generated row names:

rownames(finaldf) <- NULL
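If converting columns one at a time gets tedious, one possible shortcut is to let `type.convert` guess a type for every column. The sketch below uses a made-up stand-in for finaldf; note that it will pick integer for columns such as Section, which may or may not be what you want.

```r
# Hypothetical stand-in for `finaldf`, all columns still character
finaldf <- data.frame(
  Section = c("1", "1"),
  field = c("1", "2"),
  points = c("-0.0", "1.0"),
  stringsAsFactors = FALSE
)

# Convert every column; as.is = TRUE keeps non-numeric columns as character
finaldf[] <- lapply(finaldf, type.convert, as.is = TRUE)

sapply(finaldf, class)
```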

In case you need to tweak things, the general idea is to write a function that will format each ul[[i]] as a data frame with the correct column names; then invoke that function on each element of ul with lapply; and finally collapse the resulting list with do.call("rbind",...).
