rbindlist a list column of data.frames and select unique values

I have a data.table 'DT' with a column ('col2') that is a list of data frames:

require(data.table)
DT <- data.table(col1 = c('A','A','B'),
                 col2 = list(data.frame(colA = c(1,3,54, 23), 
                                        colB = c("aa", "bb", "cc", "hh")),
                             data.frame(colA =c(23, 1),
                                       colB = c("hh", "aa")), 
                             data.frame(colA = 1,
                                       colB = "aa")))

> DT
   col1         col2
1:    A <data.frame>
2:    A <data.frame>
3:    B <data.frame>

> DT$col2
[[1]]
  colA colB
1    1   aa
2    3   bb
3   54   cc
4   23   hh

[[2]]
  colA colB
1   23   hh
2    1   aa

[[3]]
  colA colB
1    1   aa

Each data.frame in col2 has two columns, colA and colB. I'd like a data.table output that binds the unique rows of those data.frames, grouped by col1 of DT. I guess it's like using rbindlist in an aggregate function of the data.table.

This is the desired output:

> #desired output
> output
   colA colB col1
1:    1   aa    A
2:    3   bb    A
3:   54   cc    A
4:   23   hh    A
5:    1   aa    B

The data frame in the second row of DT ( DT[2, col2] ) repeats rows that already appear in the first row's data frame, which has the same col1, and only unique rows are desired for each unique col1.

I tried the following and got an error.

desired_output <- DT[, lapply(col2, function(x) unique(rbindlist(x))), by = col1]
# Error in rbindlist(x) : 
#   Item 1 of list input is not a data.frame, data.table or list

This 'works', but does not give the desired output:

unique(rbindlist(DT$col2))
   colA colB
1:    1   aa
2:    3   bb
3:   54   cc
4:   23   hh

Is there any way to use rbindlist in an aggregate function of a data.table?

Group by 'col1', run rbindlist on 'col2':

unique(DT[ , rbindlist(col2), by = col1]) # trimmed thanks to @snoram
#    col1 colA colB
# 1:    A    1   aa
# 2:    A    3   bb
# 3:    A   54   cc
# 4:    A   23   hh
# 5:    B    1   aa
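
A minimal sketch of why this works where the lapply() attempt in the question did not: within each by group, col2 is already a list of data.frames, which is the input rbindlist() expects, whereas lapply(col2, ...) hands rbindlist() one data.frame at a time, whose items are plain column vectors. The names `stacked` and `out` are only illustrative.

```r
library(data.table)

# Same data as in the question.
DT <- data.table(
  col1 = c("A", "A", "B"),
  col2 = list(
    data.frame(colA = c(1, 3, 54, 23), colB = c("aa", "bb", "cc", "hh")),
    data.frame(colA = c(23, 1), colB = c("hh", "aa")),
    data.frame(colA = 1, colB = "aa")
  )
)

# Within a group, col2 is a list of data.frames, so rbindlist() can stack it.
stacked <- DT[, rbindlist(col2), by = col1]

# Drop rows that are repeated within the same col1 group.
out <- unique(stacked)
```

Here `stacked` still contains the repeated rows from the two 'A' data frames; unique() is what removes them.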

"only unique entries are desired for each unique col1"

If you add a column for col1, the requirement quoted above simply means "unique entries" (unconditional on columns).

Henrik's answer is one way to keep col1. Another is:

unique(DT[, rbindlist(setNames(col2, col1), idcol="col1")])

I guess this should be more efficient than

bycols = "col1"
unique(DT[, rbindlist(col2), by=bycols])   # Henrik's

though the extension to either (1) col1 not being a character column (hence suitable for setNames) or (2) having multiple by= columns is not so obvious. For either of these cases, I would make an id column equal to the row numbers of DT, then copy the grouping columns over:

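bycols = "col1"
res = DT[, rbindlist(col2, idcol="DT_row")]   # DT_row = source row number in DT
res[, (bycols) := DT[DT_row, ..bycols]]       # copy the grouping columns over
res[, DT_row := NULL]                         # drop the helper id so it cannot mask duplicates
res = unique(res)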
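
A sketch of the same row-number idea with more than one grouping column. The grp2 column is hypothetical (it is not in the question) and is only there to show a second, non-character by column; the lookup is done outside of j to keep the scoping simple.

```r
library(data.table)

DT <- data.table(
  col1 = c("A", "A", "B"),
  grp2 = c(1L, 1L, 2L),  # hypothetical second grouping column
  col2 = list(
    data.frame(colA = c(1, 3, 54, 23), colB = c("aa", "bb", "cc", "hh")),
    data.frame(colA = c(23, 1), colB = c("hh", "aa")),
    data.frame(colA = 1, colB = "aa")
  )
)
bycols <- c("col1", "grp2")

# For an unnamed list, idcol records the position of each data.frame in it,
# which here is the row number in DT.
res <- rbindlist(DT$col2, idcol = "DT_row")

# Look up the grouping values for every stacked row, then attach them.
keys <- DT[res$DT_row, bycols, with = FALSE]
res[, (bycols) := keys]

# Remove the helper id before deduplicating; otherwise identical rows that
# came from different rows of DT would still look distinct.
res[, DT_row := NULL]
res <- unique(res)
```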

To put those columns first/leftmost, I think setcolorder(res, bycols) should work, but I am on too old a data.table version to see it do so.
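
A small sketch of that reordering, assuming a data.table version in which setcolorder() accepts a subset of the column names and moves them to the front (to my knowledge, 1.11.0 or later). The `res` table here is a hand-made stand-in for the result above.

```r
library(data.table)

# Stand-in result with the grouping column last.
res <- data.table(colA = c(1, 3), colB = c("aa", "bb"), col1 = c("A", "A"))
bycols <- "col1"

# Moves the named columns to the front by reference; the remaining columns
# keep their relative order.
setcolorder(res, bycols)
names(res)
```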

There's also an open issue for a tidyr::unnest-like function.

You could do something hackish like this:

nDT <- cbind(rbindlist(DT[[2]]), col1 = rep(DT[[1]], sapply(DT[[2]], nrow)))
nDT[!duplicated(nDT)]
   colA colB col1
1:    1   aa    A
2:    3   bb    A
3:   54   cc    A
4:   23   hh    A
5:    1   aa    B

Or using tidyr (inspired by PKumar's comment):

unique(tidyr::unnest(DT))  # tidyr >= 1.0.0 expects the column to be named: tidyr::unnest(DT, cols = col2)

Or more generalisable base R (note that the substr() call assumes the col1 values are a single character):

names(DT[[2]]) <- DT[[1]]
ndf <- do.call(rbind, DT[[2]])
ndf$col1 <- substr(row.names(ndf), 1, 1)
unique(ndf)
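
A base R sketch that avoids recovering col1 from the row names, so it does not depend on the length of the group labels. It attaches the label to each data frame before binding; `frames`, `tagged` and the plain `col1` vector are illustrative stand-ins for DT[[2]] and DT[[1]].

```r
col1 <- c("A", "A", "B")
frames <- list(
  data.frame(colA = c(1, 3, 54, 23), colB = c("aa", "bb", "cc", "hh")),
  data.frame(colA = c(23, 1), colB = c("hh", "aa")),
  data.frame(colA = 1, colB = "aa")
)

# Pair each data frame with its group label and add the label as a column.
tagged <- Map(
  function(df, g) cbind(df, col1 = g, stringsAsFactors = FALSE),
  frames, col1
)

# Stack everything, then keep one copy of each (colA, colB, col1) row.
ndf <- unique(do.call(rbind, tagged))
rownames(ndf) <- NULL
```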

This works:

DT1<-apply(DT, 1, function(x){cbind(col1=x$col1,x$col2)})
unique(rbindlist(DT1))
#   col1 colA colB
#1:    A    1   aa
#2:    A    3   bb
#3:    A   54   cc
#4:    A   23   hh
#5:    B    1   aa
