rbindlist a list column of data.frames and select unique values
I have a data.table 'DT' with a column ('col2') that is a list of data frames:
require(data.table)
DT <- data.table(col1 = c('A','A','B'),
col2 = list(data.frame(colA = c(1,3,54, 23),
colB = c("aa", "bb", "cc", "hh")),
data.frame(colA =c(23, 1),
colB = c("hh", "aa")),
data.frame(colA = 1,
colB = "aa")))
> DT
col1 col2
1: A <data.frame>
2: A <data.frame>
3: B <data.frame>
> DT$col2
[[1]]
colA colB
1 1 aa
2 3 bb
3 54 cc
4 23 hh
[[2]]
colA colB
1 23 hh
2 1 aa
[[3]]
colA colB
1 1 aa
Each data.frame in col2 has two columns, colA and colB. I'd like a data.table output that binds the unique rows of those data.frames, grouped by col1 of DT. I guess it's like using rbindlist in an aggregate function of the data.table.
This is the desired output:
> #desired output
> output
colA colB col1
1: 1 aa A
2: 3 bb A
3: 54 cc A
4: 23 hh A
5: 1 aa B
The data frame in the second row of DT (DT[2, col2]) repeats rows that already appear in the first row's data frame, and only unique entries are desired for each unique col1.
I tried the following and got an error:
desired_output <- DT[, lapply(col2, function(x) unique(rbindlist(x))), by = col1]
# Error in rbindlist(x) :
# Item 1 of list input is not a data.frame, data.table or list
This 'works', but it is not the desired output, because col1 is lost:
unique(rbindlist(DT$col2))
colA colB
1: 1 aa
2: 3 bb
3: 54 cc
4: 23 hh
Is there any way to use rbindlist in an aggregate function of a data.table?
Group by 'col1', run rbindlist on 'col2':
unique(DT[ , rbindlist(col2), by = col1]) # trimmed thanks to @snoram
# col1 colA colB
# 1: A 1 aa
# 2: A 3 bb
# 3: A 54 cc
# 4: A 23 hh
# 5: B 1 aa
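For what it's worth, the attempt in the question seems to fail because, inside a grouped j, col2 is already a list of data.frames. lapply() then hands each data.frame to rbindlist() on its own, and rbindlist() sees a list whose first item is a plain vector. A minimal sketch of the difference, rebuilding DT from the question (the names `err` and `res` are only illustrative):

```r
library(data.table)

# Same data as in the question
DT <- data.table(
  col1 = c("A", "A", "B"),
  col2 = list(
    data.frame(colA = c(1, 3, 54, 23), colB = c("aa", "bb", "cc", "hh")),
    data.frame(colA = c(23, 1), colB = c("hh", "aa")),
    data.frame(colA = 1, colB = "aa")
  )
)

# A single data.frame is a list of column vectors, so rbindlist() rejects it;
# capture the error message instead of stopping
err <- tryCatch(rbindlist(DT$col2[[1]]), error = function(e) conditionMessage(e))

# Per group, col2 is a list of data.frames, which is what rbindlist() expects
res <- unique(DT[, rbindlist(col2), by = col1])
```

Here `err` holds the error message and `res` holds the de-duplicated rows for each col1.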
"only unique entries are desired for each unique col1"
If you add a column for col1, the requirement above simply means "unique entries" (unconditional on columns), that is, unique rows of the combined table.
Henrik's answer is one way to keep col1. Another is:
unique(DT[, rbindlist(setNames(col2, col1), idcol = "col1")])
I guess this should be more efficient than
bycols = "col1"
unique(DT[, rbindlist(col2), by=bycols]) # Henrik's
though the extension to either (1) col1 not being a character column (hence suitable for setNames) or (2) having multiple by= columns is not so obvious. For either of these cases, I would make an .id column equal to the row numbers of DT, then copy the grouping columns over:
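bycols = "col1"
res = DT[, rbindlist(col2, idcol = "DT_row")]
res[, (bycols) := DT[DT_row, ..bycols]]
# drop the row id before de-duplicating; otherwise rows repeated across DT rows of the same group survive
res = unique(res[, DT_row := NULL])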
To put those columns first/leftmost, I think setcolorder(res, bycols) should work, but I am on too old a data.table version to check.
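The row-number approach can be sketched end to end for the case of several grouping columns, one of them not a character column. Everything here is an assumption made for illustration: the table `DT2`, its extra key `grp`, and the result name `res2` are hypothetical and not part of the question. The full column order is passed to setcolorder(), which should not depend on the data.table version:

```r
library(data.table)

# Hypothetical input: same shape as DT, plus a second, non-character key 'grp'
DT2 <- data.table(
  col1 = c("A", "A", "B"),
  grp  = c(1L, 1L, 2L),
  col2 = list(
    data.frame(colA = c(1, 3), colB = c("aa", "bb")),
    data.frame(colA = c(3, 5), colB = c("bb", "dd")),
    data.frame(colA = 1, colB = "aa")
  )
)
bycols <- c("col1", "grp")

# Stack the nested frames, tagging each row with the DT2 row it came from
res2 <- rbindlist(DT2$col2, idcol = "DT_row")

# Look up the key columns by row number and copy them over
keys <- DT2[res2$DT_row, bycols, with = FALSE]
res2[, (bycols) := keys]

# Drop the helper column, then de-duplicate within the keys
res2[, DT_row := NULL]
res2 <- unique(res2)

# Move the key columns to the front by giving the complete order
setcolorder(res2, c(bycols, setdiff(names(res2), bycols)))
```

The helper column is dropped before unique() so that a row appearing in two DT2 rows of the same group is counted once.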
There's also an open issue for a tidyr::unnest-like function.
You could do something hackish like this:
nDT <- cbind(rbindlist(DT[[2]]), col1 = rep(DT[[1]], sapply(DT[[2]], nrow)))
nDT[!duplicated(nDT)]
colA colB col1
1: 1 aa A
2: 3 bb A
3: 54 cc A
4: 23 hh A
5: 1 aa B
Or using tidyr (inspired by PKumar's comment):
unique(tidyr::unnest(DT))
Or more generalisable base R:
names(DT[[2]]) <- DT[[1]]
ndf <- do.call(rbind, DT[[2]])
ndf$col1 <- substr(row.names(ndf), 1, 1)
unique(ndf)
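Note that the substr() step above recovers col1 from the row names, so it assumes single-character keys. A sketch of a base R variant that attaches the key to each data frame before binding, and so should not depend on the key's length, might look like this (`col1`, `col2`, `pieces` and `ndf` are standalone illustrative objects, not the DT from the question):

```r
# Keys and nested data frames, mirroring DT$col1 and DT$col2
col1 <- c("A", "A", "B")
col2 <- list(
  data.frame(colA = c(1, 3, 54, 23), colB = c("aa", "bb", "cc", "hh")),
  data.frame(colA = c(23, 1), colB = c("hh", "aa")),
  data.frame(colA = 1, colB = "aa")
)

# Attach the key as a column of each data frame, pairing keys and frames
pieces <- Map(function(key, df) cbind(df, col1 = key), col1, col2)

# Bind everything, keep unique rows, and discard the generated row names
ndf <- unique(do.call(rbind, pieces))
rownames(ndf) <- NULL
```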
This works:
DT1<-apply(DT, 1, function(x){cbind(col1=x$col1,x$col2)})
unique(rbindlist(DT1))
# col1 colA colB
#1: A 1 aa
#2: A 3 bb
#3: A 54 cc
#4: A 23 hh
#5: B 1 aa