Using dynamic column names in `data.table`

I want to calculate the mean of each of several columns in a data.table, grouped by another column. My question is similar to two other questions on SO (one and two), but I couldn't apply those answers to my problem.

Here is an example:

library(data.table)
dtb <- fread(input = "condition,var1,var2,var3
      one,100,1000,10000
      one,101,1001,10001
      one,102,1002,10002
      two,103,1003,10003
      two,104,1004,10004
      two,105,1005,10005
      three,106,1006,10006
      three,107,1007,10007
      three,108,1008,10008
      four,109,1009,10009
      four,110,1010,10010")

dtb
#    condition var1 var2  var3
# 1:       one  100 1000 10000
# 2:       one  101 1001 10001
# 3:       one  102 1002 10002
# 4:       two  103 1003 10003
# 5:       two  104 1004 10004
# 6:       two  105 1005 10005
# 7:     three  106 1006 10006
# 8:     three  107 1007 10007
# 9:     three  108 1008 10008
# 10:     four  109 1009 10009
# 11:     four  110 1010 10010

Calculating each mean individually is easy; e.g. for "var1": dtb[ , mean(var1), by = condition]. But this quickly becomes cumbersome if there are many variables and you need to write out all of them. Thus, dtb[, list(mean(var1), mean(var2), mean(var3)), by = condition] is undesirable. I need the column names to be dynamic, and I wish to end up with something like this:

   condition  var1   var2    var3
1:       one 101.0 1001.0 10001.0
2:       two 104.0 1004.0 10004.0
3:     three 107.0 1007.0 10007.0
4:      four 109.5 1009.5 10009.5

You should use .SDcols, especially if you have many columns and need a particular operation performed only on a subset of them (apart from the grouping columns).

dtb[, lapply(.SD, mean), by=condition, .SDcols=2:4]

#    condition  var1   var2    var3
# 1:       one 101.0 1001.0 10001.0
# 2:       two 104.0 1004.0 10004.0
# 3:     three 107.0 1007.0 10007.0
# 4:      four 109.5 1009.5 10009.5

You could also first collect all the column names you want the mean of in a variable and then pass it to .SDcols, like this:

keys <- setdiff(names(dtb), "condition")
# keys = var1, var2, var3
dtb[, lapply(.SD, mean), by=condition, .SDcols=keys]
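
If the columns are not known ahead of time, the name vector can also be built programmatically. The following is only a sketch: the table is a small stand-in for the question's data, and the two selection rules (all numeric columns, or all names starting with "var") are illustrative assumptions, not part of the original answer.

```r
library(data.table)

# Small stand-in for the question's table (first six rows of the example)
dtb <- data.table(
  condition = rep(c("one", "two"), each = 3),
  var1 = 100:105,
  var2 = 1000:1005,
  var3 = 10000:10005
)

# Option 1: every numeric column (the character grouping column is skipped)
num_cols <- names(dtb)[sapply(dtb, is.numeric)]

# Option 2: every column whose name matches a pattern
var_cols <- grep("^var", names(dtb), value = TRUE)

# Either vector can be handed to .SDcols
res <- dtb[, lapply(.SD, mean), by = condition, .SDcols = num_cols]
print(res)
```

Both options use base R only to build the vector, so they should not depend on a particular data.table version.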

Edit: As Matthew Dowle rightly pointed out, since you need the mean computed on every other column after grouping by condition, you could just do:

dtb[, lapply(.SD, mean), by=condition]
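
This works because, without .SDcols, .SD holds every column except the grouping ones. A minimal sketch to confirm that on a toy table (the table and the helper names sd_names, all_cols and explicit are made up for illustration):

```r
library(data.table)

dtb <- data.table(
  condition = rep(c("one", "two"), each = 3),
  var1 = 100:105,
  var2 = 1000:1005
)

# Columns that .SD contains within each group: everything except `condition`
sd_names <- dtb[, .(col = names(.SD)), by = condition]

# Therefore these two calls should give the same result
all_cols <- dtb[, lapply(.SD, mean), by = condition]
explicit <- dtb[, lapply(.SD, mean), by = condition, .SDcols = c("var1", "var2")]
same <- isTRUE(all.equal(all_cols, explicit))
print(same)
```

Note that the shortcut applies mean to every remaining column, so it is only appropriate when all non-grouping columns are numeric.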

David's edit (which got rejected): Read more about .SD from this post. I find it relevant here. Thanks @David.

Edit 2: Suppose you have a data.table with 1000 rows and 301 columns (one column for grouping and 300 numeric columns):

require(data.table)
set.seed(45)
dt <- data.table(grp = sample(letters[1:15], 1000, replace=TRUE))
m  <- matrix(rnorm(300*1000), ncol=300)
dt <- cbind(dt, m)
setkey(dt, "grp")

and you wanted to find the mean of, say, columns 251:300 alone,

  • you can compute the mean of all the columns and then subset these columns (not very efficient, as you compute on the whole data).

     dt.out <- dt[, lapply(.SD, mean), by=grp]
     dim(dt.out) # 15 * 301, not efficient.

  • you can filter the data.table down to just these columns first and then compute the mean (again not necessarily the best solution, as you have to create an extra subsetted data.table every time you want to operate on certain columns).

     dt.sub <- dt[, c(1, 251:300), with=FALSE]
     setkey(dt.sub, "grp")
     dt.out <- dt.sub[, lapply(.SD, mean), by=grp]

  • you can specify each of the columns one by one as you'd normally do (but this is desirable only for smaller data.tables).

     # if you just need one or few columns
     dt.out <- dt[, list(m.v251 = mean(V251)), by = grp]

So what's the best solution? The answer is .SDcols.

As the documentation states, for a data.table x, .SDcols specifies the columns that are included in .SD.

This implicitly filters the columns that will be passed to .SD instead of creating a subset (as we did before), and it is very efficient and fast!

How can we do this?

  • By specifying the column numbers:

     dt.out <- dt[, lapply(.SD, mean), by=grp, .SDcols = 251:300]
     dim(dt.out) # 15 * 51 (what we expect)

  • Or alternatively by specifying the column ids:

     # get column ids (names V251..V300, i.e. positions 252:301, since grp is column 1)
     ids <- paste0("V", 251:300)
     dt.out <- dt[, lapply(.SD, mean), by=grp, .SDcols = ids]
     dim(dt.out) # 15 * 51 (what we expect)
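
To check the efficiency claim on your own machine, here is a rough sketch that times "compute everything, then subset" against .SDcols. It rebuilds a table of the same shape with as.data.table(), so grp ends up as the last column rather than the first; that construction, and the helper names out_all, out_sd, t_all and t_sd, are assumptions made for this illustration. No timings are quoted because they depend on the machine and the data.table version.

```r
library(data.table)

set.seed(45)
m  <- matrix(rnorm(300 * 1000), ncol = 300)
dt <- as.data.table(m)                       # columns V1 ... V300
dt[, grp := sample(letters[1:15], 1000, replace = TRUE)]

ids <- paste0("V", 251:300)

# Approach 1: compute the mean of every column, then keep the wanted ones
t_all <- system.time(
  out_all <- dt[, lapply(.SD, mean), by = grp][, c("grp", ids), with = FALSE]
)

# Approach 2: restrict .SD up front with .SDcols
t_sd <- system.time(
  out_sd <- dt[, lapply(.SD, mean), by = grp, .SDcols = ids]
)

# Both approaches should produce the same table
same <- isTRUE(all.equal(out_all, out_sd))
print(same)
print(c(all_columns = t_all[["elapsed"]], sdcols = t_sd[["elapsed"]]))
```

On a table this small a single run may be too quick to separate the two; repeating each call several times gives a steadier comparison.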

.SDcols accepts both column names and column numbers as arguments. In both cases, .SD will be provided only with the columns we've specified.

Hope this helps.
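
As a small sketch of that equivalence, the following selects the same columns once by name and once by position, and then (optionally) gives the summary columns dynamic names with setnames(). The toy table and the "mean_" prefix are illustrative choices, not part of the original answer.

```r
library(data.table)

dtb <- data.table(
  condition = rep(c("one", "two"), each = 3),
  var1 = 100:105,
  var2 = 1000:1005
)

cols <- c("var1", "var2")          # selection by name
idx  <- match(cols, names(dtb))    # the same columns as positions (2 and 3)

by_name <- dtb[, lapply(.SD, mean), by = condition, .SDcols = cols]
by_num  <- dtb[, lapply(.SD, mean), by = condition, .SDcols = idx]
same <- isTRUE(all.equal(by_name, by_num))

# Optional: rename the summary columns programmatically (modifies by_name in place)
setnames(by_name, cols, paste0("mean_", cols))
print(by_name)
```

Selecting by name tends to be more robust than selecting by position, since positions shift whenever columns are added or reordered.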
