Replacing columns by their sum by an ID variable in data.table (R)

I'm trying to aggregate a certain subset of my variables by an ID. I don't want to store the result in new variables, because the sums should replace the old columns.

I'm looking for a simple way to do this using data.table.

For now I have a workaround, and I'm hoping to simplify it if possible (i.e., turn it into a one-liner):

sum_vars <- c("x1","x2","x4")
tempp <- dt[ , lapply(.SD, sum), by=ID, .SDcols=sum_vars]
dt[ , c(sum_vars) := NULL]
dt <- dt[tempp]
rm(tempp)

The problems I'm running into when trying to one-line it (to avoid creating that temporary variable) are:

tempp has a different number of rows than dt, because all duplicates by ID are removed. So something like this doesn't work:

dt[ , sum_vars] <- dt[ , lapply(.SD, sum), by=ID, .SDcols=sum_vars]

Also, the following in-line merge creates new variables with .1 as a suffix (e.g. x1.1):

dt <- dt[dt[ , lapply(.SD, sum), by=ID, .SDcols=sum_vars]]

I want something like this to work, but it doesn't:

dt[ , .SD:=sum(.SD), by=ID, .SDcols=sum_vars]

Instead, it just creates a variable named .SD.

Minimal data example

Start with

dt <- structure(list(ID = c(1L, 1L, 2L, 3L), x1 = c(1L, 1L, 1L, 1L), 
                     x2 = c(1L, 2L, 5L, 8L), x3 = c(1L, 3L, 6L, 9L), 
                     x4 = c(1L,  4L, 7L, 2L)), 
                .Names = c("ID", "x1", "x2", "x3", "x4"), 
                class = "data.frame", row.names = c(NA, -4L))
dt
#   ID x1 x2 x3 x4
# 1  1  1  1  1  1
# 2  1  1  2  3  4
# 3  2  1  5  6  7
# 4  3  1  8  9  2

End with

#   ID x1 x2 x3 x4
# 1  1  2  3  1  5
# 2  1  2  3  3  5
# 3  2  1  5  6  7
# 4  3  1  8  9  2

See the data.table Reference Semantics vignette on GitHub:

Note that since we allow assignment by reference without quoting column names when there is only one column as explained in Section 2c, we can not do out_cols := lapply(.SD, max). That would result in adding one new column named out_cols. Instead we should do either c(out_cols) or simply (out_cols). Wrapping the variable name with ( is enough to differentiate between the two cases.
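To make the vignette's point concrete, here is a small sketch, assuming the data.table package is installed. The tables demo1 and demo2 and the column name x1_max are made up for illustration and are not part of the question's data.

```r
library(data.table)

# Hypothetical toy tables, for illustration only
demo1 <- data.table(ID = c(1L, 1L, 2L), x1 = c(1L, 3L, 5L))
demo2 <- copy(demo1)
out_cols <- "x1_max"  # the name we actually want for the new column

# Unwrapped LHS: the symbol itself becomes the column name
demo1[, out_cols := max(x1), by = ID]
names(demo1)  # "ID" "x1" "out_cols"

# Wrapped LHS: out_cols is evaluated, so the new column is named "x1_max"
demo2[, (out_cols) := max(x1), by = ID]
names(demo2)  # "ID" "x1" "x1_max"
```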

You need to pass the appropriate vector of column names to the LHS of the call to :=

Therefore the following should work (replacing the values in the original dataset):

dt[,(sum_vars) := lapply(.SD, sum), by = ID, .SDcols = sum_vars]
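As a rough check, this is how the call plays out on the question's example data. The data there is a plain data.frame, so I am assuming it has to be converted with setDT() first; the data.frame() call is just a shorter way of writing the same values as the structure() call in the question.

```r
library(data.table)

# The question's example data, written as a plain data.frame
dt <- data.frame(ID = c(1L, 1L, 2L, 3L),
                 x1 = c(1L, 1L, 1L, 1L),
                 x2 = c(1L, 2L, 5L, 8L),
                 x3 = c(1L, 3L, 6L, 9L),
                 x4 = c(1L, 4L, 7L, 2L))
setDT(dt)  # convert to a data.table by reference

sum_vars <- c("x1", "x2", "x4")

# Overwrite each column in sum_vars with its per-ID sum; the row count is unchanged
dt[, (sum_vars) := lapply(.SD, sum), by = ID, .SDcols = sum_vars]

dt$x1  # 2 2 1 1
dt$x2  # 3 3 5 8
dt$x3  # 1 3 6 9 (not in sum_vars, so left as is)
dt$x4  # 5 5 7 2
```

Each group's sum is recycled across the rows of that group, which is why no join back onto dt is needed.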

If you want to preserve dt:

dt_sum <- copy(dt)[,(sum_vars) := lapply(.SD, sum), by = ID, .SDcols = sum_vars]
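The copy() matters because := modifies a data.table by reference. A minimal sketch of the difference, using a made-up three-row table rather than the question's data (the name alias is only for illustration):

```r
library(data.table)

# Hypothetical toy table, for illustration only
dt <- data.table(ID = c(1L, 1L, 2L), x1 = c(1L, 1L, 1L))
sum_vars <- "x1"

# copy() makes an independent table, so dt keeps its original values
dt_sum <- copy(dt)[, (sum_vars) := lapply(.SD, sum), by = ID, .SDcols = sum_vars]
dt$x1      # 1 1 1
dt_sum$x1  # 2 2 1

# Plain assignment does not copy: := through the alias changes dt as well
alias <- dt
alias[, (sum_vars) := lapply(.SD, sum), by = ID, .SDcols = sum_vars]
dt$x1      # 2 2 1
```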

Note that in both cases I have wrapped the vector of variable names (sum_vars on the LHS of :=) in () to force it to be evaluated (and not simply create a column called sum_vars).
