Replace columns with their sum by an ID variable in data.table (R)

Question

I'm trying to aggregate a certain subset of my variables according to an ID. I don't want to store the result as new variables, because the sums will replace the old variables.

I'm looking for a simple way to do this using a data.table.

For now, I've got a workaround, and I'm hoping to simplify it if possible (i.e. one-line it):

sum_vars <- c("x1","x2","x4")
tempp <- dt[ , lapply(.SD, sum), by=ID, .SDcols=sum_vars]  # one row per ID
dt[ , c(sum_vars) := NULL]                                 # drop the old columns
dt <- dt[tempp, on="ID"]                                   # join the sums back by ID
rm(tempp)

The problems I'm running into with one-lining (to get around creating that temporary variable) are:

tempp has a different number of rows than dt, because all duplicates by ID are removed. So something like this doesn't work:

dt[ , sum_vars] <- dt[ , lapply(.SD, sum), by=ID, .SDcols=sum_vars]

Also, the following in-line merge creates new variables with .1 as a suffix (e.g. x1.1):

dt <- dt[dt[ , lapply(.SD, sum), by=ID, .SDcols=sum_vars]]

I want something like this to work:

dt[ , .SD:=sum(.SD), by=ID, .SDcols=sum_vars]

But this just creates a variable named .SD.

Minimal data example

Start with:

library(data.table)
dt <- structure(list(ID = c(1L, 1L, 2L, 3L), x1 = c(1L, 1L, 1L, 1L), 
                     x2 = c(1L, 2L, 5L, 8L), x3 = c(1L, 3L, 6L, 9L), 
                     x4 = c(1L,  4L, 7L, 2L)), 
                .Names = c("ID", "x1", "x2", "x3", "x4"), 
                class = "data.frame", row.names = c(NA, -4L))
setDT(dt, key = "ID")  # convert to a data.table keyed by ID, by reference
dt
#   ID x1 x2 x3 x4
# 1  1  1  1  1  1
# 2  1  1  2  3  4
# 3  2  1  5  6  7
# 4  3  1  8  9  2

End with (x3 is not in sum_vars, so it is left unchanged):

#   ID x1 x2 x3 x4
# 1  1  2  3  1  5
# 2  1  2  3  3  5
# 3  2  1  5  6  7
# 4  3  1  8  9  2

Answer 1

See the data.table Reference Semantics vignette on GitHub:

Note that since we allow assignment by reference without quoting column names when there is only one column as explained in Section 2c, we can not do out_cols := lapply(.SD, max). That would result in adding one new column named out_col. Instead we should do either c(out_cols) or simply (out_cols). Wrapping the variable name with ( is enough to differentiate between the two cases.

You need to pass the appropriate vector of column names to the LHS of the call to :=.

Therefore the following should work (replacing the values in the original dataset):
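To make the distinction in the quoted passage concrete, here is a minimal sketch. The table toy and the names g, v, and v_max are hypothetical and only serve the illustration; out_cols mirrors the name used in the vignette.

```r
library(data.table)

# Hypothetical toy data, for illustration only
toy <- data.table(g = c("a", "a", "b"), v = c(1, 2, 3))
out_cols <- c("v_max")

# Bare symbol on the LHS: the new column is literally named "out_cols"
bare <- copy(toy)[, out_cols := max(v), by = g]
names(bare)

# Parenthesised LHS: out_cols is evaluated, so the new column is "v_max"
wrapped <- copy(toy)[, (out_cols) := max(v), by = g]
names(wrapped)
```

Comparing names(bare) with names(wrapped) should show that only the parenthesised form picks up the column name stored in the variable.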

dt[,(sum_vars) := lapply(.SD, sum), by = ID, .SDcols = sum_vars]

If you want to preserve dt:

dt_sum <- copy(dt)[,(sum_vars) := lapply(.SD, sum), by = ID, .SDcols = sum_vars]

Note that in both cases I have wrapped the vector of variable names (sum_vars, on the LHS of :=) in () to force it to be evaluated (and not simply create a column called sum_vars).
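As a rough check, the approach can be run against the sample data from the question. This sketch rebuilds the table with data.table() rather than structure(), which is an assumption made for brevity; the values are the same.

```r
library(data.table)

# Sample data from the question, rebuilt with data.table() for brevity
dt <- data.table(ID = c(1L, 1L, 2L, 3L),
                 x1 = c(1L, 1L, 1L, 1L),
                 x2 = c(1L, 2L, 5L, 8L),
                 x3 = c(1L, 3L, 6L, 9L),
                 x4 = c(1L, 4L, 7L, 2L))
sum_vars <- c("x1", "x2", "x4")

# Work on a copy so the original can be compared afterwards
dt_sum <- copy(dt)[, (sum_vars) := lapply(.SD, sum), by = ID, .SDcols = sum_vars]

# Row count is unchanged; only the columns in sum_vars hold group totals
dt_sum$x1  # 2 2 1 1
dt_sum$x2  # 3 3 5 8
dt_sum$x3  # 1 3 6 9 (not in sum_vars, so unchanged)
dt_sum$x4  # 5 5 7 2
dt$x2      # 1 2 5 8 (original untouched thanks to copy())
```

Without copy(), the same call would modify dt itself, because := assigns by reference.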

Answer 1 was accepted (5 votes, posted 2014-07-29 03:14:43).
