Using `:=` in data.table to sum the values of two columns in R, ignoring NAs
I have what I think is a very simple question about data.table and the `:=` function. I don't think I fully understand the behaviour of `:=`, and I often run into similar problems.
Here is some example data:
mat <- structure(list(
col1 = c(NA, 0, -0.015038, 0.003817, -0.011407),
col2 = c(0.003745, 0.007463, -0.007407, -0.003731, -0.007491)),
.Names = c("col1", "col2"),
# compact row.names for the 5 rows shown here
row.names = c(NA, -5L),
class = c("data.table", "data.frame"))
which gives:
> mat
col1 col2
1: NA 0.003745
2: 0.000000 0.007463
3: -0.015038 -0.007407
4: 0.003817 -0.003731
5: -0.011407 -0.007491
I want to create a column called col3 that holds the sum of col1 and col2. If I use
mat[,col3 := col1 + col2]
# col1 col2 col3
#1: NA 0.003745 NA
#2: 0.000000 0.007463 0.007463
#3: -0.015038 -0.007407 -0.022445
#4: 0.003817 -0.003731 0.000086
#5: -0.011407 -0.007491 -0.018898
then I get an NA for the first row, but I want NAs to be ignored. So I tried instead
mat[,col3 := sum(col1,col2,na.rm=TRUE)]
# col1 col2 col3
#1: NA 0.003745 -0.030049
#2: 0.000000 0.007463 -0.030049
#3: -0.015038 -0.007407 -0.030049
#4: 0.003817 -0.003731 -0.030049
#5: -0.011407 -0.007491 -0.030049
which is not what I am after, since it gives me the sum of all elements of col1 and col2. I think I don't quite get `:=`. How can I get the element-wise sum of col1 and col2, ignoring NA values?
Not sure this is relevant, but here is my sessionInfo:
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.8.3
This is standard R behaviour, nothing really to do with data.table.
Adding anything to NA will return NA:
NA + 1
## NA
sum will return a single number, which `:=` then assigns to every row.
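As a minimal sketch of the difference, using base R only and the first three rows of the example data (the variable names elementwise and total are just for illustration): + works element by element and propagates NA, while sum collapses everything it is given into one value.

```r
# First three rows of the example columns
col1 <- c(NA, 0, -0.015038)
col2 <- c(0.003745, 0.007463, -0.007407)

# `+` is vectorized: one result per row, and NA + x is NA
elementwise <- col1 + col2

# sum() pools all of its arguments into a single number
total <- sum(col1, col2, na.rm = TRUE)

print(elementwise)
print(total)
```

Because total has length 1, assigning it with `:=` repeats the same value in every row, which is the output shown in the question.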
If you want 1 + NA to return 1, then you will have to run something like
mat[,col3 := col1 + col2]
mat[is.na(col1), col3 := col2]
mat[is.na(col2), col3 := col1]
to deal with the cases where col1 or col2 is NA.
You could also use rowSums, which has an na.rm argument:
mat[ , col3 :=rowSums(.SD, na.rm = TRUE), .SDcols = c("col1", "col2")]
rowSums is what you want (by definition: the rowSums of a matrix containing col1 and col2, removing NA values).
(@JoshuaUlrich suggested this as a comment.)
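One thing worth checking before relying on rowSums: with na.rm = TRUE, a row in which every value is NA sums to 0 rather than NA. The small matrix below is made up for illustration (its third row is not in the original data):

```r
# Hypothetical matrix; the third row is NA in both columns
m <- cbind(col1 = c(NA, 0, NA),
           col2 = c(0.003745, 0.007463, NA))

# NAs are dropped per row before summing, so an all-NA row gives 0
rs <- rowSums(m, na.rm = TRUE)
print(rs)
```

If an all-NA row should stay NA, the is.na() assignments shown earlier keep it that way, since neither of them replaces the NA in that case.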
It's not a lack of understanding of data.table but rather one regarding vectorized functions in R. You can define a dyadic operator that behaves differently from the "+" operator with regard to missing values:
`%+na%` <- function(x,y) {ifelse( is.na(x), y, ifelse( is.na(y), x, x+y) )}
mat[ , col3:= col1 %+na% col2]
#-------------------------------
col1 col2 col3
1: NA 0.003745 0.003745
2: 0.000000 0.007463 0.007463
3: -0.015038 -0.007407 -0.022445
4: 0.003817 -0.003731 0.000086
5: -0.011407 -0.007491 -0.018898
You can use mrdwad's comment to do it with sum(..., na.rm=TRUE):
mat[ , col4 := sum(col1, col2, na.rm=TRUE), by=1:NROW(mat)]
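For a quick side-by-side check, here is a self-contained sketch that rebuilds the example table with data.table() (an assumption made for convenience, instead of the structure() call from the question) and confirms that the two approaches agree on this data. by = seq_len(nrow(mat)) is used as an equivalent of by = 1:NROW(mat).

```r
library(data.table)

# Same values as the example data in the question
mat <- data.table(
  col1 = c(NA, 0, -0.015038, 0.003817, -0.011407),
  col2 = c(0.003745, 0.007463, -0.007407, -0.003731, -0.007491)
)

# NA-tolerant addition: fall back to the other operand when one is NA
`%+na%` <- function(x, y) ifelse(is.na(x), y, ifelse(is.na(y), x, x + y))

# Vectorized version: one call over the whole column
mat[, col3 := col1 %+na% col2]

# Row-wise version: sum() is evaluated once per single-row group
mat[, col4 := sum(col1, col2, na.rm = TRUE), by = seq_len(nrow(mat))]

print(mat)
```

The two columns match here, but they would differ for a row where both col1 and col2 are NA: %+na% returns NA, while sum(..., na.rm=TRUE) returns 0.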