I have a data.table in which records belong to multiple groupings. I want to count the number of records that fall into the same group for two variables, where the grouping variables may include some NAs.
Example data below:
library(data.table)
mydt <- data.table(id = c(1,2,3,4,5,6),
                   travel = c("no travel", "morocco", "algeria",
                              "morocco", "morocco", NA),
                   cluster = c(1,1,1,2,2,2))
> mydt
   id    travel cluster
1:  1 no travel       1
2:  2   morocco       1
3:  3   algeria       1
4:  4   morocco       2
5:  5   morocco       2
6:  6      <NA>       2
In the above example I want to calculate how many people travelled to each destination by cluster.
Initially I was doing this using the .N notation, as below:
mydt[, ndest1 := as.double(.N), by = c("cluster", "travel")]
> mydt
   id    travel cluster ndest1
1:  1 no travel       1      1
2:  2   morocco       1      1
3:  3   algeria       1      1
4:  4   morocco       2      2
5:  5   morocco       2      2
6:  6      <NA>       2      1
However, NAs are counted as a value. This doesn't work well for my purposes, since I later want to identify which destination within each cluster the most people travelled to (morocco in cluster 2 above) using max(...), and if there are a lot of NAs in a given cluster, 'NA' will incorrectly be flagged as the most popular destination.
I then tried using sum() instead, as this is intuitive and also allows me to exclude NAs:
mydt[, ndest2 := sum(!is.na(travel)), by = c("cluster", "travel")]
> mydt
   id    travel cluster ndest1 ndest2
1:  1 no travel       1      1      1
2:  2   morocco       1      1      1
3:  3   algeria       1      1      1
4:  4   morocco       2      2      1
5:  5   morocco       2      2      1
6:  6      <NA>       2      1      0
This gives incorrect results. After a bit of further testing, it seems to be because I have used the same variable for the logic statement within sum(...) as one of the grouping variables in the by statement.
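The suspicion above can be checked directly. As far as I know, data.table exposes each grouping column inside j as a single value per group (length 1) rather than as the full vector of rows, which would explain why sum(!is.na(travel)) can only return 0 or 1 here. A minimal sketch on a fresh copy of the example data (the names chk, len_travel and n_rows are made up for illustration):

```r
library(data.table)

# Fresh copy of the example data so the check does not touch mydt
chk <- data.table(id = 1:6,
                  travel = c("no travel", "morocco", "algeria",
                             "morocco", "morocco", NA),
                  cluster = c(1, 1, 1, 2, 2, 2))

# Compare the length of the grouping column as seen in j
# with the actual number of rows in the group (.N)
res <- chk[, .(len_travel = length(travel), n_rows = .N),
           by = .(cluster, travel)]
print(res)
```

If len_travel is 1 in every group while n_rows varies, the grouping column is being seen as one value per group, so summing over it cannot count rows.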
When I use a different variable I get the desired result except that I am not able to exclude NAs this way:
mydt[, ndest3 := sum(!is.na(id)), by = c("cluster", "travel")]
> mydt
   id    travel cluster ndest1 ndest2 ndest3
1:  1 no travel       1      1      1      1
2:  2   morocco       1      1      1      1
3:  3   algeria       1      1      1      1
4:  4   morocco       2      2      1      2
5:  5   morocco       2      2      1      2
6:  6      <NA>       2      1      0      1
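One possible way to keep the per-group count while leaving NA destinations uncounted is to test the grouping value itself in j. This is only a sketch: it assumes travel is a single value per group inside j (because it is a grouping column), and the names dt2 and ndest4 are hypothetical.

```r
library(data.table)

# Fresh copy of the example data
dt2 <- data.table(id = 1:6,
                  travel = c("no travel", "morocco", "algeria",
                             "morocco", "morocco", NA),
                  cluster = c(1, 1, 1, 2, 2, 2))

# travel is a grouping column, so inside j it is one value per group:
# return NA for the NA group and the group size .N otherwise
dt2[, ndest4 := if (is.na(travel)) NA_integer_ else .N,
    by = .(cluster, travel)]
print(dt2)
```

Both branches return an integer, so the new column keeps a consistent type across groups.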
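This leads me to two questions: how can I exclude NAs when counting records per group, and why does sum() give unexpected results when the variable it tests is also one of the grouping variables in by? Any insights would be much appreciated.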
You can exclude NAs in i:
mydt[!is.na(travel), ndest1 := .N, by = .(travel, cluster)][]
#   id    travel cluster ndest1
#1:  1 no travel       1      1
#2:  2   morocco       1      1
#3:  3   algeria       1      1
#4:  4   morocco       2      2
#5:  5   morocco       2      2
#6:  6      <NA>       2     NA
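Building on the same idea of filtering in i, the follow-up step from the question (finding the most popular destination per cluster) could be sketched as below. The names dt3, counts, top and n are made up for illustration. Note that which.max() returns the first maximum, so ties such as those in cluster 1 go to the first row, and "no travel" is treated like any other destination.

```r
library(data.table)

# Fresh copy of the example data
dt3 <- data.table(id = 1:6,
                  travel = c("no travel", "morocco", "algeria",
                             "morocco", "morocco", NA),
                  cluster = c(1, 1, 1, 2, 2, 2))

# Count travellers per cluster and destination, skipping rows with NA travel
counts <- dt3[!is.na(travel), .(n = .N), by = .(cluster, travel)]

# Keep the destination with the highest count in each cluster
# (which.max returns the first maximum, so ties go to the first row)
top <- counts[, .SD[which.max(n)], by = cluster]
print(top)
```

Because the NA rows are dropped before counting, NA can never come out as the top destination, whatever its frequency within a cluster.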