
Conditional count in R data.table with two grouping variables

I have a data.table in which I have records belonging to multiple groupings. I want to count the number of records that fall into the same group for two variables, where the grouping variables may include some NAs.

Example data below:

library(data.table)

mydt <- data.table(id = c(1,2,3,4,5,6), 
                   travel = c("no travel", "morocco", "algeria", 
                              "morocco", "morocco", NA), 
                   cluster = c(1,1,1,2,2,2))

> mydt
   id    travel cluster
1:  1 no travel       1
2:  2   morocco       1
3:  3   algeria       1
4:  4   morocco       2
5:  5   morocco       2
6:  6      <NA>       2


In the above example I want to calculate how many people travelled to each destination by cluster.

Initially I was doing this using the .N notation, as below:

mydt[, ndest1 := as.double(.N), by = c("cluster", "travel")]

> mydt
   id    travel cluster ndest1
1:  1 no travel       1     1
2:  2   morocco       1     1
3:  3   algeria       1     1
4:  4   morocco       2     2
5:  5   morocco       2     2
6:  6      <NA>       2     1

However, NAs are counted as a value. This doesn't work well for my purposes, since I later want to use max(...) to identify the destination within each cluster that the most people travelled to (morocco in cluster 2 above). If a cluster contains a lot of NAs, NA will incorrectly be flagged as the most popular destination.

I then tried using sum() instead, as this is intuitive and also allows me to exclude NAs:

mydt[, ndest2 := sum(!is.na(travel)), by = c("cluster", "travel")]

> mydt
   id    travel cluster ndest1 ndest2
1:  1 no travel       1      1      1
2:  2   morocco       1      1      1
3:  3   algeria       1      1      1
4:  4   morocco       2      2      1
5:  5   morocco       2      2      1
6:  6      <NA>       2      1      0

This gives incorrect results. After a bit of further testing, it seems to be because the variable I used in the logical expression inside sum(...) is also one of the grouping variables in the by statement.

When I use a different variable I get the desired counts, but I am not able to exclude NAs this way:

mydt[, ndest3 := sum(!is.na(id)), by = c("cluster", "travel")]

> mydt
   id    travel cluster ndest1 ndest2 ndest3
1:  1 no travel       1      1      1      1
2:  2   morocco       1      1      1      1
3:  3   algeria       1      1      1      1
4:  4   morocco       2      2      1      2
5:  5   morocco       2      2      1      2
6:  6      <NA>       2      1      0      1

This leads me to two questions:

  1. In a data.table conditional count, how do I exclude NAs?
  2. Why can't the same variable be used in the logical expression inside sum() and as a grouping variable in by?

Any insights would be much appreciated.

You can exclude NAs in i, so that only rows with a non-missing travel value are counted and the NA row is left as NA:

mydt[!is.na(travel), ndest1 := .N, by = .(travel, cluster)][]
#   id    travel cluster ndest1
#1:  1 no travel       1      1
#2:  2   morocco       1      1
#3:  3   algeria       1      1
#4:  4   morocco       2      2
#5:  5   morocco       2      2
#6:  6      <NA>       2     NA
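
On the second question: inside j, a column that is also listed in by is reduced to a single value per group rather than one value per row, so sum(!is.na(travel)) can only ever return 0 or 1. The sketch below illustrates this on the example data; len_check and ndest are names made up here for illustration, and the zero-for-NA variant is just one possible alternative to filtering in i.

```r
library(data.table)

mydt <- data.table(id = c(1, 2, 3, 4, 5, 6),
                   travel = c("no travel", "morocco", "algeria",
                              "morocco", "morocco", NA),
                   cluster = c(1, 1, 1, 2, 2, 2))

# Compare the group size (.N) with the length of the grouping column inside j.
# len_check is a hypothetical name used only for this illustration.
len_check <- mydt[, .(n_rows = .N, len_travel = length(travel)),
                  by = .(cluster, travel)]
print(len_check)  # len_travel is 1 in every group, even where n_rows is 2

# Because travel has length 1 inside j, it can be tested directly:
# assign 0 to the NA group and the group size to every other group.
mydt[, ndest := if (is.na(travel)) 0L else .N, by = .(cluster, travel)]
print(mydt)
```

With this variant the NA row gets a count of 0 instead of NA, which may be more convenient if the column is later passed to max().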

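For the follow-up step mentioned in the question (finding the most popular destination within each cluster), a minimal sketch along the same lines might look like the following. The names counts, top_dest and ndest are assumptions for illustration, and ties are deliberately kept rather than broken.

```r
library(data.table)

mydt <- data.table(id = c(1, 2, 3, 4, 5, 6),
                   travel = c("no travel", "morocco", "algeria",
                              "morocco", "morocco", NA),
                   cluster = c(1, 1, 1, 2, 2, 2))

# Count travellers per cluster and destination, dropping NA rows in i
# so that missing destinations never form a group of their own.
counts <- mydt[!is.na(travel), .(ndest = .N), by = .(cluster, travel)]

# Within each cluster, keep the destination(s) with the highest count.
# All tied destinations are returned.
top_dest <- counts[, .SD[ndest == max(ndest)], by = cluster]
print(top_dest)
```

On the example data, cluster 1 returns all three destinations (each visited once) and cluster 2 returns morocco.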
