
Find columns with multiple unique values per group when grouping by other variables in R

For example, I have a dataset (my real dataset has hundreds of columns and millions of observations):

> data
    time_period country      var1   var2 var3
 1:     2000_Q2     USA     april spring  ABS
 2:     2000_Q2     USA       may spring  CND
 3:     2000_Q3     USA      june summer  ABS
 4:     2005_Q2   Italy     april spring  ABS
 5:     2010_Q2   Italy       may spring  ABS
 6:     2010_Q2   Italy       may spring  ABS
 7:     2012_Q2   Italy       may spring  ABS
 8:     2018_Q4   China   october autumn  ABS
 9:     2018_Q4   China september autumn  ARF
10:     2018_Q4   China   october autumn  ABS
11:     2018_Q4   China   october autumn  ABS
12:     2000_Q2     USA     march spring  ABS

I need to group the data by time_period and country and find every variable that has more than one unique value in at least one group. In my dataset, all these variables are categorical (type chr).
I did this in the following way:

colnames_to_group = c("time_period", "country", "var1", "var2", "var3")
data_count = data[, lapply(.SD, uniqueN), 
                  by = c("time_period", "country"), 
                  .SDcols = colnames_to_group]
> data_count
   time_period country time_period country var1 var2 var3
1:     2000_Q2     USA           1       1    3    1    2
2:     2000_Q3     USA           1       1    1    1    1
3:     2005_Q2   Italy           1       1    1    1    1
4:     2010_Q2   Italy           1       1    1    1    1
5:     2012_Q2   Italy           1       1    1    1    1
6:     2018_Q4   China           1       1    2    1    2

tokeep <- which(sapply(data_count, is.numeric))
multiple_values <- sapply(data_count[ , tokeep, with = FALSE], function(x) any(x > 1))

> multiple_values
time_period     country        var1        var2        var3 
      FALSE       FALSE        TRUE       FALSE        TRUE 

colnames_with_multiple_values <- names(multiple_values)[multiple_values]

> colnames_with_multiple_values
[1] "var1" "var3"

This works and gives the result I want, but on huge datasets it takes a very long time to run.
Is there a more efficient way to implement this?
Thanks for any help!

On very large vectors, uniqueN() is the faster function.
But when there are many subgroups, as in my task, uniqueN() has to be called millions of times on small vectors, and the slowdown is significant.
In that case it is better to use the n_distinct function from dplyr.
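As a minimal sketch of that swap (not necessarily the exact code that was benchmarked), n_distinct can be dropped into the same data.table call in place of uniqueN. The names group_cols, value_cols and cols_with_multiple_values are illustrative, and the grouping columns are left out of .SDcols here because their count within a group is always 1.

```r
library(data.table)

# Sample data copied from the question
data <- data.table(
  time_period = c("2000_Q2", "2000_Q2", "2000_Q3", "2005_Q2", "2010_Q2", "2010_Q2",
                  "2012_Q2", "2018_Q4", "2018_Q4", "2018_Q4", "2018_Q4", "2000_Q2"),
  country = c("USA", "USA", "USA", "Italy", "Italy", "Italy",
              "Italy", "China", "China", "China", "China", "USA"),
  var1 = c("april", "may", "june", "april", "may", "may",
           "may", "october", "september", "october", "october", "march"),
  var2 = c("spring", "spring", "summer", "spring", "spring", "spring",
           "spring", "autumn", "autumn", "autumn", "autumn", "spring"),
  var3 = c("ABS", "CND", "ABS", "ABS", "ABS", "ABS",
           "ABS", "ABS", "ARF", "ABS", "ABS", "ABS")
)

group_cols <- c("time_period", "country")
value_cols <- setdiff(names(data), group_cols)  # only the non-grouping columns

# Count distinct values per group with dplyr::n_distinct instead of uniqueN
data_count <- data[, lapply(.SD, dplyr::n_distinct),
                   by = group_cols,
                   .SDcols = value_cols]

# Keep the columns whose count exceeds 1 in at least one group
multiple_values <- sapply(data_count[, value_cols, with = FALSE],
                          function(x) any(x > 1))
cols_with_multiple_values <- names(multiple_values)[multiple_values]
print(cols_with_multiple_values)
```

On the sample data this selects var1 and var3, matching the result in the question.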

Unit: seconds
       expr       min        lq      mean    median        uq       max neval
 data_table 16.692984 16.692984 16.692984 16.692984 16.692984 16.692984     1
      dplyr  3.506646  3.506646  3.506646  3.506646  3.506646  3.506646     1
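A different approach, which I have not benchmarked, avoids the per-group calls altogether: a column has more than one value in some group exactly when the number of distinct (group, column) combinations is larger than the number of distinct groups. The sketch below uses a five-row subset of the question's data; group_cols, value_cols and has_multiple are illustrative names, and NA is counted as a value of its own.

```r
library(data.table)

# Rows 1, 2, 3, 8 and 9 of the question's data
data <- data.table(
  time_period = c("2000_Q2", "2000_Q2", "2000_Q3", "2018_Q4", "2018_Q4"),
  country     = c("USA", "USA", "USA", "China", "China"),
  var1        = c("april", "may", "june", "october", "september"),
  var2        = c("spring", "spring", "summer", "autumn", "autumn"),
  var3        = c("ABS", "CND", "ABS", "ABS", "ARF")
)

group_cols <- c("time_period", "country")
value_cols <- setdiff(names(data), group_cols)

# Number of distinct groups, computed once
n_groups <- uniqueN(data, by = group_cols)

# One uniqueN() call per column on the whole table:
# more (group, value) combinations than groups means some group has > 1 value
has_multiple <- vapply(
  value_cols,
  function(col) uniqueN(data, by = c(group_cols, col)) > n_groups,
  logical(1)
)
cols_with_multiple_values <- value_cols[has_multiple]
print(cols_with_multiple_values)
```

This calls uniqueN() once per column rather than once per group and column, which is the case the function handles well, so it may be worth timing on the real data.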
