For example, I have a dataset (my real dataset has hundreds of columns and millions of observations):
> data
time_period country var1 var2 var3
1: 2000_Q2 USA april spring ABS
2: 2000_Q2 USA may spring CND
3: 2000_Q3 USA june summer ABS
4: 2005_Q2 Italy april spring ABS
5: 2010_Q2 Italy may spring ABS
6: 2010_Q2 Italy may spring ABS
7: 2012_Q2 Italy may spring ABS
8: 2018_Q4 China october autumn ABS
9: 2018_Q4 China september autumn ARF
10: 2018_Q4 China october autumn ABS
11: 2018_Q4 China october autumn ABS
12: 2000_Q2 USA march spring ABS
I need to group the data
by time_period
and country
and find all variables that will have more than 1 unique value per group (at least for one group). In my dataset, all these variables are categorical with type chr.
I did this in the following way:
colnames_to_group = c("time_period", "country", "var1", "var2", "var3")
data_count = data[, lapply(.SD, uniqueN),
by = c("time_period", "country"),
.SDcols = colnames_to_group]
> data_count
time_period country time_period country var1 var2 var3
1: 2000_Q2 USA 1 1 3 1 2
2: 2000_Q3 USA 1 1 1 1 1
3: 2005_Q2 Italy 1 1 1 1 1
4: 2010_Q2 Italy 1 1 1 1 1
5: 2012_Q2 Italy 1 1 1 1 1
6: 2018_Q4 China 1 1 2 1 2
tokeep <- which(sapply(data_count, is.numeric))
multiple_values <- sapply(data_count[ , tokeep, with = FALSE], function(x) any(x > 1))
> multiple_values
time_period country var1 var2 var3
FALSE FALSE TRUE FALSE TRUE
colnames_with_multiple_values <<- names(multiple_values[(multiple_values == TRUE)])
> colnames_with_multiple_values
[1] "var1" "var3"
This works and this is what I want to get. But for huge datasets, this works for a very long time.
Are there any ways to implement this more optimally?
Thanks for the any help!
For very large vectors, uniqueN()
is faster.
But when there're many subgroups as in my task and uniqueN()
is supposed to be called millions of times, the performance's slowing down is significant.
In this case, it is better to use the n_distinct
function from dplyr
.
Unit: seconds
expr min lq mean median uq max neval
data_table 16.692984 16.692984 16.692984 16.692984 16.692984 16.692984 1
dplyr 3.506646 3.506646 3.506646 3.506646 3.506646 3.506646 1
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.