
More efficient way to calculate runs in subgroups

Calculate rank of each element in subgroup

I want to add a column to a data frame that holds, for each row, the rank (running count) of that row within the subgroup defined by a combination of columns.

This works but is inefficient:

The code below solves the problem, but I am looking for a more memory- and CPU-efficient way to do it.

## using the plyr package
library(plyr)

## example data
var1 <- c(1, 1, 1, 1, 2, 2, 1, 5, 6, 7, 1, 9, 10)
var2 <- c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c", "a", "a", "a")
ex1 <- data.frame(var1, var2)

## easy but inefficient solution: number the rows within each (var1, var2) group
ex2 <- ddply(ex1, c("var1", "var2"), transform, run = seq_along(var1))

print(ex2)

The output looks like this (which is what I want):

var1  var2  run
   1  a       1
   1  a       2
   1  a       3
   1  a       4
   1  b       1
   1  c       1
   2  b       1
   2  b       2
   5  c       1
   6  c       1
   7  c       1
   9  a       1
  10  a       1

Explanation of the output (this might be so obvious that the explanation is confusing):

The combination var1==1 & var2=="a" occurred 4 times. Within this subgroup, ddply counts the rank of each element and saves that rank in the element's own row. The first time the combination occurs, run gets 1; the second time, run gets 2; and so on.
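To make the definition concrete, here is a minimal base R sketch that spells out the counting with an explicit loop. It is deliberately naive and only meant to illustrate what "run" means, not to be fast. The `key`, `seen`, and `previous` names are made up for this illustration, and the sketch assumes that neither column contains the "|" separator.

```r
# Example data from the question
var1 <- c(1, 1, 1, 1, 2, 2, 1, 5, 6, 7, 1, 9, 10)
var2 <- c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c", "a", "a", "a")
ex1 <- data.frame(var1, var2)

# Hypothetical helper: one string key per (var1, var2) combination.
# Assumption: the values never contain the "|" separator.
key <- paste(ex1$var1, ex1$var2, sep = "|")

seen <- integer(0)           # named counter: occurrences of each key so far
run  <- integer(nrow(ex1))   # result, one entry per row

for (i in seq_along(key)) {
  k <- key[i]
  previous <- if (k %in% names(seen)) seen[[k]] else 0L
  seen[k] <- previous + 1L   # bump the counter for this combination
  run[i] <- seen[[k]]        # store the running count in the same row
}

ex1$run <- run
print(ex1)
```

Unlike the ddply call, this keeps the rows in their original order, so the fourth occurrence of (1, "a") stays in row 11.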

edit

In my example, the ddply function reorders the result, but the row order is not important.

You can do it using dplyr like this:

library(dplyr)

# number the rows within each (var1, var2) group, then sort for display
ex1 %>% group_by(var1, var2) %>% mutate(run = seq_len(n())) %>% arrange(var1, var2)
#   var1 var2 run
#1     1    a   1
#2     1    a   2
#3     1    a   3
#4     1    a   4
#5     1    b   1
#6     1    c   1
#7     2    b   1
#8     2    b   2
#9     5    c   1
#10    6    c   1
#11    7    c   1
#12    9    a   1
#13   10    a   1

The arrange() call is only there to put the rows in the same order as your desired result.
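As a variation, dplyr also provides row_number(), which, called without arguments inside a grouped mutate(), numbers the rows within each group. The sketch below assumes the example data from the question and leaves out arrange(), so the original row order is kept; the `res` name is only for illustration.

```r
library(dplyr)

# Example data from the question
var1 <- c(1, 1, 1, 1, 2, 2, 1, 5, 6, 7, 1, 9, 10)
var2 <- c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c", "a", "a", "a")
ex1 <- data.frame(var1, var2)

res <- ex1 %>%
  group_by(var1, var2) %>%
  mutate(run = row_number()) %>%  # position of each row within its group
  ungroup()                       # drop the grouping so later steps are not grouped

print(res)
```

Calling ungroup() at the end is a design choice: it avoids surprises if the result is passed on to further mutate() or summarise() calls.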

And I think this is how you could do it using data.table but I'm not sure if that's the most idiomatic data.table approach:

require(data.table)

setDT(ex1)[,run:=1:.N, by=list(var1, var2)]
#   var1 var2 run
#1:    1    a   1
#2:    1    a   2
#3:    1    a   3
#4:    1    b   1
#5:    2    b   1
#6:    2    b   2
#7:    1    c   1
#8:    5    c   1
#9:    6    c   1
#10:    7    c   1
#11:    1    a   4
#12:    9    a   1
#13:   10    a   1
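Another option worth trying, as a sketch rather than a benchmarked recommendation: data.table ships a rowid() function that returns the within-group row number directly, so no by= is needed. The example below builds a fresh table (named `dt` here for illustration) from the question's data instead of converting ex1 in place.

```r
library(data.table)

# Example data from the question, built directly as a data.table
dt <- data.table(
  var1 = c(1, 1, 1, 1, 2, 2, 1, 5, 6, 7, 1, 9, 10),
  var2 = c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c", "a", "a", "a")
)

# rowid() numbers the occurrences of each (var1, var2) combination
dt[, run := rowid(var1, var2)]

print(dt)
```

Whether this is faster than the grouped version on your data would need to be measured.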

Edit:

As @DavidArenburg suggested in his comment, it would be better to use:

setDT(ex1)[,run:=seq_len(.N), by=list(var1, var2)]

for the data.table approach. Thanks for the comment!
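The comment itself is not quoted here, so the following is my reading of why seq_len() is generally preferred over 1:n: the two differ when n is zero. A small base R sketch (variable names are only for illustration):

```r
n <- 0L

colon_result  <- 1:n         # counts down from 1 to 0
seqlen_result <- seq_len(n)  # empty integer vector

print(colon_result)
print(length(colon_result))   # 2
print(length(seqlen_result))  # 0
```

As far as I know, every group in a grouped call on non-empty data has at least one row, so both forms give the same result for this example; seq_len() is simply the safer habit.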

ave works for this:

ex1$run <- ave(ex1$var1, list(ex1$var1, ex1$var2), FUN=seq_along)
ex1
   var1 var2 run
1     1    a   1
2     1    a   2
3     1    a   3
4     1    b   1
5     2    b   1
6     2    b   2
7     1    c   1
8     5    c   1
9     6    c   1
10    7    c   1
11    1    a   4
12    9    a   1
13   10    a   1

Note that the rows are not reordered.
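One detail to be aware of: as far as I can tell, ave() writes the results back into its first argument, so the run column above takes the type of ex1$var1 (numeric) rather than integer. If an integer column matters, a possible variation is to pass an integer vector as the first argument and give the grouping columns as separate arguments. This is a sketch using the question's data:

```r
# Example data from the question
var1 <- c(1, 1, 1, 1, 2, 2, 1, 5, 6, 7, 1, 9, 10)
var2 <- c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c", "a", "a", "a")
ex1 <- data.frame(var1, var2)

# The first argument only supplies length and type; its values are
# overwritten group by group with seq_along() of each group.
ex1$run <- ave(seq_len(nrow(ex1)), ex1$var1, ex1$var2, FUN = seq_along)

print(ex1)
print(is.integer(ex1$run))
```

The row order is again unchanged.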

