R data.table: calculating grouped frequencies

I'm trying to add columns to my data.table that essentially append a cumulative frequency table for each group that is aggregated. Unfortunately, my current solution is about ten times slower than I had hoped.

Here is what I'm using (apologies for the ugly one-liner):

DT[, c("bin1","bin2","bin3","bin4") := as.list(cumsum(hist(colx,c(lbound,bound1,bound2, bound3,ubound),plot=FALSE)$counts)), by=category]

If the bin boundaries are set at 0, 25, 50, 75, 100, I would like my table to look like this:

id category colx bin1 bin2 bin3 bin4
1  a        5    1    2    2    3
2  a        30   1    2    2    3
3  b        21   1    2    3    4
4  c        62   0    1    3    3
5  b        36   1    2    3    4
6  a        92   1    2    2    3
7  c        60   0    1    3    3
8  b        79   1    2    3    4
9  b        54   1    2    3    4
10 c        27   0    1    3    3
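
Here's a minimal setup that reproduces this example (with the boundary variables lbound through ubound set to the values just mentioned):

library(data.table)

lbound <- 0; bound1 <- 25; bound2 <- 50; bound3 <- 75; ubound <- 100  # bin boundaries

DT <- data.table(
    id       = 1:10,
    category = c("a", "a", "b", "c", "b", "a", "c", "b", "b", "c"),
    colx     = c(5, 30, 21, 62, 36, 92, 60, 79, 54, 27)
)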

In the actual dataset I'm grouping by 4 different columns, and there are millions of rows and unique groups. When I try a simpler function, such as sum (see the snippet below), it takes an acceptable amount of time to do the calculation. Is there any way to significantly speed up the counting process?
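
For comparison, this is the kind of simple grouped call that runs quickly (the sumx column name is just illustrative):

DT[, sumx := sum(colx), by=category]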

Okay, here's one way (I'm using data.table v1.9.3 here). Remove the by=.EACHI if you're using version <= 1.9.2.

# bin each value: 1 = [0,25), 2 = [25,50), 3 = [50,75), 4 = [75,100]
dt[, ival := findInterval(colx, seq(0, 100, by=25), rightmost.closed=TRUE)]
setkey(dt, category, ival)
# cross join every category with every bin and count matches per group
# (by=.EACHI), so that bins with zero observations are kept
ans <- dt[CJ(unique(category), unique(ival)), .N, allow.cartesian=TRUE, by=.EACHI]
# cumulative counts within each category, then reshape to one column per bin
ans[, N := cumsum(N), by="category"][, bin := "bin"]
ans <- dcast.data.table(ans, category ~ bin+ival, value.var="N")
# join the wide per-group result back onto the original rows
ans <- dt[ans][, ival := NULL]

    id category colx bin_1 bin_2 bin_3 bin_4
 1:  1        a    5     1     2     2     3
 2:  2        a   30     1     2     2     3
 3:  6        a   92     1     2     2     3
 4:  3        b   21     1     2     3     4
 5:  5        b   36     1     2     3     4
 6:  9        b   54     1     2     3     4
 7:  8        b   79     1     2     3     4
 8: 10        c   27     0     1     3     3
 9:  4        c   62     0     1     3     3
10:  7        c   60     0     1     3     3
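
The same recipe extends directly to more grouping columns. For the four grouping columns in your question (I'll call them g1, g2, g3, g4, since their names aren't given), it would look something like this:

dt[, ival := findInterval(colx, seq(0, 100, by=25), rightmost.closed=TRUE)]
setkey(dt, g1, g2, g3, g4, ival)
ans <- dt[CJ(unique(g1), unique(g2), unique(g3), unique(g4), unique(ival)),
          .N, allow.cartesian=TRUE, by=.EACHI]
ans[, N := cumsum(N), by="g1,g2,g3,g4"][, bin := "bin"]
ans <- dcast.data.table(ans, g1 + g2 + g3 + g4 ~ bin + ival, value.var="N")
ans <- dt[ans][, ival := NULL]

Keep in mind that CJ crosses all level combinations, so if many combinations never occur in your data this creates (and carries around) a lot of empty groups.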

Benchmark on simulated large data:

Here I generate a data.table with 20 million rows and a total of 1 million groups, using 2 grouping columns (instead of the 4 you mention in your question).

K = 1e3L     # levels per grouping column; K^2 = 1 million (x, y) groups
N = 20e6L    # 20 million rows
sim_data <- function(K, N) {
    set.seed(1L)
    ff <- function(K, N) sample(paste0("V", 1:K), N, TRUE)   # random group labels V1..VK
    data.table(x=ff(K,N), y=ff(K,N), val=sample(1:100, N, TRUE))
}

dt <- sim_data(K, N)
method1 <- function(dt) {   # same steps as above, now with two grouping columns
    dt[, ival := findInterval(val, seq(0, 100, by=25), rightmost.closed=TRUE)]
    setkey(dt, x, y, ival)
    ans <- dt[CJ(unique(x), unique(y), unique(ival)), .N, allow.cartesian=TRUE, by=.EACHI]
    ans[, N := cumsum(N), by="x,y"][, bin := "bin"]
    ans <- dcast.data.table(ans, x+y ~ bin+ival, value.var="N")
    ans <- dt[ans][, ival := NULL]
}

system.time(ans1 <- method1(dt))
#   user  system elapsed 
# 13.148   2.778  16.209 

I hope this is faster than your original solution and scales well for your real data dimensions.


Update: Here's another version using data.table's rolling joins instead of findInterval from base R. We have to modify the interval boundaries slightly so that the rolling join finds the right match.

dt <- sim_data(K, N)
method2 <- function(dt) {
    # rolling-join boundaries 24, 49, 74, 100: val is integer, so val <= 24
    # is the same as val < 25; the last interval is closed on the right
    ivals = seq(24L, 100L, by=25L)
    ivals[length(ivals)] = 100L
    setkey(dt, x, y, val)
    # ival = running row count within each group; rolling to the *last* row
    # with val <= boundary yields the cumulative frequency directly
    dt[, ival := seq_len(.N), by="x,y"]
    ans <- dt[CJ(unique(x), unique(y), ivals), roll=TRUE, mult="last"][is.na(ival), ival := 0L][, bin := "bin"]
    ans <- dcast.data.table(ans, x+y~bin+val, value.var="ival")
    dt[, ival := NULL]
    ans2 <- dt[ans]
}

system.time(ans2 <- method2(dt))
#   user  system elapsed 
# 12.538   2.649  16.079 

## check if both methods give identical results:

setkey(ans1, x,y,val)
setnames(ans2, copy(names(ans1)))
setkey(ans2, x,y,val)

identical(ans1, ans2) # [1] TRUE

Edit: Some explanation of why the OP's approach is so time consuming:

A huge reason for the difference in runtime, I suspect, is that both answers here are vectorised: they're written entirely in C and work on the whole data set directly. hist, by contrast, is an S3 generic, so each call pays for dispatch to the .default method, and the method itself is written in R. So you're essentially executing hist, an R function, about a million times (once per group), whereas the two vectorised solutions each make a single C-level call with no per-group overhead.

And since that call is the most expensive part of your expression, it's what slows everything down.
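
If you want to see that overhead in isolation, here's a rough sketch (not the benchmark above; the sizes and names are only illustrative, and timings will vary by machine):

x <- sample(1:100, 2e6, TRUE)
g <- rep(1:1e5, each=20L)      # 100,000 small groups
breaks <- seq(0, 100, by=25)

# one R-level hist() call per group: S3 dispatch + R code, repeated 1e5 times
system.time(tapply(x, g, function(v) cumsum(hist(v, breaks, plot=FALSE)$counts)))

# one vectorised C-level findInterval() call over all 2 million values at once
system.time(findInterval(x, breaks, rightmost.closed=TRUE))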
