简体   繁体   中英

R - Fast Mode Function for use in data.table[,lapply(.SD,Mode),by=.()]

I'm summarizing data in a data.table, group by, where I need to take a single value of a variable in a group. I want this value to be the mode of the group. I think it needs to be mode because usually a group is 8 rows and it will have 2 rows at one value and the other 6 or so rows will be another value.

Here's a simplified example, from this:

key1 2
key1 2
key1 2
key1 8
key1 2
key1 2
key1 2
key1 8

I want this:

key1 2

I was having trouble using the standard mode function provided by base R, so I used this solution here: Most frequent value (mode) by group

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

It worked great on my small test data set, but when I run it on my actual data set (22 million rows) it just runs and runs and runs. All my other data.table operations that are similar work great and really fast, but I'm not using a UDF. This is the structure of my data.table query:

ModeCharacterColumns <- ExposureHistory[,lapply(.SD,Mode), .(Key1=Key1, Key2=Key2, ..., key7=key7, key8=key8), .SDcols=('col1','col2','col3', ..., 'col53')]

So I'm guessing my problem is that my UDF is really slowing things down, does anyone have any suggestions where I can accomplish the same goal but get it done much quicker?

Thank you everyone!

EDIT: Better representation of the data:

DT <- fread("key1A key2A key3A key4A 2 2 4 s
             key1A key2A key3A key4A 2 2 4 s  
             key1A key2A key3A key4A 8 8 8 t
             key1A key2A key3A key4A 2 2 4 s
             key1B key2B key3B key4B 6 6 6 v
             key1B key2B key3B key4B 2 2 5 t
             key1B key2B key3B key4B 2 2 5 v
             key1B key2B key3B key4B 2 2 5 v")

And the desired result:

result <- fread("key1A key2A key3A key4A 2 2 4 s
                 key1B key2B key3B key4B 2 2 5 v")

Try using data.table to tabulate the data:

DT <- fread("key1 8
             key1 2
             key1 2
             key1 8
             key1 2
             key1 2
             key1 2
             key1 8")

setkeyv(
  DT[, .N, by = .(V1, V2)], #tabulate
  c("V1", "N") #sort by N
   )[, .(Mode = V2[.N]), by = V1] #most frequent value by V1
#     V1 Mode
#1: key1    2

You need to consider tie-breaking carefully. I might actually use a for loop to apply this to more value columns, but you'd need to provide a representative reproducible example if you want me to try that.

Edit:

Frank provides one option of doing this for several value columns in a comment:

DT[, lapply(.SD, function(x) setDT(list(x = x))[, .N, by=x][order(-N)][1L, x]), by=V1]

However, I believe this copies every value column, which might slow it down too much.

The fastest solution is the function fmode in the collapse package now available on CRAN. It computes a grouped (and optionally weighted) mode in C++, the speed is very satisfying. Syntax:

fmode(x, g = NULL, w = NULL, ...)

where x can be a vector, matrix, data.frame or dplyr grouped tibble, g is a grouping vector or list of grouping vectors, and w is a weight vector. For mixed-type aggregations, the function collap provides a tidy solution. The call

collap(data, ~ id1 + id2, FUN = fmean, catFUN = fmode, ...)

aggregates the data by id1 and id2, applying the mean to all numeric columns in data and the mode to all non-numeric (categorical) columns in data . By default the data is returned with rows sorted and columns in the original order. This solution is faster than data.table , and the collap call can be futher customized.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM