简体   繁体   中英

Perform rowwise non-vectorized function on a combination of columns in a data.table (R)

I have a very large data.table in R (~200,000) entries, and I want to perform a non-vectorized function to each row . This function needs inputs from two columns of this data.table. The value of one column is linked to another list with each member containing ~1,000,000 numbers. Here is a simplified case with mtcars

#setup a fake list for my function call    
gears <- mtcars %>% arrange(gear) %>% pull(gear) %>% unique
gear_lst <- lapply(gears, function(x){rnorm(1000000, mean = x**2, sd = x*2)}) %>% setNames(.,gears)  

#make a mega data table     
mega_mtcars <- sapply(mtcars, rep.int, times = 10000) %>% as.data.table

#this is the function I want to call    
my_function <- function(x,y){
    sum(x > gear_lst[[y]])
}

# rowwise call is low
out <- mega_mtcars %>% mutate(gear_c = as.character(gear)) %>% rowwise %>% mutate(out = my_function(mpg, gear_c))

One thing I tried is to add a nested column of gear_lst for each gear entry, so that I would be able to perform vectorized function. However, because the list is large, the memory failed to created such a data structure.

Update : @akrun provided a few ways, I wasn't able to test them with my original mega_mtcars because it's too big. I sized it down 100 fold and here is the performance so far (it doesn't seem any improvement over the original rowwise method):

#make a smaller mega_mtcars
mega_mtcars <- sapply(mtcars, rep.int, times = 100) %>% as.data.table

# use rowwise from dplyr
system.time(mega_mtcars %>% rowwise %>% mutate(out = my_function(mpg, as.character(gear))))
   user  system elapsed 
  8.086   2.860  10.941 
    
# use Map with data.table
system.time(mega_mtcars[, out := unlist(Map(my_function, x = mpg, y = as.character(gear)))])
  user  system elapsed 
  7.843   2.815  10.654 
    
# use dapply from collapse package
system.time(dapply(mega_mtcars[, .(mpg, gear)], MARGIN = 1, function(x) my_function(x[1], as.character(x[2]))))
   user  system elapsed 
  7.957   3.167  11.127 

Any other ideas?

With data.table , rowwise can be achieved by grouping over the row sequence

library(data.table)
mega_mtcars[, out := my_function(mpg, as.character(gear)) , 
       by = 1:nrow(mega_mtcars)]

does sorting the values in gear_lst help?

Nice, a challenging question, hm

I have one question too: does this code in data.table run in parallel?

library(data.table)
mega_mtcars[, out := my_function(mpg, as.character(gear)) , 
       by = 1:nrow(mega_mtcars)]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM