
Fastest way to extract single row by index from large matrix or data.table in R?

Background on goal:
I need to perform calculations on specific individual rows of a large R object with millions of rows. The calculations involve a series of matrix multiplications and are themselves already optimized to run quickly; further optimization of my code requires overcoming the bottleneck of selecting the rows on which the calculations will be performed.

Problem:
Every method I have found for selecting a specific row from a data.table or other R object runs much more slowly than the calculations performed on that row. There is a somewhat similar problem here (Fast subsetting of a matrix in R), where the recommended solution is to do the calculations on the matrix itself, without extracting rows, in Rcpp. That would require rewriting all of my calculations in C++, which I would like to avoid if there is a sufficiently efficient way to subset rows by index in R.

Example code:

library(dplyr)
library(data.table)
library(microbenchmark) # needed for the benchmark calls below
data(mtcars) # for a reproducible example

mtcars_data_table <- as.data.table(mtcars) # Convert to data.table
rownames(mtcars) <- seq(1, nrow(mtcars)) # Change row names to numerical index for each row

i = 1 # Set a dummy iterator variable to one to demonstrate code as it would be used inside a for loop

microbenchmark(mtcars_data_table[i,], times=10000) # The data.frame way
microbenchmark(mtcars_data_table[i], times=10000) # The data.table way
microbenchmark(slice(mtcars_data_table, i), times=10000) # The dplyr way

Example results:

Unit: microseconds
                        expr     min      lq     mean   median       uq      max neval
      mtcars_data_table[i, ] 238.923 255.494 282.0608 264.7255 281.4325 24862.53 10000
        mtcars_data_table[i]  235.83 249.797 296.2472  255.278  264.972 325892.1 10000
 slice(mtcars_data_table, i) 583.154 618.833 642.1725  630.209 639.1015 8099.179 10000

Call for help: The fastest method takes over an order of magnitude longer to run than the calculations performed on the row. If I can't do something about this bottleneck, I can't use R for this. Is there a faster way in R? If nothing in R is faster than the method shown, is there a faster method using Python?

Note: The calculations themselves are not shown as I don't think they're relevant to the question, but they are very similar to the matrix multiplications used for multiple linear regression in matrix form.
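For concreteness, here is a minimal stand-in for that kind of per-row calculation (the names X, A, and row_calc are invented for illustration; this is not my actual code):

X <- as.matrix(mtcars)   # numeric matrix version of the data
A <- solve(crossprod(X)) # (X'X)^-1, as in linear regression in matrix form

row_calc <- function(x, A) drop(x %*% A %*% x) # regression-style quadratic form on one row

microbenchmark(row_calc(X[i, ], A), times=10000) # time the calculation alone, without the extraction step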

I think you can try the matrix way:

t(mtcars_data_table)[,i]
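Note that t() coerces the data.table to a matrix and transposes it, so row i of the original data becomes column i, and extracting a column from a matrix is a cheap contiguous read (R matrices are column-major). In a real loop over millions of rows you would presumably transpose once, outside the loop; a sketch, assuming an outer loop over row indices as in the question:

tm <- t(as.matrix(mtcars_data_table)) # one-off transpose; rows become columns

for (i in seq_len(ncol(tm))) {
  x <- tm[, i] # row i of the original data, as a plain numeric vector
  # ... per-row calculations on x ...
}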

Benchmark

microbenchmark(mtcars_data_table[i,], # The data.frame way 
               mtcars_data_table[i], # The data.table way
               slice(mtcars_data_table, i), # The dplyr way
               t(mtcars_data_table)[,i], # the matrix way
               times=1000,
               unit = "relative")

which gives

Unit: relative
                        expr       min       lq     mean   median       uq      max neval
      mtcars_data_table[i, ] 13.584019 9.782873 9.453554 9.368050 9.408398 1.715470  1000
        mtcars_data_table[i] 13.593420 9.795455 9.288266 9.276130 9.445648 1.353841  1000
 slice(mtcars_data_table, i)  5.024677 4.046672 3.798823 3.848777 3.832374 1.447337  1000
   t(mtcars_data_table)[, i]  1.000000 1.000000 1.000000 1.000000 1.000000 1.000000  1000
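Bear in mind that mtcars is tiny (32 rows), so the per-call t() is cheap in this benchmark; on an object with millions of rows, the one-off transpose shown above is what makes this approach pay off, since each extraction then reduces to a single column read.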
