Background on goal:
I need to perform calculations on specific individual rows of a large R object with millions of rows. The calculations are a series of matrix multiplications and are already optimized to run quickly; further optimization of my code requires overcoming the bottleneck of quickly selecting the rows on which the calculations are performed.
Problem:
Every method I have been able to find for selecting specific rows from a data.table or other R object runs much more slowly than the calculations performed on the row. A somewhat similar problem is discussed here (Fast subsetting of a matrix in R), where the recommended solution is to do the calculations on the matrix itself in Rcpp, without extracting rows. That would require me to rewrite all of my calculations in C++, which I would like to avoid if there is a sufficiently efficient way to subset rows by index in R.
Example code:
library(dplyr)
library(data.table)
library(microbenchmark) # needed for the benchmarks below
data(mtcars) # for a reproducible example
mtcars_data_table <- as.data.table(mtcars) # convert to data.table
rownames(mtcars) <- seq(1, nrow(mtcars)) # change row names to a numerical index for each row
i <- 1 # dummy iterator variable set to one, to demonstrate the code as it would be used inside a for loop
microbenchmark(mtcars_data_table[i, ], times = 10000) # the data.frame way
microbenchmark(mtcars_data_table[i], times = 10000) # the data.table way
microbenchmark(slice(mtcars_data_table, i), times = 10000) # the dplyr way
Example results:
Unit: microseconds
                        expr     min      lq     mean   median       uq      max neval
      mtcars_data_table[i, ] 238.923 255.494 282.0608 264.7255 281.4325 24862.53 10000
        mtcars_data_table[i] 235.830 249.797 296.2472 255.2780 264.9720 325892.1 10000
 slice(mtcars_data_table, i) 583.154 618.833 642.1725 630.2090 639.1015 8099.179 10000
Call for help: The fastest method takes over an order of magnitude longer to run than the calculations performed on the row. If I can't do something about this bottleneck, I can't use R for this. Is there a faster way in R? If nothing in R is faster than the method shown, is there a faster method using Python?
Note: The calculations themselves are not shown as I don't think they're relevant to the question, but they are very similar to the matrix multiplications used for multiple linear regression in matrix form.
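For reproducibility, a hypothetical stand-in for the per-row work can be benchmarked in its place. The quadratic form below is purely an assumed placeholder (the real calculations are not shown above), chosen only because it is a few small matrix multiplications on one row:

```r
library(microbenchmark)

data(mtcars)

# Hypothetical stand-in for the per-row work: the real calculations are not
# shown, so this quadratic form is only an assumed placeholder with a
# comparable cost (a few small matrix multiplications on one row).
row_vec <- as.numeric(mtcars[1, ])  # one row as a numeric vector
A <- diag(length(row_vec))          # arbitrary 11 x 11 matrix for illustration
per_row_calc <- function(x) as.numeric(t(x) %*% A %*% x)

microbenchmark(per_row_calc(row_vec), times = 10000)
```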
I think you can try a matrix way:
t(mtcars_data_table)[,i]
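Note that t() first converts the data.table to a numeric matrix, so in a real loop you would compute the transpose once outside the loop and only index columns inside it. A minimal sketch of that pattern (the loop body is a placeholder):

```r
library(data.table)

data(mtcars)
mtcars_data_table <- as.data.table(mtcars)

# Transpose once, outside the loop: t() converts the data.table to a numeric
# matrix, so each original row becomes a column of this matrix.
mtcars_matrix_t <- t(mtcars_data_table)

for (i in 1:nrow(mtcars_data_table)) {
  row_i <- mtcars_matrix_t[, i]  # cheap column extraction: one original row
  # ... per-row matrix-multiplication calculations on row_i go here ...
}
```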
Benchmark
microbenchmark(mtcars_data_table[i, ],       # the data.frame way
               mtcars_data_table[i],         # the data.table way
               slice(mtcars_data_table, i),  # the dplyr way
               t(mtcars_data_table)[, i],    # the matrix way
               times = 1000,
               unit = "relative")
such that
Unit: relative
                        expr       min       lq     mean   median       uq      max neval
      mtcars_data_table[i, ] 13.584019 9.782873 9.453554 9.368050 9.408398 1.715470  1000
        mtcars_data_table[i] 13.593420 9.795455 9.288266 9.276130 9.445648 1.353841  1000
 slice(mtcars_data_table, i)  5.024677 4.046672 3.798823 3.848777 3.832374 1.447337  1000
   t(mtcars_data_table)[, i]  1.000000 1.000000 1.000000 1.000000 1.000000 1.000000  1000
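One caveat (my observation, not part of the answer above): the benchmark recomputes t() inside every call. If the object is converted to a matrix once beforehand, plain row indexing is also just a cheap matrix subset; a sketch under that assumption:

```r
library(data.table)
library(microbenchmark)

data(mtcars)
mtcars_data_table <- as.data.table(mtcars)
i <- 1

# Convert once, before the loop, so each iteration pays only for indexing.
mtcars_matrix   <- as.matrix(mtcars_data_table)  # rows stay rows
mtcars_matrix_t <- t(mtcars_matrix)              # rows become columns

microbenchmark(mtcars_matrix[i, ],    # plain matrix row indexing
               mtcars_matrix_t[, i],  # column indexing on the transpose
               times = 1000)
```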