Fastest way to sort each row of a large matrix in R

I have a large matrix:

set.seed(1)
a <- matrix(runif(9e+07),ncol=300)

I want to sort each row in the matrix:

> system.time(sorted <- t(apply(a,1,sort)))
   user  system elapsed 
  42.48    3.40   45.88 

I have a lot of RAM to work with, but I would like a faster way to perform this operation.

Well, I'm not aware of that many ways to sort faster in R, and the problem is that you're only sorting 300 values at a time, but doing so many times. Still, you can eke some extra performance out of sort by calling sort.int directly and using method='quick':

set.seed(1)
a <- matrix(runif(9e+07),ncol=300)

# Your original code
system.time(sorted <- t(apply(a,1,sort))) # 31 secs

# sort.int with method='quick'
system.time(sorted2 <- t(apply(a,1,sort.int, method='quick'))) # 27 secs

# using a for-loop is slightly faster than apply (and avoids transpose):
system.time({sorted3 <- a; for(i in seq_len(nrow(a))) sorted3[i,] <- sort.int(a[i,], method='quick') }) # 26 secs
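To see why the apply versions need t() while the for-loop does not, here is a minimal sketch on a made-up 2 x 3 matrix (a_small, raw, by_apply and by_loop are illustrative names, not part of the original code). apply over rows stores each row's result as a column, so its output comes back transposed:

```r
# Tiny illustrative matrix: 2 rows, 3 columns
a_small <- matrix(c(3, 1, 2,
                    9, 7, 8), nrow = 2, byrow = TRUE)

# apply() over rows returns each sorted row as a COLUMN of the result
raw <- apply(a_small, 1, sort.int, method = "quick")
print(dim(raw))  # 3 2, i.e. ncol x nrow

# so the apply version has to be transposed back
by_apply <- t(raw)

# the for-loop writes each sorted row straight into place, no transpose needed
by_loop <- a_small
for (i in seq_len(nrow(a_small))) {
  by_loop[i, ] <- sort.int(a_small[i, ], method = "quick")
}

same <- identical(by_apply, by_loop)
print(same)
```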

A better way should be to use the parallel package to sort parts of the matrix in parallel. However, the overhead of transferring the data seems to be too large, and my machine starts swapping since I "only" have 8 GB of memory:

library(parallel)
cl <- makeCluster(4)
system.time(sorted4 <- t(parApply(cl,a,1,sort.int, method='quick'))) # Forever...
stopCluster(cl)

The grr package contains an alternative sort method, sort2, that can be used to speed up this particular operation (I have reduced the matrix size somewhat so that this benchmark doesn't take forever):

> set.seed(1)
> a <- matrix(runif(9e+06),ncol=300)
> microbenchmark::microbenchmark(sorted <- t(apply(a,1,sort))
+                                ,sorted2 <- t(apply(a,1,sort.int, method='quick'))
+                                ,sorted3 <- t(apply(a,1,grr::sort2)),times=3,unit='s')
Unit: seconds
                                                  expr       min       lq     mean   median       uq      max neval
                        sorted <- t(apply(a, 1, sort)) 1.7699799 1.865829 1.961853 1.961678 2.057790 2.153902     3
 sorted2 <- t(apply(a, 1, sort.int, method = "quick")) 1.6162934 1.619922 1.694914 1.623551 1.734224 1.844898     3
                 sorted3 <- t(apply(a, 1, grr::sort2)) 0.9316073 1.003978 1.050569 1.076348 1.110049 1.143750     3

The difference becomes dramatic when the matrix contains characters:

> set.seed(1)
> a <- matrix(sample(letters,size = 9e6,replace = TRUE),ncol=300)
> microbenchmark::microbenchmark(sorted <- t(apply(a,1,sort))
+                                ,sorted2 <- t(apply(a,1,sort.int, method='quick'))
+                                ,sorted3 <- t(apply(a,1,grr::sort2)),times=3)
Unit: seconds
                                                  expr       min        lq      mean    median        uq      max neval
                        sorted <- t(apply(a, 1, sort)) 15.436045 15.479742 15.552009 15.523440 15.609991 15.69654     3
 sorted2 <- t(apply(a, 1, sort.int, method = "quick")) 15.099618 15.340577 15.447823 15.581536 15.621925 15.66231     3
                 sorted3 <- t(apply(a, 1, grr::sort2))  1.728663  1.733756  1.780737  1.738848  1.806774  1.87470     3

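Results are identical for all three. Note that identical() compares only its first two arguments (a third positional argument is matched to num.eq), so the check is done pairwise:

> identical(sorted,sorted2) && identical(sorted,sorted3)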
[1] TRUE

Another excellent method, from Martin Morgan, needs no external packages. It comes from the question "Fastest way to select i-th highest value from row and assign to new column":

matrix(a[order(row(a), a)], ncol=ncol(a), byrow=TRUE)
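To make the one-liner easier to follow, here is a step-by-step sketch on a small made-up matrix (a_small, ord and sorted_rows are illustrative names). The idea is to order all cells by row index first and by value second, then refill the matrix row by row:

```r
a_small <- matrix(c(3, 1, 2, 5,
                    9, 7, 8, 6,
                    4, 4, 0, 1), nrow = 3, byrow = TRUE)

# row(a_small) labels every cell with its row index; order() then ranks
# cell positions by row first and by value within each row second
ord <- order(row(a_small), a_small)

# a_small[ord] is: sorted row 1, then sorted row 2, ...; refill by row
sorted_rows <- matrix(a_small[ord], ncol = ncol(a_small), byrow = TRUE)
print(sorted_rows)

# cross-check against the apply-based approach
agrees <- identical(sorted_rows, t(apply(a_small, 1, sort)))
print(agrees)
```

A single call to order replaces one sort call per row, which is presumably where the speed-up on matrices with many short rows comes from.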

There is also an equivalent for sorting by columns in the comments under the same link.
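For column-wise sorting, one plausible form of the same trick is sketched below. This is an assumption derived from the row version, not copied from the linked comments: col() replaces row(), and byrow is dropped because matrix() fills column by column by default.

```r
a_small <- matrix(c(3, 1, 2, 5,
                    9, 7, 8, 6,
                    4, 4, 0, 1), nrow = 3, byrow = TRUE)

# order cells by column index first, then by value within each column
ord <- order(col(a_small), a_small)

# default column-wise fill puts each sorted column back in place
sorted_cols <- matrix(a_small[ord], ncol = ncol(a_small))
print(sorted_cols)

# cross-check against apply() over columns (no transpose needed there)
agrees <- identical(sorted_cols, apply(a_small, 2, sort))
print(agrees)
```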

Timing code using the same data as Craig:

set.seed(1)
a <- matrix(runif(9e7),ncol=300)

use_for <- function(){
    sorted3 <- a
    for(i in seq_len(nrow(a))) 
        sorted3[i,] <- sort.int(a[i,], method='quick') 
    sorted3
}

microbenchmark::microbenchmark(times=3L,
    t(apply(a,1,sort)),
    t(apply(a,1,sort.int, method='quick')),
    use_for(),
    Rfast::rowSort(a),
    t(apply(a,1,grr::sort2)),
    mmtd=matrix(a[order(row(a), a)], ncol=ncol(a), byrow=TRUE)
)

Timings:

Unit: seconds
                                       expr       min        lq      mean    median        uq       max neval
                       t(apply(a, 1, sort)) 24.233418 24.305339 24.389650 24.377260 24.467766 24.558272     3
 t(apply(a, 1, sort.int, method = "quick")) 17.024010 17.156722 17.524487 17.289433 17.774726 18.260019     3
                                  use_for() 13.384958 13.873367 14.131813 14.361776 14.505241 14.648705     3
                          Rfast::rowSort(a)  3.758765  4.607609  5.136865  5.456452  5.825914  6.195377     3
                 t(apply(a, 1, grr::sort2))  9.810774  9.955199 10.310328 10.099624 10.560106 11.020587     3
                                       mmtd  6.147010  6.177769  6.302549  6.208528  6.380318  6.552108     3

And to present a more complete picture, another test for character class (excluding Rfast::rowSort as it cannot handle character class):

set.seed(1)
a <- matrix(sample(letters, 9e6, TRUE),ncol=300)

microbenchmark::microbenchmark(times=1L,
    t(apply(a,1,sort)),
    t(apply(a,1,sort.int, method='quick')),
    use_for(),
    #Rfast::rowSort(a),
    t(apply(a,1,grr::sort2)),
    mmtd=matrix(a[order(row(a), a, method="radix")], ncol=ncol(a), byrow=TRUE)
)

Timings:

Unit: milliseconds
                                       expr        min         lq       mean     median         uq        max neval
                       t(apply(a, 1, sort)) 14848.4356 14848.4356 14848.4356 14848.4356 14848.4356 14848.4356     1
 t(apply(a, 1, sort.int, method = "quick")) 15061.0993 15061.0993 15061.0993 15061.0993 15061.0993 15061.0993     1
                                  use_for() 14144.1264 14144.1264 14144.1264 14144.1264 14144.1264 14144.1264     1
                 t(apply(a, 1, grr::sort2))  1831.1429  1831.1429  1831.1429  1831.1429  1831.1429  1831.1429     1
                                       mmtd   440.9158   440.9158   440.9158   440.9158   440.9158   440.9158     1
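One caveat when using method="radix" on character data: according to the R documentation for sort, the radix method orders strings in the C locale, while the default method uses the session's collation. For the lowercase letters in this benchmark the two will normally agree, but with mixed-case or non-ASCII strings they may not. A small sketch (the vector x is made up for illustration):

```r
x <- c("b", "A", "a", "B")

# Radix sorting uses C-locale (byte-wise) ordering, so uppercase ASCII
# letters come before lowercase ones
radix_sorted <- sort(x, method = "radix")
print(radix_sorted)  # "A" "B" "a" "b"

# The default method follows the session's collation rules, so this
# result depends on the locale R is running in
default_sorted <- sort(x)
print(default_sorted)
```

If locale-aware ordering matters for your data, compare the two on a sample before switching to the radix version.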

Head to head:

set.seed(1)
a <- matrix(sample(letters, 9e7, TRUE),ncol=300)
microbenchmark::microbenchmark(times=1L,
    t(apply(a,1,grr::sort2)),
    mmtd=matrix(a[order(row(a), a, method="radix")], ncol=ncol(a), byrow=TRUE)
)

Timings:

Unit: seconds
                       expr       min        lq      mean    median        uq       max neval
 t(apply(a, 1, grr::sort2)) 19.273225 19.273225 19.273225 19.273225 19.273225 19.273225     1
                       mmtd  3.854117  3.854117  3.854117  3.854117  3.854117  3.854117     1

R version:

R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
