简体   繁体   English

更新 R 中稀疏矩阵的单个元素非常慢 - 我怎样才能更快地做到这一点?

[英]Updating individual elements of a sparse matrix in R is very slow - how can I do it faster?

The profvis analysis of a script that uses a sparse matrix revealed that the update of the sparse matrix elements is the slowest step in the process, by 1 order of magnitude.对使用稀疏矩阵的脚本的profvis分析表明,稀疏矩阵元素的更新是该过程中最慢的步骤,相差 1 个数量级。
I need to understand if I can do this better (and especially faster);我需要了解我是否可以做得更好(尤其是更快); I would appreciate if someone could please advise where to look or provide suggestions.如果有人可以建议在哪里寻找或提供建议,我将不胜感激。

Here is some R code that reproduces the 'critical' part of my script:这是一些R代码,它重现了我的脚本的“关键”部分:

require(Matrix)
m <- new("dgCMatrix", i = c(0L, 1L, 2L, 6L, 8L, 0L, 1L, 2L, 5L, 6L, 
7L, 0L, 1L, 2L, 7L, 3L, 4L, 3L, 4L, 5L, 6L, 1L, 4L, 5L, 6L, 0L, 
1L, 4L, 5L, 6L, 8L, 10L, 1L, 2L, 7L, 0L, 6L, 8L, 9L, 10L, 6L, 
9L, 10L), p = c(0L, 5L, 11L, 15L, 17L, 21L, 25L, 32L, 35L, 38L, 
40L, 43L), Dim = c(11L, 11L), Dimnames = list(c("1", "2", "3", 
"4", "5", "6", "7", "8", "9", "10", "11"), c("1", "2", "3", "4", 
"5", "6", "7", "8", "9", "10", "11")), x = c(2, 1, 1, 1, 1, 1, 
3, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 
1, 2, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2), factors = list())

system.time(for (i in 1:10000) m[7,c(1,5,6,7)] <- c(0,0,1,0))

On my laptop, this takes about 7 s.在我的笔记本电脑上,这大约需要 7 秒。
[BTW, obviously I do not repeat the same operation 10000 times; [顺便说一句,显然我不会重复相同的操作 10000 次; the rows and columns that are updated change each time, but indeed it happens a very large number of times.每次更新的行和列都会改变,但确实发生了很多次。 I did the above to simulate the operations that are performed in the real script and to get a measurable time that can be compared with faster solutions that might come up.]我执行上述操作是为了模拟在真实脚本中执行的操作,并获得一个可测量的时间,该时间可以与可能出现的更快的解决方案进行比较。]

Any ideas/suggestions?有什么想法/建议吗?

PS附言
I had a similar problem in the past, but for a different case;我过去也遇到过类似的问题,但情况不同; and I can't find it back, as my activity history only goes back to a few months ago.而且我找不到它,因为我的活动历史只能追溯到几个月前。

EDIT OK, I found out how to retrieve all my old posts, and discovered that the problem I am describing here was not covered.编辑好的,我发现了如何检索我所有的旧帖子,并发现我在这里描述的问题没有被涵盖。

EDIT 2 - following up from discussion with / suggestions by pseudospin编辑 2 - 跟进与 pseudospin 的讨论/建议

require(Matrix)
require(data.table)

m <- new("dgCMatrix", i = c(0L, 1L, 2L, 6L, 8L, 0L, 1L, 2L, 5L, 6L, 
                            7L, 0L, 1L, 2L, 7L, 3L, 4L, 3L, 4L, 5L, 6L, 1L, 4L, 5L, 6L, 0L, 
                            1L, 4L, 5L, 6L, 8L, 10L, 1L, 2L, 7L, 0L, 6L, 8L, 9L, 10L, 6L, 
                            9L, 10L), p = c(0L, 5L, 11L, 15L, 17L, 21L, 25L, 32L, 35L, 38L, 
                                            40L, 43L), Dim = c(11L, 11L), Dimnames = list(c("1", "2", "3", 
                                                                                            "4", "5", "6", "7", "8", "9", "10", "11"), c("1", "2", "3", "4", 
                                                                                                                                         "5", "6", "7", "8", "9", "10", "11")), x = c(2, 1, 1, 1, 1, 1, 
                                                                                                                                                                                      3, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 
                                                                                                                                                                                      1, 2, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2), factors = list())

ms <- summary(m)
ms <- ms[order(ms$i,ms$j),]
msdt <- data.table(ms)

time_1 <- system.time(for (i in 1:5000) m[7,c(1,5,7,9)] <- c(0,0,1,0))
cat("\ntime_1 =", time_1)

time_2 <- system.time(for (i in 1:5000) ms[(ms$i == 7) & (ms$j %in% c(1,5,7,9)),"x"] <- c(0,0,1,0))
cat("\ntime_2 =", time_2)

time_3 <- system.time(for (i in 1:5000) msdt[(i == 7) & (j %in% c(1,5,7,9)),"x" := c(0,0,1,0)])
cat("\ntime_3 =", time_3)

which gave:这给了:

time_1 = 2.86 0 2.86 NA NA
time_2 = 0.23 0 0.24 NA NA
time_3 = 1.2 0.02 1.22 NA NA

Maybe this example is misleading, though, because normally I would have much higher max values of i and j , so perhaps subsetting a data.table will be much more efficient than subsetting a data.frame .不过,也许这个示例具有误导性,因为通常我的ij的最大值会更高,因此对data.table进行子集化可能比对data.frame进行子集化更有效。
To be tested with my real data...用我的真实数据进行测试...

EDIT 3 - trial with real data, including testing the dense matrix method as suggested by GKi编辑 3 - 使用真实数据进行试验,包括测试 GKi 建议的密集矩阵方法

Real data (too large to paste here): m is a sparse 5828 x 5828 matrix;真实数据(此处太大,无法粘贴): m是一个稀疏的 5828 x 5828 矩阵; 302986 / 33965584 = 0.9% of it is filled (hence its sparseness). 302986 / 33965584 = 0.9% 被填充(因此稀疏)。 It occupies 4.4 MB.它占用 4.4 MB。 The corresponding dense matrix dm = as.matrix(m) occupies 272.5 MB.对应的密集矩阵dm = as.matrix(m)占用 272.5 MB。

Testing the sparseMatrix (1), data.frame (2), data.table (3) and dense matrix (4) updating methods shows the following:测试sparseMatrix (1)、 data.frame (2)、 data.table (3)和dense matrix(4)更新方法显示如下:

time_1 = 10.25 3.19 13.72 NA NA
time_2 = 41.32 10.94 52.52 NA NA
time_3 = 35.64 7.44 43.34 NA NA
time_4 = 0.05 0.03 0.08 NA NA

So, in agreement with GKi's results, the dense matrix method is by far the fastest, but at the cost of huge memory storage.因此,与 GKi 的结果一致,密集矩阵方法是迄今为止最快的,但代价是巨大的 memory 存储。
On the other hand, the simulated data used originally gave very different results for the sparseMatrix method, which with real data is instead the second fastest among these 4.另一方面,最初使用的模拟数据对sparseMatrix方法给出了非常不同的结果,而在这 4 种方法中,实际数据却是第二快的。

Unfortunately it looks like a catch-22 situation: to have fast editing I need to use a dense matrix, but a dense matrix takes far too much memory, so I need to use a sparse matrix, which however is slow to edit:(不幸的是,它看起来像一个 catch-22 情况:要进行快速编辑,我需要使用密集矩阵,但是密集矩阵需要太多 memory,所以我需要使用稀疏矩阵,但是编辑起来很慢:(

Maybe I will need to reconsider the original suggestion by pseudospin, and use sparse vectors for each row of the matrix.也许我需要重新考虑 pseudospin 的原始建议,并为矩阵的每一行使用稀疏向量。 For that I will need to find out how I can refer to a stored R object by an indirect reference (string).为此,我需要了解如何通过间接引用(字符串)来引用存储的R object。

A comparison of different methods.不同方法的比较。 Note that the methods DataFrame , DataTable and FastMatch will work only in case you are overwriting existing values in the sparse matrix but not when you insert new values.请注意,方法DataFrameDataTableFastMatch仅在您覆盖稀疏矩阵中的现有值时有效,但在插入新值时无效。

library(microbenchmark)
microbenchmark(list = fun, control=list(order="block"))
#Unit: microseconds
#         expr     min       lq      mean   median       uq      max neval   cld
# SparseMatrix 618.307 623.1795 639.38102 627.2525 643.3895 1021.301   100     e
#  DenseMatrix   1.178   1.2210   1.36957   1.2635   1.3435    7.060   100 a    
#         Slam 259.703 264.3945 270.57151 265.9780 268.0745  426.610   100   c  
#        Spray 422.129 427.1310 463.21071 434.0705 440.6025 2880.787   100    d 
#    DataFrame  37.031  37.7910  38.98143  38.1660  38.6255   73.283   100  b   
#   DataFrameB  16.928  17.4480  17.85553  17.6910  18.0155   28.859   100 ab   
#   DataFrameC  21.007  21.7170  22.68689  21.9600  22.3735   38.175   100 ab   
#    DataTable 283.409 288.5710 299.43498 292.8395 301.1255  500.484   100   c  
#    FastMatch  40.138  40.7885  42.33623  41.2165  41.6575   82.274   100  b   
#         List   2.703   2.7900   3.28163   2.8535   2.9770   13.375   100 a    
#  Environment   2.157   2.2055   2.29915   2.2575   2.3340    4.211   100 a    

library(bench)
mark(exprs = fun, check = FALSE)
## A tibble: 11 x 13
#   expression        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
# 1 SparseMatrix 624.43µs 651.98µs     1487.    1.65KB    10.4    712     5
# 2 DenseMatrix    1.61µs   1.86µs   490003.        0B    49.0   9999     1
# 3 Slam         261.46µs 271.96µs     3606.     7.2KB     8.24  1750     4
# 4 Spray        424.08µs 439.86µs     2206.    8.16KB    10.4   1063     5
# 5 DataFrame     37.26µs  40.03µs    24378.    2.08KB     9.75  9996     4
# 6 DataFrameB    18.25µs  19.72µs    49447.     1.7KB     9.89  9998     2
# 7 DataFrameC    22.72µs  24.82µs    39142.      840B    11.7   9997     3
# 8 DataTable    288.24µs 300.25µs     3252.   18.34KB     8.24  1579     4
# 9 FastMatch     41.05µs  43.73µs    22292.    2.46KB    11.2   9995     5
#10 List            3.4µs   3.71µs   257225.        0B    25.7   9999     1
#11 Environment    2.82µs   3.11µs   306445.        0B     0    10000     0

Methods:方法:

fun <- alist(
  SparseMatrix = m[7,c(1,5,6,7)] <- c(0,0,1,0)
, DenseMatrix = dm[7,c(1,5,6,7)] <- c(0,0,1,0)
, Slam = slm[7,c(1,5,6,7)] <- c(0,0,1,0)
, Spray = spm[7,c(1,5,6,7)] <- c(0,0,1,0)
, DataFrame = ms[(ms$j == 7) & (ms$i %in% c(1,5,6,7)),"x"] <- c(0,0,1,0)
, DataFrameB = ms$x[(ms$j == 7) & (ms$i %in% c(1,5,6,7))] <- c(0,0,1,0)
, DataFrameC = {i <- which(ms$j == 7); ms$x[i[ms$i[i] %in% c(1,5,6,7)]] <- c(0,0,1,0)}
, DataTable = msdt[(j == 7) & (i %in% c(1,5,6,7)),"x" := c(0,0,1,0)]
, FastMatch = mf[mf$j %fin% 7 & (mf$i %fin% c(1,5,6,7)),"x"] <- c(0,0,1,0)
, List = ml[["7"]][c("1","5","6","7")] <- c(0,0,1,0)
, Environment = me[["7"]][c("1","5","6","7")] <- c(0,0,1,0)
)

Data:数据:

library(Matrix)
m <- new("dgCMatrix", i = c(0L, 1L, 2L, 6L, 8L, 0L, 1L, 2L, 5L, 6L, 
7L, 0L, 1L, 2L, 7L, 3L, 4L, 3L, 4L, 5L, 6L, 1L, 4L, 5L, 6L, 0L, 
1L, 4L, 5L, 6L, 8L, 10L, 1L, 2L, 7L, 0L, 6L, 8L, 9L, 10L, 6L, 
9L, 10L), p = c(0L, 5L, 11L, 15L, 17L, 21L, 25L, 32L, 35L, 38L, 
40L, 43L), Dim = c(11L, 11L), Dimnames = list(c("1", "2", "3", 
"4", "5", "6", "7", "8", "9", "10", "11"), c("1", "2", "3", "4", 
"5", "6", "7", "8", "9", "10", "11")), x = c(2, 1, 1, 1, 1, 1, 
3, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 
1, 2, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2), factors = list())

dm <- as.matrix(m) #Dense Matrix

library(slam)
slm <- as.simple_sparse_array(dm)

library(spray)
spm <- as.spray(dm)

ms <- summary(m)
ms <- ms[order(ms$i,ms$j),]

library(data.table)
msdt <- data.table(ms)

library(fastmatch)
mf <- ms

ml <- split(setNames(ms$x, ms$j), ms$i)

me <- list2env(ml, hash = TRUE)

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM