[英]Updating individual elements of a sparse matrix in R is very slow - how can I do it faster?
The profvis
analysis of a script that uses a sparse matrix revealed that the update of the sparse matrix elements is the slowest step in the process, by 1 order of magnitude.对使用稀疏矩阵的脚本的
profvis
分析表明,稀疏矩阵元素的更新是该过程中最慢的步骤,相差 1 个数量级。
I need to understand if I can do this better (and especially faster);我需要了解我是否可以做得更好(尤其是更快); I would appreciate if someone could please advise where to look or provide suggestions.
如果有人可以建议在哪里寻找或提供建议,我将不胜感激。
Here is some R
code that reproduces the 'critical' part of my script:这是一些
R
代码,它重现了我的脚本的“关键”部分:
require(Matrix)
m <- new("dgCMatrix", i = c(0L, 1L, 2L, 6L, 8L, 0L, 1L, 2L, 5L, 6L,
7L, 0L, 1L, 2L, 7L, 3L, 4L, 3L, 4L, 5L, 6L, 1L, 4L, 5L, 6L, 0L,
1L, 4L, 5L, 6L, 8L, 10L, 1L, 2L, 7L, 0L, 6L, 8L, 9L, 10L, 6L,
9L, 10L), p = c(0L, 5L, 11L, 15L, 17L, 21L, 25L, 32L, 35L, 38L,
40L, 43L), Dim = c(11L, 11L), Dimnames = list(c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11"), c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11")), x = c(2, 1, 1, 1, 1, 1,
3, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1,
1, 2, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2), factors = list())
system.time(for (i in 1:10000) m[7,c(1,5,6,7)] <- c(0,0,1,0))
On my laptop, this takes about 7 s.在我的笔记本电脑上,这大约需要 7 秒。
[BTW, obviously I do not repeat the same operation 10000 times; [顺便说一句,显然我不会重复相同的操作 10000 次; the rows and columns that are updated change each time, but indeed it happens a very large number of times.
每次更新的行和列都会改变,但确实发生了很多次。 I did the above to simulate the operations that are performed in the real script and to get a measurable time that can be compared with faster solutions that might come up.]
我执行上述操作是为了模拟在真实脚本中执行的操作,并获得一个可测量的时间,该时间可以与可能出现的更快的解决方案进行比较。]
Any ideas/suggestions?有什么想法/建议吗?
PS附言
I had a similar problem in the past, but for a different case;我过去也遇到过类似的问题,但情况不同; and I can't find it back, as my activity history only goes back to a few months ago.
而且我找不到它,因为我的活动历史只能追溯到几个月前。
EDIT OK, I found out how to retrieve all my old posts, and discovered that the problem I am describing here was not covered.编辑好的,我发现了如何检索我所有的旧帖子,并发现我在这里描述的问题没有被涵盖。
EDIT 2 - following up from discussion with / suggestions by pseudospin编辑 2 - 跟进与 pseudospin 的讨论/建议
require(Matrix)
require(data.table)
m <- new("dgCMatrix", i = c(0L, 1L, 2L, 6L, 8L, 0L, 1L, 2L, 5L, 6L,
7L, 0L, 1L, 2L, 7L, 3L, 4L, 3L, 4L, 5L, 6L, 1L, 4L, 5L, 6L, 0L,
1L, 4L, 5L, 6L, 8L, 10L, 1L, 2L, 7L, 0L, 6L, 8L, 9L, 10L, 6L,
9L, 10L), p = c(0L, 5L, 11L, 15L, 17L, 21L, 25L, 32L, 35L, 38L,
40L, 43L), Dim = c(11L, 11L), Dimnames = list(c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11"), c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11")), x = c(2, 1, 1, 1, 1, 1,
3, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1,
1, 2, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2), factors = list())
ms <- summary(m)
ms <- ms[order(ms$i,ms$j),]
msdt <- data.table(ms)
time_1 <- system.time(for (i in 1:5000) m[7,c(1,5,7,9)] <- c(0,0,1,0))
cat("\ntime_1 =", time_1)
time_2 <- system.time(for (i in 1:5000) ms[(ms$i == 7) & (ms$j %in% c(1,5,7,9)),"x"] <- c(0,0,1,0))
cat("\ntime_2 =", time_2)
time_3 <- system.time(for (i in 1:5000) msdt[(i == 7) & (j %in% c(1,5,7,9)),"x" := c(0,0,1,0)])
cat("\ntime_3 =", time_3)
which gave:这给了:
time_1 = 2.86 0 2.86 NA NA
time_2 = 0.23 0 0.24 NA NA
time_3 = 1.2 0.02 1.22 NA NA
Maybe this example is misleading, though, because normally I would have much higher max values of i
and j
, so perhaps subsetting a data.table
will be much more efficient than subsetting a data.frame
.不过,也许这个示例具有误导性,因为通常我的
i
和j
的最大值会更高,因此对data.table
进行子集化可能比对data.frame
进行子集化更有效。
To be tested with my real data...用我的真实数据进行测试...
EDIT 3 - trial with real data, including testing the dense matrix method as suggested by GKi编辑 3 - 使用真实数据进行试验,包括测试 GKi 建议的密集矩阵方法
Real data (too large to paste here): m
is a sparse 5828 x 5828 matrix;真实数据(此处太大,无法粘贴):
m
是一个稀疏的 5828 x 5828 矩阵; 302986 / 33965584 = 0.9% of it is filled (hence its sparseness). 302986 / 33965584 = 0.9% 被填充(因此稀疏)。 It occupies 4.4 MB.
它占用 4.4 MB。 The corresponding dense matrix
dm = as.matrix(m)
occupies 272.5 MB.对应的密集矩阵
dm = as.matrix(m)
占用 272.5 MB。
Testing the sparseMatrix
(1), data.frame
(2), data.table
(3) and dense matrix (4) updating methods shows the following:测试
sparseMatrix
(1)、 data.frame
(2)、 data.table
(3)和dense matrix(4)更新方法显示如下:
time_1 = 10.25 3.19 13.72 NA NA
time_2 = 41.32 10.94 52.52 NA NA
time_3 = 35.64 7.44 43.34 NA NA
time_4 = 0.05 0.03 0.08 NA NA
So, in agreement with GKi's results, the dense matrix method is by far the fastest, but at the cost of huge memory storage.因此,与 GKi 的结果一致,密集矩阵方法是迄今为止最快的,但代价是巨大的 memory 存储。
On the other hand, the simulated data used originally gave very different results for the sparseMatrix
method, which with real data is instead the second fastest among these 4.另一方面,最初使用的模拟数据对
sparseMatrix
方法给出了非常不同的结果,而在这 4 种方法中,实际数据却是第二快的。
Unfortunately it looks like a catch-22 situation: to have fast editing I need to use a dense matrix, but a dense matrix takes far too much memory, so I need to use a sparse matrix, which however is slow to edit:(不幸的是,它看起来像一个 catch-22 情况:要进行快速编辑,我需要使用密集矩阵,但是密集矩阵需要太多 memory,所以我需要使用稀疏矩阵,但是编辑起来很慢:(
Maybe I will need to reconsider the original suggestion by pseudospin, and use sparse vectors for each row of the matrix.也许我需要重新考虑 pseudospin 的原始建议,并为矩阵的每一行使用稀疏向量。 For that I will need to find out how I can refer to a stored
R
object by an indirect reference (string).为此,我需要了解如何通过间接引用(字符串)来引用存储的
R
object。
A comparison of different methods.不同方法的比较。 Note that the methods DataFrame , DataTable and FastMatch will work only in case you are overwriting existing values in the sparse matrix but not when you insert new values.
请注意,方法DataFrame 、 DataTable和FastMatch仅在您覆盖稀疏矩阵中的现有值时有效,但在插入新值时无效。
library(microbenchmark)
microbenchmark(list = fun, control=list(order="block"))
#Unit: microseconds
# expr min lq mean median uq max neval cld
# SparseMatrix 618.307 623.1795 639.38102 627.2525 643.3895 1021.301 100 e
# DenseMatrix 1.178 1.2210 1.36957 1.2635 1.3435 7.060 100 a
# Slam 259.703 264.3945 270.57151 265.9780 268.0745 426.610 100 c
# Spray 422.129 427.1310 463.21071 434.0705 440.6025 2880.787 100 d
# DataFrame 37.031 37.7910 38.98143 38.1660 38.6255 73.283 100 b
# DataFrameB 16.928 17.4480 17.85553 17.6910 18.0155 28.859 100 ab
# DataFrameC 21.007 21.7170 22.68689 21.9600 22.3735 38.175 100 ab
# DataTable 283.409 288.5710 299.43498 292.8395 301.1255 500.484 100 c
# FastMatch 40.138 40.7885 42.33623 41.2165 41.6575 82.274 100 b
# List 2.703 2.7900 3.28163 2.8535 2.9770 13.375 100 a
# Environment 2.157 2.2055 2.29915 2.2575 2.3340 4.211 100 a
library(bench)
mark(exprs = fun, check = FALSE)
## A tibble: 11 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
# 1 SparseMatrix 624.43µs 651.98µs 1487. 1.65KB 10.4 712 5
# 2 DenseMatrix 1.61µs 1.86µs 490003. 0B 49.0 9999 1
# 3 Slam 261.46µs 271.96µs 3606. 7.2KB 8.24 1750 4
# 4 Spray 424.08µs 439.86µs 2206. 8.16KB 10.4 1063 5
# 5 DataFrame 37.26µs 40.03µs 24378. 2.08KB 9.75 9996 4
# 6 DataFrameB 18.25µs 19.72µs 49447. 1.7KB 9.89 9998 2
# 7 DataFrameC 22.72µs 24.82µs 39142. 840B 11.7 9997 3
# 8 DataTable 288.24µs 300.25µs 3252. 18.34KB 8.24 1579 4
# 9 FastMatch 41.05µs 43.73µs 22292. 2.46KB 11.2 9995 5
#10 List 3.4µs 3.71µs 257225. 0B 25.7 9999 1
#11 Environment 2.82µs 3.11µs 306445. 0B 0 10000 0
Methods:方法:
fun <- alist(
SparseMatrix = m[7,c(1,5,6,7)] <- c(0,0,1,0)
, DenseMatrix = dm[7,c(1,5,6,7)] <- c(0,0,1,0)
, Slam = slm[7,c(1,5,6,7)] <- c(0,0,1,0)
, Spray = spm[7,c(1,5,6,7)] <- c(0,0,1,0)
, DataFrame = ms[(ms$j == 7) & (ms$i %in% c(1,5,6,7)),"x"] <- c(0,0,1,0)
, DataFrameB = ms$x[(ms$j == 7) & (ms$i %in% c(1,5,6,7))] <- c(0,0,1,0)
, DataFrameC = {i <- which(ms$j == 7); ms$x[i[ms$i[i] %in% c(1,5,6,7)]] <- c(0,0,1,0)}
, DataTable = msdt[(j == 7) & (i %in% c(1,5,6,7)),"x" := c(0,0,1,0)]
, FastMatch = mf[mf$j %fin% 7 & (mf$i %fin% c(1,5,6,7)),"x"] <- c(0,0,1,0)
, List = ml[["7"]][c("1","5","6","7")] <- c(0,0,1,0)
, Environment = me[["7"]][c("1","5","6","7")] <- c(0,0,1,0)
)
Data:数据:
library(Matrix)
m <- new("dgCMatrix", i = c(0L, 1L, 2L, 6L, 8L, 0L, 1L, 2L, 5L, 6L,
7L, 0L, 1L, 2L, 7L, 3L, 4L, 3L, 4L, 5L, 6L, 1L, 4L, 5L, 6L, 0L,
1L, 4L, 5L, 6L, 8L, 10L, 1L, 2L, 7L, 0L, 6L, 8L, 9L, 10L, 6L,
9L, 10L), p = c(0L, 5L, 11L, 15L, 17L, 21L, 25L, 32L, 35L, 38L,
40L, 43L), Dim = c(11L, 11L), Dimnames = list(c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11"), c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11")), x = c(2, 1, 1, 1, 1, 1,
3, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1,
1, 2, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2), factors = list())
dm <- as.matrix(m) #Dense Matrix
library(slam)
slm <- as.simple_sparse_array(dm)
library(spray)
spm <- as.spray(dm)
ms <- summary(m)
ms <- ms[order(ms$i,ms$j),]
library(data.table)
msdt <- data.table(ms)
library(fastmatch)
mf <- ms
ml <- split(setNames(ms$x, ms$j), ms$i)
me <- list2env(ml, hash = TRUE)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.