Why is extracting a row in R slower on a large sparse matrix than splitting it into smaller pieces and then extracting?

Question

I am working with a 19089 x 9432 sparse matrix of class "dgCMatrix" (let's call it M), and I have to extract each row to perform some calculations on it. I do this with a loop, so at each iteration I do something like currentrow <- M[i,] and then run the body of the loop on it. The calculations are very time-consuming, so I want to optimize them as much as possible. I noticed that if I first split the matrix into small pieces (M[1:100,], M[101:200,], etc.) and loop over each of those smaller matrices (calling currentrow <- current_smallM[i,] at each iteration), the loop is much faster.

Here is a code example that reproduces this:

library(Matrix)

N = 10000
M = 5000

# Creation of the large matrix (of class dgCMatrix)
largeMatrix <- Matrix(rnorm(N*M,mean=0,sd=1), byrow = TRUE, nrow = N, sparse = TRUE)


# Time the creation of the smaller matrix plus the extraction of 100 of its rows, one row per iteration
start.time = Sys.time()
smallMatrix = largeMatrix[100:200,]
for (i in 1:100){
    test <- smallMatrix[i,]
}
end.time = Sys.time()
print(end.time - start.time) # 0.47 secs


# Same allocations but working on the large matrix
start.time = Sys.time()
for (i in 100:200){
    test <- largeMatrix[i,]
}
end.time = Sys.time()
print(end.time - start.time) # 18.44 secs

As you can see, the time difference is huge, so I am wondering:

Why is this happening?
Is there a more efficient way to store my data than splitting the matrix into smaller pieces?

Interestingly, I tested the same code with a matrix object (using largeMatrix <- matrix(rnorm(N*M,mean=0,sd=1), N, M)), and the results were totally different: 0.06 secs for the divided matrix and 0.04 secs for the large matrix. So I am wondering what is different about the sparse representation.

Note: I found a quite similar question here, but it was about a different language, and (I think) the solution does not apply here because that problem was caused by an implicit type conversion, whereas here I am just extracting a row.

Thank you for your help!

Answer 1

dgCMatrix is a compressed sparse column format. It has a column-pointer array (slot p in the Matrix package, often called indptr elsewhere) with X + 1 entries, where X is the number of columns in the matrix, and a row-index array (slot i) that identifies the row of each non-zero value, with N entries, where N is the number of non-zero values in the matrix. The values themselves sit in a third array (slot x), also with N entries.
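To make the layout concrete, here is a small sketch that builds a 3 x 4 dgCMatrix and prints its slots. The data is made up purely for illustration; only the slot names p, i and x come from the Matrix package.

```r
library(Matrix)

# Hypothetical 3 x 4 matrix with five stored values (column 3 is empty)
m <- sparseMatrix(i = c(1, 3, 2, 1, 3),
                  j = c(1, 1, 2, 4, 4),
                  x = c(10, 20, 30, 40, 50),
                  dims = c(3, 4))

class(m)  # "dgCMatrix"
m@p       # 0 2 3 3 5    column pointers, length ncol(m) + 1
m@i       # 0 2 1 0 2    0-based row index of each stored value
m@x       # 10 20 30 40 50   stored values, column by column
```

Column j occupies positions p[j] + 1 to p[j + 1] of i and x, which is why a single column can be located without looking at the rest of the matrix.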

What this means is that every time you slice it by row, it has to traverse the entire index array looking for entries that fall within the rows you want. Using smallMatrix = largeMatrix[100:200,] slices the matrix into a smaller one whose index array is much smaller and can therefore be traversed more quickly.
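The difference between the two access patterns can be sketched with two hand-written helpers. row_by_scan and col_by_pointer are hypothetical names used only here, and this mimics the description above rather than the actual code inside the Matrix package's `[` method.

```r
library(Matrix)

set.seed(1)
m <- rsparsematrix(200, 50, density = 0.1)  # random dgCMatrix for illustration

# Row r: every stored entry has to be checked against r.
row_by_scan <- function(m, r) {
  hits <- which(m@i == r - 1L)          # scans the whole row-index array
  cols <- findInterval(hits - 1L, m@p)  # column that owns each hit
  out <- numeric(ncol(m))
  out[cols] <- m@x[hits]
  out
}

# Column j: the pointers give the exact range, nothing else is touched.
col_by_pointer <- function(m, j) {
  n <- m@p[j + 1L] - m@p[j]             # number of stored values in column j
  idx <- m@p[j] + seq_len(n)
  out <- numeric(nrow(m))
  out[m@i[idx] + 1L] <- m@x[idx]
  out
}
```

The work in row_by_scan grows with the total number of stored values, whereas col_by_pointer only depends on how many values the requested column holds. Note also that the example matrix in the question is filled with rnorm values, so essentially all 50 million entries are stored.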

Your real problem is that you are trying to get rows out of a data structure that slices very inefficiently by row and very efficiently by column.
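One possible way around this, sketched under the assumption that a transposed copy fits in memory, is to transpose once and then pull columns instead of rows. The matrix below is a small random stand-in, and the sum() call is only a placeholder for the real loop body.

```r
library(Matrix)

set.seed(42)
M <- rsparsematrix(500, 200, density = 0.05)  # stand-in for the real matrix

tM <- t(M)  # one-off cost; the columns of tM are the rows of M

results <- numeric(nrow(M))
for (i in seq_len(nrow(M))) {
  currentrow <- tM[, i]            # same values as M[i, ]
  results[i] <- sum(currentrow)    # placeholder for the actual calculation
}
```

The transpose is paid for once, after which each iteration uses the cheap column access. Whether this beats chunking for a given matrix is something to measure rather than assume.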

Answer 1: score 4, accepted, 2021-07-12 17:28:15
