Is there any sparse support for dist function in R?
Has anyone heard of a package or function that works like the dist {stats} function in R, which computes a distance matrix between the rows of a data matrix using a specified distance measure, but takes a sparse matrix as input?
My data.frame (named dataCluster) has dimensions 7000 x 10000 and is almost 99% sparse. In its regular, non-sparse form, the following call never seems to finish:
h1 <- hclust( dist( dataCluster ) , method = "complete" )
Similar question without an answer: Sparse Matrix as input to Hierarchical clustering in R
You want wordspace::dist.matrix.
It accepts sparse matrices from the Matrix package (which isn't clear from the documentation) and can also compute cross distances, output both Matrix and dist objects, and more.
The default distance measure is 'cosine', though, so be sure to specify method = 'euclidean' if you want that.
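A minimal sketch of how this could be applied to the clustering call from the question, assuming the Matrix and wordspace packages are installed. The random matrix X is a hypothetical stand-in for dataCluster, and as.dist = TRUE asks dist.matrix for a dist object that hclust can consume directly.

```r
library(Matrix)
library(wordspace)

set.seed(1)
# Hypothetical stand-in for dataCluster: 200 rows, 500 columns, ~99% sparse,
# with non-negative entries (an assumption made for this illustration)
X <- rsparsematrix(200, 500, density = 0.01, rand.x = runif)

# Euclidean distances between the rows of X, returned as a "dist" object
d <- dist.matrix(X, method = "euclidean", byrow = TRUE, as.dist = TRUE)

# Same clustering call as in the question, now fed from the sparse input
h1 <- hclust(d, method = "complete")
```

A data.frame would first need to be converted to a sparse matrix, for example with Matrix(as.matrix(dataCluster), sparse = TRUE), assuming all of its columns are numeric.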
**Update:** You can in fact do what qlcMatrix does quite easily:
library(Matrix)

# Cosine similarity between the columns of x, or between the columns of x
# and the columns of y when y is supplied. Inputs must be dgCMatrix.
sparse.cos <- function(x, y = NULL, drop = TRUE) {
  if (!is.null(y)) {
    if (!inherits(x, "dgCMatrix") || !inherits(y, "dgCMatrix")) stop("x and y must both be dgCMatrix")
    # optionally strip dimnames from both inputs
    if (drop) colnames(x) <- rownames(x) <- colnames(y) <- rownames(y) <- NULL
    crossprod(
      # scale each column of x to unit Euclidean length
      tcrossprod(
        x,
        Diagonal(x = as.vector(crossprod(x ^ 2, rep(1, nrow(x)))) ^ -0.5)
      ),
      # scale each column of y to unit Euclidean length
      tcrossprod(
        y,
        Diagonal(x = as.vector(crossprod(y ^ 2, rep(1, nrow(y)))) ^ -0.5)
      )
    )
  } else {
    if (!inherits(x, "dgCMatrix")) stop("x must be dgCMatrix")
    if (drop) colnames(x) <- rownames(x) <- NULL
    crossprod(
      tcrossprod(
        x,
        Diagonal(x = as.vector(crossprod(x ^ 2, rep(1, nrow(x)))) ^ -0.5)
      )
    )
  }
}
I can find no significant difference in performance between the above and qlcMatrix::cosSparse.
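To make the linear algebra concrete: the approach above scales every column to unit Euclidean length and then takes the cross-product, so entry (i, j) is the cosine of the angle between columns i and j. The sketch below is an illustration only, not the code of either package. It uses a hypothetical random test matrix and checks the sparse route against a dense computation written straight from the definition.

```r
library(Matrix)

set.seed(42)
# Hypothetical test matrix: 100 rows, 20 columns, 30% of entries non-zero
x <- rsparsematrix(100, 20, density = 0.3)

# Sparse route: divide each column by its Euclidean norm, then t(xn) %*% xn
col_norms <- sqrt(colSums(x ^ 2))
xn <- x %*% Diagonal(x = 1 / col_norms)
S_sparse <- crossprod(xn)

# Dense reference, computed entry by entry from the cosine definition
xd <- as.matrix(x)
S_dense <- matrix(0, ncol(xd), ncol(xd))
for (i in seq_len(ncol(xd))) {
  for (j in seq_len(ncol(xd))) {
    S_dense[i, j] <- sum(xd[, i] * xd[, j]) /
      (sqrt(sum(xd[, i] ^ 2)) * sqrt(sum(xd[, j] ^ 2)))
  }
}

# The two results should agree up to floating-point error
stopifnot(isTRUE(all.equal(as.matrix(S_sparse), S_dense, check.attributes = FALSE)))
```

The sparse route never materialises a dense copy of x, which is where the savings on highly sparse input come from.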
qlcMatrix::cosSparse is faster than wordspace::dist.matrix when the data is >50% sparse or when the similarity is calculated along the longest edge of the input matrix (i.e. tall format).
Performance of wordspace::dist.matrix vs. qlcMatrix::cosSparse on a wide matrix (1000 x 5000) of varying sparsity (10%, 50%, 90%, or 99% sparse) to calculate a 1000 x 1000 similarity:
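library(Matrix)      # rsparsematrix
library(qlcMatrix)   # cosSparse
library(wordspace)   # dist.matrix
library(rbenchmark)  # benchmark
# M10 is labelled 10% sparse, M99 is 99% sparse
# (note: density = 1 actually makes M10 fully dense; density = 0.9 would give 10% sparsity)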
set.seed(123)
M10 <- rsparsematrix(5000, 1000, density = 1)
M50 <- rsparsematrix(5000, 1000, density = 0.5)
M90 <- rsparsematrix(5000, 1000, density = 0.1)
M99 <- rsparsematrix(5000, 1000, density = 0.01)
tM10 <- t(M10)
tM50 <- t(M50)
tM90 <- t(M90)
tM99 <- t(M99)
benchmark(
"cosSparse: 10% sparse" = cosSparse(M10),
"cosSparse: 50% sparse" = cosSparse(M50),
"cosSparse: 90% sparse" = cosSparse(M90),
"cosSparse: 99% sparse" = cosSparse(M99),
"wordspace: 10% sparse" = dist.matrix(tM10, byrow = TRUE),
"wordspace: 50% sparse" = dist.matrix(tM50, byrow = TRUE),
"wordspace: 90% sparse" = dist.matrix(tM90, byrow = TRUE),
"wordspace: 99% sparse" = dist.matrix(tM99, byrow = TRUE),
replications = 2, columns = c("test", "elapsed", "relative"))
The two functions are quite comparable, with wordspace taking a slight lead at lower sparsity, but definitely not at high sparsity:
                   test elapsed relative
1 cosSparse: 10% sparse   15.83  527.667
2 cosSparse: 50% sparse    4.72  157.333
3 cosSparse: 90% sparse    0.31   10.333
4 cosSparse: 99% sparse    0.03    1.000
5 wordspace: 10% sparse   15.23  507.667
6 wordspace: 50% sparse    4.28  142.667
7 wordspace: 90% sparse    0.36   12.000
8 wordspace: 99% sparse    0.09    3.000
If we flip the calculation around to compute a 5000 x 5000 matrix, then:
benchmark(
"cosSparse: 50% sparse" = cosSparse(tM50),
"cosSparse: 90% sparse" = cosSparse(tM90),
"cosSparse: 99% sparse" = cosSparse(tM99),
"wordspace: 50% sparse" = dist.matrix(M50, byrow = TRUE),
"wordspace: 90% sparse" = dist.matrix(M90, byrow = TRUE),
"wordspace: 99% sparse" = dist.matrix(M99, byrow = TRUE),
replications = 1, columns = c("test", "elapsed", "relative"))
Now the competitive advantage of cosSparse becomes very clear:
                   test elapsed relative
1 cosSparse: 50% sparse   10.58  151.143
2 cosSparse: 90% sparse    1.44   20.571
3 cosSparse: 99% sparse    0.07    1.000
4 wordspace: 50% sparse   11.41  163.000
5 wordspace: 90% sparse    2.39   34.143
6 wordspace: 99% sparse    0.64    9.143
The change in efficiency is not very dramatic at 50% sparsity, but at 90% sparsity, wordspace is 1.6x slower, and at 99% sparsity it's nearly 10x slower!
Compare this performance to a square matrix:
M50.square <- rsparsematrix(1000, 1000, density = 0.5)
tM50.square <- t(M50.square)
M90.square <- rsparsematrix(1000, 1000, density = 0.1)
tM90.square <- t(M90.square)
benchmark(
"cosSparse: square, 50% sparse" = cosSparse(M50.square),
"wordspace: square, 50% sparse" = dist.matrix(tM50.square, byrow = TRUE),
"cosSparse: square, 90% sparse" = cosSparse(M90.square),
"wordspace: square, 90% sparse" = dist.matrix(tM90.square, byrow = TRUE),
replications = 5, columns = c("test", "elapsed", "relative"))
cosSparse is marginally faster at 50% sparsity, and almost twice as fast at 90% sparsity!
                           test elapsed relative
1 cosSparse: square, 50% sparse    2.12    9.217
3 cosSparse: square, 90% sparse    0.23    1.000
2 wordspace: square, 50% sparse    2.15    9.348
4 wordspace: square, 90% sparse    0.40    1.739
Note that wordspace::dist.matrix has more edge-case checks than qlcMatrix::cosSparse and also permits parallelization through openmp in R. Also, wordspace::dist.matrix supports euclidean and jaccard distance measures, although these are far slower. There are a number of other handy features built into that package.
That said, if you only need cosine similarity, and your matrix is >50% sparse, and you're computing the tall way, cosSparse should be the tool of choice.
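For the clustering problem in the question, a cosine similarity matrix still has to be turned into a dissimilarity before hclust can use it. A minimal sketch, assuming qlcMatrix is installed and that 1 - cosine is an acceptable dissimilarity for the data. The random matrix X is a hypothetical stand-in for dataCluster.

```r
library(Matrix)
library(qlcMatrix)

set.seed(1)
# Hypothetical stand-in for dataCluster: rows are the items to cluster
X <- rsparsematrix(300, 1000, density = 0.05)

# cosSparse compares columns, so transpose to compare the rows of X
S <- cosSparse(t(X))

# Cosine dissimilarity; this step is dense (nrow(X)^2 entries)
D <- as.dist(1 - as.matrix(S))

h1 <- hclust(D, method = "complete")
```

The conversion to a dense matrix is the memory bottleneck here: with 7000 rows it holds 49 million entries, roughly 390 MB as doubles, before as.dist keeps only the lower triangle.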