
Is there any sparse support for dist function in R?

Has anyone heard of a package or function that works like `dist` from the `stats` package in R, which computes the

distance matrix between the rows of a data matrix using a specified distance measure,

but takes a sparse matrix as input?

My data.frame (named `dataCluster`) has dims 7000 x 10000 and is almost 99% sparse. In regular dense form, this call never seems to finish:

h1 <- hclust( dist( dataCluster ) , method = "complete" )

Similar question without an answer: Sparse Matrix as input to Hierarchical clustering in R

You want `wordspace::dist.matrix`.

It accepts sparse matrices from the `Matrix` package (which isn't clear from the documentation) and can also compute cross-distances, return either `Matrix` or `dist` objects, and more.

The default distance measure is `'cosine'`, though, so be sure to specify `method = 'euclidean'` if that's what you want.
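For example, a minimal sketch (assuming the `wordspace` and `Matrix` packages are installed; `as.dist = TRUE` asks `dist.matrix` to return a `dist` object that can go straight into `hclust`):

```r
library(Matrix)
library(wordspace)

set.seed(1)
# ~99% sparse toy input, shaped like the question's data (rows = observations)
m <- rsparsematrix(700, 1000, density = 0.01)

# Euclidean distances between the rows, returned as a dist object
d  <- dist.matrix(m, method = "euclidean", byrow = TRUE, as.dist = TRUE)
h1 <- hclust(d, method = "complete")
```

A dense data.frame can be converted first with `as(as.matrix(dataCluster), "CsparseMatrix")`.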

**Update:** You can in fact do what qlcMatrix does quite easily:

library(Matrix)

sparse.cos <- function(x, y = NULL, drop = TRUE){
    if(!is.null(y)){
        if(!inherits(x, "dgCMatrix") || !inherits(y, "dgCMatrix")) stop("x and y must be dgCMatrix")
        if(drop) colnames(x) <- rownames(x) <- colnames(y) <- rownames(y) <- NULL
        crossprod(
            tcrossprod(
                x,
                Diagonal(x = as.vector(crossprod(x ^ 2, rep(1, nrow(x)))) ^ -0.5)
            ),
            tcrossprod(
                y,
                Diagonal(x = as.vector(crossprod(y ^ 2, rep(1, nrow(y)))) ^ -0.5)
            )
        )
    } else {
        if(!inherits(x, "dgCMatrix")) stop("x must be a dgCMatrix")
        if(drop) colnames(x) <- rownames(x) <- NULL
        crossprod(
            tcrossprod(
                x,
                Diagonal(x = as.vector(crossprod(x ^ 2, rep(1, nrow(x)))) ^ -0.5)
            )
        )
    }
}
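As a quick sanity check, `sparse.cos` (as defined above) should agree with a dense column-wise cosine computation:

```r
library(Matrix)

set.seed(42)
m <- rsparsematrix(200, 50, density = 0.1)  # 90% sparse test input

s1 <- as.matrix(sparse.cos(m))  # sparse computation of column similarities

# dense reference: normalize columns to unit length, then crossprod
dm <- as.matrix(m)
dn <- sweep(dm, 2, sqrt(colSums(dm ^ 2)), "/")
s2 <- crossprod(dn)

max(abs(s1 - s2))  # should be near zero (floating-point noise)
```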

I can find no significant difference in performance between the above and `qlcMatrix::cosSparse`.


`qlcMatrix::cosSparse` is faster than `wordspace::dist.matrix` when the data is >50% sparse or when the similarity is calculated along the longest edge of the input matrix (i.e. tall format).

Performance of `wordspace::dist.matrix` vs. `qlcMatrix::cosSparse` on a wide matrix (1000 x 5000) of varying sparsity (10%, 50%, 90%, or 99% sparse), computing a 1000 x 1000 similarity:

library(Matrix)      # rsparsematrix
library(qlcMatrix)   # cosSparse
library(wordspace)   # dist.matrix
library(rbenchmark)  # benchmark

# M10 is 10% sparse, M99 is 99% sparse
set.seed(123)
M10 <- rsparsematrix(5000, 1000, density = 0.9)
M50 <- rsparsematrix(5000, 1000, density = 0.5)
M90 <- rsparsematrix(5000, 1000, density = 0.1)
M99 <- rsparsematrix(5000, 1000, density = 0.01)
tM10 <- t(M10)
tM50 <- t(M50)
tM90 <- t(M90)
tM99 <- t(M99)
benchmark(
 "cosSparse: 10% sparse" = cosSparse(M10),
 "cosSparse: 50% sparse" = cosSparse(M50),
 "cosSparse: 90% sparse" = cosSparse(M90),
 "cosSparse: 99% sparse" = cosSparse(M99),
 "wordspace: 10% sparse" = dist.matrix(tM10, byrow = TRUE),
 "wordspace: 50% sparse" = dist.matrix(tM50, byrow = TRUE),
 "wordspace: 90% sparse" = dist.matrix(tM90, byrow = TRUE),
 "wordspace: 99% sparse" = dist.matrix(tM99, byrow = TRUE),
 replications = 2, columns = c("test", "elapsed", "relative"))

The two functions are quite comparable, with wordspace taking a slight lead at lower sparsity, but definitely not at high sparsity:

                   test elapsed relative
1 cosSparse: 10% sparse   15.83  527.667
2 cosSparse: 50% sparse    4.72  157.333
3 cosSparse: 90% sparse    0.31   10.333
4 cosSparse: 99% sparse    0.03    1.000
5 wordspace: 10% sparse   15.23  507.667
6 wordspace: 50% sparse    4.28  142.667
7 wordspace: 90% sparse    0.36   12.000
8 wordspace: 99% sparse    0.09    3.000

If we flip the calculation around to compute a 5000 x 5000 matrix instead:

benchmark(
 "cosSparse: 50% sparse" = cosSparse(tM50),
 "cosSparse: 90% sparse" = cosSparse(tM90),
 "cosSparse: 99% sparse" = cosSparse(tM99),
 "wordspace: 50% sparse" = dist.matrix(M50, byrow = TRUE),
 "wordspace: 90% sparse" = dist.matrix(M90, byrow = TRUE),
 "wordspace: 99% sparse" = dist.matrix(M99, byrow = TRUE),
 replications = 1, columns = c("test", "elapsed", "relative"))

Now the competitive advantage of cosSparse becomes very clear:

                   test elapsed relative
1 cosSparse: 50% sparse   10.58  151.143
2 cosSparse: 90% sparse    1.44   20.571
3 cosSparse: 99% sparse    0.07    1.000
4 wordspace: 50% sparse   11.41  163.000
5 wordspace: 90% sparse    2.39   34.143
6 wordspace: 99% sparse    0.64    9.143

The change in efficiency is not very dramatic at 50% sparsity, but at 90% sparsity wordspace is 1.6x slower, and at 99% sparsity it's nearly 10x slower!

Compare this performance to a square matrix:将此性能与方阵进行比较:

M50.square <- rsparsematrix(1000, 1000, density = 0.5)
tM50.square <- t(M50.square)
M90.square <- rsparsematrix(1000, 1000, density = 0.1)
tM90.square <- t(M90.square)

benchmark(
 "cosSparse: square, 50% sparse" = cosSparse(M50.square),
 "wordspace: square, 50% sparse" = dist.matrix(tM50.square, byrow = TRUE),
 "cosSparse: square, 90% sparse" = cosSparse(M90.square),
 "wordspace: square, 90% sparse" = dist.matrix(tM90.square, byrow = TRUE),
 replications = 5, columns = c("test", "elapsed", "relative"))

cosSparse is marginally faster at 50% sparsity, and almost twice as fast at 90% sparsity!

                           test elapsed relative
1 cosSparse: square, 50% sparse    2.12    9.217
3 cosSparse: square, 90% sparse    0.23    1.000
2 wordspace: square, 50% sparse    2.15    9.348
4 wordspace: square, 90% sparse    0.40    1.739

Note that `wordspace::dist.matrix` has more edge-case checks than `qlcMatrix::cosSparse` and also supports parallelization through OpenMP in R. In addition, `wordspace::dist.matrix` supports Euclidean and Jaccard distance measures, although these are far slower. There are a number of other handy features built into that package.

That said, if you only need cosine similarity, your matrix is >50% sparse, and you're computing along the tall edge, `cosSparse` should be the tool of choice.
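To close the loop with the original question: a cosine similarity matrix can be turned into a dissimilarity for `hclust` via `as.dist` (using `1 - similarity` here, which is one common choice, not the only one):

```r
library(Matrix)
library(qlcMatrix)

set.seed(123)
m <- rsparsematrix(500, 100, density = 0.1)  # 90% sparse

sim <- cosSparse(m)                  # 100 x 100 column similarities
d   <- as.dist(1 - as.matrix(sim))   # cosine dissimilarity
h   <- hclust(d, method = "complete")
```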
