R efficient index matching elements, cluster evaluation

This may be a fairly esoteric question.

I'm trying to implement some of the ideas from Albatineh et al. (2006) (DOI: 10.1007/s00357-006-0017-z) for a spatial clustering algorithm. The basic idea is that one way to assess the stability of a clustering result is to examine how often pairs of observations end up in the same class. In a well-defined solution, pairs of observations should frequently end up in the same group.

The challenge is that in a large data set there are on the order of n^2 possible pairs (and most don't occur). We have structured our output as follows:

A  B  C  C  A
B  A  A  A  B
A  B  C  C  A

Here the column index is the observation ID and each row represents one run of the clustering algorithm. In this example there are 5 observations and the algorithm was run 3 times. The cluster labels A through C are essentially arbitrary between runs. I'd like an efficient way to calculate something like this:

ID1 ID2 
1    5
2   
3    4
4    3
5    1
1    2
2    3
2    4
...

This accomplishes my goal but is super slow, especially for a large data frame:

testData <- matrix(data=sample(x=c("A", "B", "C"), 15, replace=TRUE), nrow=3)

cluPr <- function(pr.obs){
    pairs <- data.frame()
    for (row in 1:dim(pr.obs)[1]){
        for (ob in 1:dim(pr.obs)[2]){
            ob.pairs <- which(pr.obs[row,] %in% pr.obs[row,ob], arr.ind=TRUE)
            pairs <- rbind(pairs, cbind(ob, ob.pairs))
        }

    }
    return(pairs)   
}

cluPr(testData)

Here's a relatively quick approach using the combn() function. I assumed that the name of your matrix was m.

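# Columns: ID1, ID2, number of runs in which the two observations share a label
results <- t(combn(ncol(m), 2, function(x) c(x[1], x[2], sum(m[, x[1]] == m[, x[2]]))))
# Keep only pairs that co-occur at least once (drop = FALSE keeps a matrix even for a single row)
results2 <- results[results[, 3] > 0, , drop = FALSE]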
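
As a rough sketch of the same count done without calling a function once per pair, the co-occurrence counts can be accumulated run by run with outer(). The helper name co_counts and the hard-coded matrix m (the example output from the question) are my own assumptions for illustration. Note that this builds a dense n x n matrix, so it is only reasonable while n^2 counts fit in memory.

```r
# The question's example output: 3 runs (rows) x 5 observations (columns)
m <- matrix(c("A", "B", "C", "C", "A",
              "B", "A", "A", "A", "B",
              "A", "B", "C", "C", "A"),
            nrow = 3, byrow = TRUE)

# co_counts is a hypothetical helper name, not part of the answer above
co_counts <- function(m) {
  n <- ncol(m)
  counts <- matrix(0L, n, n)
  for (i in seq_len(nrow(m))) {
    # outer(..., "==") is TRUE where two observations share a label in run i
    counts <- counts + outer(m[i, ], m[i, ], "==")
  }
  counts
}

counts <- co_counts(m)

# Long format: one row per unordered pair that co-occurs at least once
idx <- which(upper.tri(counts) & counts > 0, arr.ind = TRUE)
pairs <- data.frame(ID1 = idx[, 1], ID2 = idx[, 2], n = counts[idx])
pairs <- pairs[order(pairs$ID1, pairs$ID2), ]
```

For the example data, observations 1 and 5 share a cluster in all three runs, as do 3 and 4, while observation 2 joins 3 and 4 only in the second run.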

Try this:

clu.pairs <- function(k, row)
{
    w <- which(row==k)

    expand.grid(w, w)
}

row.pairs <- function(row)
{
    do.call(rbind, lapply(unique(row), function(k) clu.pairs(k, row)))
}

full.pairs <- function(data)
{
    do.call(rbind, lapply(seq_len(nrow(data)), function(i) row.pairs(data[i,])))
}

Then call full.pairs(testData). The result is not in the same order as yours, but it is equivalent.

My first implementation of the pair-counting metrics (not in R; my code is much faster in Java) used ordered generators and then computed the intersection in a merge-sort fashion. It still had a run-time on the order of O(n^2), but used much less memory.

However, you need to realize that you don't need to know the exact pairs. You only need the number of pairs in each intersection, and that can be computed directly from the intersection matrix, just as for most other similarity measures. It's substantially faster if you only need to compute the set intersection sizes; with hash tables, set intersection should be in O(n).
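
To make that concrete, here is a minimal sketch in R (my own illustration, not the Java code mentioned above). It takes the first two runs from the question's example, builds the intersection matrix with table(), and counts the pairs that are grouped together in both runs without listing them. The variable names are assumptions for illustration; the brute-force count is only there as a check.

```r
# First two runs from the question's example
run1 <- c("A", "B", "C", "C", "A")
run2 <- c("B", "A", "A", "A", "B")

# Intersection (contingency) matrix: entry [i, j] is the number of
# observations in cluster i of run1 and cluster j of run2
tab <- table(run1, run2)

# A cell holding k observations contributes choose(k, 2) pairs that are
# together in both runs, so no pair ever has to be enumerated
together_both <- sum(choose(as.vector(tab), 2))

# Brute-force check over all pairs (O(n^2); for verification only)
brute <- sum(combn(length(run1), 2, function(p) {
  run1[p[1]] == run1[p[2]] && run2[p[1]] == run2[p[2]]
}))
```

Building the table is a single pass over the observations, which is where the saving over enumerating pairs comes from.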

I don't have time to look it up, but we may have touched on this in the discussion of

Evaluation of Clusterings – Metrics and Visual Support
Elke Achtert, Sascha Goldhofer, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek
Data Engineering (ICDE), 2012 IEEE 28th International Conference on

where we demonstrated a visual tool to explore the pair-counting based measures, also for more than two clustering solutions (unfortunately, visual inspection mostly works for toy data sets, not for real data, which is usually too messy and high-dimensional).

Roughly speaking: try computing the values using the formulas on page 303 of the publication you cited, instead of first enumerating and then counting the pairs as described in the intuition/motivation!
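
As a hedged sketch of what that could look like in R: the four standard pair-counting quantities can be derived from the contingency table of two runs, and an index such as the Rand index follows from them. I am assuming these are the kind of quantities the cited formulas use; I have not checked them against the notation on page 303. The names pair_counts and rand_index are hypothetical, and m is the example output from the question.

```r
# The question's example output: 3 runs (rows) x 5 observations (columns)
m <- matrix(c("A", "B", "C", "C", "A",
              "B", "A", "A", "A", "B",
              "A", "B", "C", "C", "A"),
            nrow = 3, byrow = TRUE)

# pair_counts is a hypothetical helper: pair-counting quantities for two
# label vectors a and b, computed from their contingency table
pair_counts <- function(a, b) {
  tab <- table(a, b)
  total <- choose(length(a), 2)                    # all unordered pairs
  both <- sum(choose(as.vector(tab), 2))           # together in a and in b
  only_a <- sum(choose(rowSums(tab), 2)) - both    # together in a only
  only_b <- sum(choose(colSums(tab), 2)) - both    # together in b only
  c(both = both, only_a = only_a, only_b = only_b,
    neither = total - both - only_a - only_b)
}

# Rand index: share of pairs on which the two runs agree
rand_index <- function(a, b) {
  pc <- pair_counts(a, b)
  unname((pc["both"] + pc["neither"]) / sum(pc))
}

# Compare every pair of runs: (1,2), (1,3), (2,3)
run_pairs <- combn(nrow(m), 2)
rand <- apply(run_pairs, 2, function(p) rand_index(m[p[1], ], m[p[2], ]))
```

Because the labels only enter through the contingency table, it does not matter that A, B and C are arbitrary between runs.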
