
Remove from a HashSet failing after iterating over it

I'm writing an agglomerative clustering algorithm in Java and having trouble with a remove operation. It seems to always fail when the number of clusters reaches half the initial number.

In the sample code below, clusters is a Collection<Collection<Integer>>.

      while(clusters.size() > K){
           // determine smallest distance between clusters
           Collection<Integer> minclust1 = null;
           Collection<Integer> minclust2 = null;
           double mindist = Double.POSITIVE_INFINITY;

           for(Collection<Integer> cluster1 : clusters){
                for(Collection<Integer> cluster2 : clusters){
                     if( cluster1 != cluster2 && getDistance(cluster1, cluster2) < mindist){
                          minclust1 = cluster1;
                          minclust2 = cluster2;
                          mindist = getDistance(cluster1, cluster2);
                     }
                }
           }

           // merge the two clusters
           minclust1.addAll(minclust2);
           clusters.remove(minclust2);
      }

After a few runs through the loop, clusters.remove(minclust2) eventually returns false, but I don't understand why.

I tested this code by first creating 10 clusters, each with one integer from 1 to 10. Distances are random numbers between 0 and 1. Here's the output after adding a few println statements. After the number of clusters, I print out the actual clusters, the merge operation, and the result of clusters.remove(minclust2).

Clustering: 10 clusters
[[3], [1], [10], [5], [9], [7], [2], [4], [6], [8]]
[5] <- [6]
true
Clustering: 9 clusters
[[3], [1], [10], [5, 6], [9], [7], [2], [4], [8]]
[7] <- [8]
true
Clustering: 8 clusters
[[3], [1], [10], [5, 6], [9], [7, 8], [2], [4]]
[10] <- [9]
true
Clustering: 7 clusters
[[3], [1], [10, 9], [5, 6], [7, 8], [2], [4]]
[5, 6] <- [4]
true
Clustering: 6 clusters
[[3], [1], [10, 9], [5, 6, 4], [7, 8], [2]]
[3] <- [2]
true
Clustering: 5 clusters
[[3, 2], [1], [10, 9], [5, 6, 4], [7, 8]]
[10, 9] <- [5, 6, 4]
false
Clustering: 5 clusters
[[3, 2], [1], [10, 9, 5, 6, 4], [5, 6, 4], [7, 8]]
[10, 9, 5, 6, 4] <- [5, 6, 4]
false
Clustering: 5 clusters
[[3, 2], [1], [10, 9, 5, 6, 4, 5, 6, 4], [5, 6, 4], [7, 8]]
[10, 9, 5, 6, 4, 5, 6, 4] <- [5, 6, 4]
false

The [10, 9, 5, 6, 4, 5, 6, 4, ...] set just grows infinitely from there.

Edit: to clarify, I'm using a HashSet<Integer> for each cluster in clusters (a HashSet<HashSet<Integer>>).

Ah. When you alter a value that is already in a Set (or used as a Map key), it is no longer necessarily stored in the right position, because the hash code recorded when it was inserted is now stale. You need to remove it, alter it, and then re-insert it.
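To make the remove/alter/re-insert advice concrete, here is a minimal sketch. The class name MutatedElementDemo and the helper merge are made up for illustration and are not from the question; the sketch assumes the clusters are disjoint HashSet<Integer> instances held in a HashSet, as the question's edit describes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; names are illustrative, not from the original post.
class MutatedElementDemo {

    // Reproduces the failure: the element is changed while it is still in the outer set.
    static boolean removeAfterInPlaceMutation() {
        Set<Set<Integer>> clusters = new HashSet<>();
        Set<Integer> cluster = new HashSet<>(Arrays.asList(5, 6));
        clusters.add(cluster);          // stored under the hash of {5, 6}
        cluster.add(4);                 // hashCode() of the element changes
        return clusters.remove(cluster); // lookup uses the new hash and misses
    }

    // Merges source into target, keeping the outer HashSet consistent:
    // both clusters are taken out before target is altered, then target is re-inserted.
    static boolean merge(Set<Set<Integer>> clusters, Set<Integer> target, Set<Integer> source) {
        boolean removedSource = clusters.remove(source);
        clusters.remove(target);
        target.addAll(source);
        clusters.add(target);
        return removedSource;
    }

    public static void main(String[] args) {
        System.out.println(removeAfterInPlaceMutation()); // false

        Set<Set<Integer>> clusters = new HashSet<>();
        Set<Integer> a = new HashSet<>(Arrays.asList(5, 6));
        Set<Integer> b = new HashSet<>(Arrays.asList(4));
        Set<Integer> c = new HashSet<>(Arrays.asList(10, 9));
        clusters.addAll(Arrays.asList(a, b, c));

        System.out.println(merge(clusters, a, b)); // true
        System.out.println(merge(clusters, c, a)); // true, even though a was merged into earlier
    }
}
```

This also matches the output in the question: every remove that succeeded targeted a single-element cluster that had never been merged into, and the first failure was the first attempt to remove a cluster ([5, 6, 4]) that had been altered while inside the set.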

In the test shown, the remove fails the first time you try to remove a Collection containing more than one Integer. Is this always the case?

What is the concrete type of the Collection used?

The obvious problem is that clusters.remove is probably using equals to find the element to remove. Unfortunately, equals on collections generally compares whether the elements are the same, rather than whether it is the same collection (I believe C# makes a better choice in this regard).

An easy fix is to create clusters as Collections.newSetFromMap(new IdentityHashMap<Collection<Integer>, Boolean>()) (I think).
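As a rough sketch of that suggestion: a set backed by an IdentityHashMap locates elements by reference (== and System.identityHashCode) rather than by equals/hashCode, so altering a cluster while it is in the set does not affect lookup. The class name IdentityClusterSetDemo and its helper methods are made up for illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical demo class; names are illustrative, not from the original post.
class IdentityClusterSetDemo {

    // Builds an empty set that compares its elements by reference.
    static Set<Collection<Integer>> newIdentitySet() {
        return Collections.newSetFromMap(new IdentityHashMap<Collection<Integer>, Boolean>());
    }

    // Alters a cluster in place, then removes it; this works with an identity-based set.
    static boolean mutateThenRemove() {
        Set<Collection<Integer>> clusters = newIdentitySet();
        Collection<Integer> a = new HashSet<>(Arrays.asList(5, 6));
        Collection<Integer> b = new HashSet<>(Arrays.asList(4));
        clusters.add(a);
        clusters.add(b);

        a.addAll(b);                           // same merge order as in the question
        boolean removedB = clusters.remove(b);
        boolean removedA = clusters.remove(a); // a was altered in place, still found
        return removedA && removedB;
    }

    public static void main(String[] args) {
        System.out.println(mutateThenRemove()); // true
    }
}
```

One trade-off to keep in mind: with identity semantics, two distinct collections with equal contents are treated as different elements, so such a set no longer deduplicates clusters by content.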
