
Similarity between Sets containing Integers in Java or Groovy

I have two HashSet<Integer> instances, A and B, and I want to compare them to get a numeric value for how similar they are (e.g. 0.9 if 90% of A and B are the same). What is the best (fastest) way to do this in Java or Groovy?

My naive approach is to collect all elements common to A and B and divide their count by the original size of A. Is there any reason (e.g. speed) why this wouldn't work properly? Generally speaking, I would prefer an already implemented way to get the similarity.

Note: Comparing the set 1, 2 to the set 12 should give 0% similarity, because elements are compared as whole integers.

For two arbitrary HashSets of sizes M and N, the way to compute their similarity is to take the smaller one and check whether each of its elements is present in the larger one, which costs min(M, N) lookups of expected constant time each. The JDK has no ready-made method for this. If you're looking for the fastest solution, write your own:

int count = 0;
// Iterate over the smaller set and probe the larger one.
for (Integer element : smallSet) {
    if (bigSet.contains(element)) {
        count++;
    }
}
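
To turn that count into the ratio described in the question, a minimal sketch might look like the following. The class name OverlapRatio and the method ratioOfA are hypothetical, dividing by the size of A follows the question's own definition rather than anything in the JDK, and returning 0.0 for an empty A is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class OverlapRatio {
    // Hypothetical helper: fraction of a's elements that also occur in b.
    static double ratioOfA(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty()) {
            return 0.0; // assumption: an empty A is treated as having no overlap
        }
        // Iterate over the smaller set and probe the larger one.
        Set<Integer> small = a.size() <= b.size() ? a : b;
        Set<Integer> big = (small == a) ? b : a;
        int count = 0;
        for (Integer element : small) {
            if (big.contains(element)) {
                count++;
            }
        }
        return (double) count / a.size();
    }

    public static void main(String[] args) {
        Set<Integer> a = new HashSet<>(Arrays.asList(1, 2));
        Set<Integer> b = new HashSet<>(Arrays.asList(12));
        System.out.println(ratioOfA(a, b)); // 0.0, since 12 equals neither 1 nor 2
    }
}
```

Note that this measure is not symmetric: swapping A and B changes the divisor, so the result can differ.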

If you don't care much about performance and extra memory, you can use

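Set<Integer> common = new HashSet<>(smallSet); // copy, so smallSet is not modified
common.retainAll(bigSet); // returns a boolean, not the intersection size
int count = common.size();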

or the similar Sets#intersection(Set, Set) method from the Guava library, which returns a view of the intersection whose size() gives the count.
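
For comparison, the same count can be obtained with the JDK alone and without copying either set by filtering a stream. This is a sketch of an alternative that the answer above does not mention; the class name StreamIntersection and the method intersectionSize are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StreamIntersection {
    // Hypothetical helper: counts the elements of smallSet that bigSet contains.
    static long intersectionSize(Set<Integer> smallSet, Set<Integer> bigSet) {
        return smallSet.stream().filter(bigSet::contains).count();
    }

    public static void main(String[] args) {
        Set<Integer> small = new HashSet<>(Arrays.asList(2, 3));
        Set<Integer> big = new HashSet<>(Arrays.asList(1, 2, 3, 4));
        System.out.println(intersectionSize(small, big)); // 2
    }
}
```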

As Adam suggests, a loop is the most efficient way to find the size of the intersection:

public static int intersectionsCount(Set<?> set1, Set<?> set2) {
    // Always iterate over the smaller set and probe the larger one.
    if (set2.size() < set1.size()) return intersectionsCount(set2, set1);
    int count = 0;
    for (Object o : set1)
        if (set2.contains(o)) count++;
    return count;
}

public static double commonRatio(Set<?> set1, Set<?> set2) {
    int common = intersectionsCount(set1, set2);
    int union = set1.size() + set2.size() - common; // |A ∪ B| = |A| + |B| - |A ∩ B|
    return (double) common / union; // in [0.0, 1.0]; NaN if both sets are empty
}
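
The ratio computed by commonRatio is the size of the intersection divided by the size of the union, a measure commonly called the Jaccard index. Unlike dividing by the size of A, it is symmetric in the two sets. A small self-contained sketch, repeating the two methods in generic form under the hypothetical class name JaccardDemo, shows the values it gives for a partial overlap and for the 1, 2 versus 12 case from the question:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardDemo {
    // Size of the intersection, iterating over the smaller set.
    static int intersectionsCount(Set<?> set1, Set<?> set2) {
        if (set2.size() < set1.size()) return intersectionsCount(set2, set1);
        int count = 0;
        for (Object o : set1) {
            if (set2.contains(o)) count++;
        }
        return count;
    }

    // Intersection size divided by union size.
    static double commonRatio(Set<?> set1, Set<?> set2) {
        int common = intersectionsCount(set1, set2);
        int union = set1.size() + set2.size() - common;
        return (double) common / union;
    }

    public static void main(String[] args) {
        Set<Integer> a = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> b = new HashSet<>(Arrays.asList(2, 3, 4));
        System.out.println(commonRatio(a, b)); // 0.5 (2 common elements, 4 in the union)

        Set<Integer> c = new HashSet<>(Arrays.asList(1, 2));
        Set<Integer> d = new HashSet<>(Arrays.asList(12));
        System.out.println(commonRatio(c, d)); // 0.0
    }
}
```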
