What is the most efficient way to count the intersections between two sets (Java)?
I am comparing two text files at a time (many in total), and I want to determine how similar they are. To do so, I have created small, overlapping groups of text from each file. I now want to count how many of the groups from one file also occur in the other file.
I would prefer to use only Java 8 with no external libraries.
These are my two fastest methods. The first contains a bunch of logic which allows it to stop early if the threshold can no longer be met with the remaining elements (this saves a bit of time in total, but of course executing the extra logic also takes time). The second is slower. It does not have those optimizations, actually determines the intersection rather than merely counting it, and uses a stream, which is quite new to me.
I have an integer threshold and dblThreshold (the same value cast to a double), which are the minimum percentage of the smaller file which must be shared to be of interest. Also, from my limited testing, it seems that writing out the logic for both cases (either set being the larger one) is faster than calling the method again with reversed arguments.
// threshold and dblThreshold are fields of the enclosing class (not shown).
public int numberShared(Set<String> sOne, Set<String> sTwo) {
    int numFound = 0;
    if (sOne.size() > sTwo.size()) {
        // sTwo is the smaller set: iterate it and probe sOne.
        int smallSize = sTwo.size();
        int left = smallSize;
        for (String item : sTwo) {
            // Stop once the remaining elements cannot reach the threshold.
            if (numFound < threshold && ((double) numFound + left < dblThreshold * smallSize)) {
                break;
            }
            if (sOne.contains(item)) {
                numFound++;
            }
            left--;
        }
    } else {
        // sOne is the smaller (or equal-sized) set: iterate it and probe sTwo.
        int smallSize = sOne.size();
        int left = smallSize;
        for (String item : sOne) {
            if (numFound < threshold && ((double) numFound + left < dblThreshold * smallSize)) {
                break;
            }
            if (sTwo.contains(item)) {
                numFound++;
            }
            left--;
        }
    }
    return numFound;
}
Second method:
public int numberShared(Set<String> sOne, Set<String> sTwo) {
    if (sOne.size() < sTwo.size()) {
        long numFound = sOne.parallelStream()
                .filter(segment -> sTwo.contains(segment))
                .collect(Collectors.counting());
        return (int) numFound;
    } else {
        long numFound = sTwo.parallelStream()
                .filter(segment -> sOne.contains(segment))
                .collect(Collectors.counting());
        return (int) numFound;
    }
}
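For comparison, here is a minimal sketch of the same count with the smaller and larger set chosen once through local variables, so the body is not duplicated per branch. The class name SharedCount is hypothetical, the stream is sequential rather than parallel, and whether this is faster than the versions above is untested and would need measuring on real data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SharedCount {
    // Counts the elements of the smaller set that also occur in the larger one.
    static int numberShared(Set<String> sOne, Set<String> sTwo) {
        // Pick the iteration side once instead of writing both branches.
        Set<String> smaller = sOne.size() <= sTwo.size() ? sOne : sTwo;
        Set<String> larger = (smaller == sOne) ? sTwo : sOne;
        // Stream.count() returns a long, like Collectors.counting().
        long numFound = smaller.stream().filter(larger::contains).count();
        return (int) numFound;
    }

    public static void main(String[] args) {
        Set<String> one = new HashSet<>(Arrays.asList("1", "2", "4", "9"));
        Set<String> two = new HashSet<>(Arrays.asList("2", "3", "5", "7", "9"));
        System.out.println("Number shared: " + numberShared(one, two));
    }
}
```

Iterating the smaller set keeps the number of contains() lookups at the minimum of the two sizes, which is the same idea both methods above rely on.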
Any suggestions for improving upon these methods, or novel ideas and approaches to the problem are much appreciated!
Edit: I just realized that the first part of my threshold check (which seeks to eliminate, in some cases, the need for the second check with doubles) is incorrect. I will revise it as soon as possible.
If I understand you correctly, you have already determined which methods are fastest, but aren't sure how to implement your threshold check when using Java 8 streams. Here's one way you could do that - though please note that it's hard for me to do much testing without having proper data and knowing what thresholds you're interested in, so take this simplified test case with a grain of salt (and adjust as necessary).
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Sets {
    private static final int NOT_ENOUGH_MATCHES = -1;
    private static final String[] arrayOne = { "1", "2", "4", "9" };
    private static final String[] arrayTwo = { "2", "3", "5", "7", "9" };
    private static final Set<String> setOne = new HashSet<>();
    private static final Set<String> setTwo = new HashSet<>();

    public static void main(String[] ignoredArguments) {
        setOne.addAll(Arrays.asList(arrayOne));
        setTwo.addAll(Arrays.asList(arrayTwo));
        // Always pass the smaller set first.
        boolean isFirstSmaller = setOne.size() < setTwo.size();
        System.out.println("Number shared: " + (isFirstSmaller ?
                numberShared(setOne, setTwo) : numberShared(setTwo, setOne)));
    }

    private static long numberShared(Set<String> smallerSet, Set<String> largerSet) {
        SimpleBag bag = new SimpleBag(3, 0.5d, largerSet, smallerSet.size());
        try {
            smallerSet.forEach(eachItem -> bag.add(eachItem));
            return bag.duplicateCount;
        } catch (IllegalStateException exception) {
            // Thrown by SimpleBag.add() once the threshold is out of reach.
            return NOT_ENOUGH_MATCHES;
        }
    }

    public static class SimpleBag {
        private Map<String, Boolean> items;
        private int threshold;
        private double fraction;
        protected int duplicateCount = 0;
        private int smallerSize;
        private int numberLeft;

        public SimpleBag(int aThreshold, double aFraction, Set<String> someStrings,
                int otherSetSize) {
            threshold = aThreshold;
            fraction = aFraction;
            items = new HashMap<>();
            // Start with every element of the larger set mapped to false.
            someStrings.forEach(eachString -> items.put(eachString, false));
            smallerSize = otherSetSize;
            numberLeft = otherSetSize;
        }

        public void add(String aString) {
            Boolean value = items.get(aString);
            boolean alreadyExists = value != null;
            if (alreadyExists) {
                duplicateCount++;
            }
            // Duplicates are flagged true; new items are stored as false.
            items.put(aString, alreadyExists);
            numberLeft--;
            if (cannotMeetThreshold()) {
                throw new IllegalStateException("Can't meet threshold; stopping at "
                        + duplicateCount + " duplicates");
            }
        }

        public boolean cannotMeetThreshold() {
            return duplicateCount < threshold
                    && (duplicateCount + numberLeft < fraction * smallerSize);
        }
    }
}
So I've made a simplified "Bag-like" implementation that starts with the contents of the larger set mapped as keys to false values (since we know there's only one of each). Then we iterate over the smaller set, adding each item to the bag and, if it's a duplicate, switching the value to true and keeping track of the duplicate count (I initially did a .count() at the end of .stream().allMatch(), but this'll suffice for your special case). After adding each item, we check whether we can no longer meet the threshold, in which case we throw an exception (arguably not the prettiest way to exit the .forEach(), but in this case it is an illegal state of sorts). Finally, we return the duplicate count, or -1 if we encountered the exception. In my little test, change 0.5d to 0.51d to see the difference.
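If throwing an exception to leave the iteration feels awkward, a plain for-each loop can return early instead. The following is only a sketch under assumptions: the class EarlyExitCount and the parameter required (the minimum number of matches of interest, which the caller derives from the fraction) are hypothetical names, and the stop condition is simplified to a single integer comparison rather than the two-part check used above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class EarlyExitCount {
    static final int NOT_ENOUGH_MATCHES = -1;

    // Counts elements of the smaller set that also occur in the larger one.
    // Gives up as soon as the matches found so far plus the elements not yet
    // visited can no longer reach `required` (a hypothetical parameter).
    static int countShared(Set<String> a, Set<String> b, int required) {
        Set<String> smaller = a.size() <= b.size() ? a : b;
        Set<String> larger = (smaller == a) ? b : a;
        int found = 0;
        int left = smaller.size();
        for (String item : smaller) {
            if (found + left < required) {
                return NOT_ENOUGH_MATCHES; // early exit, no exception needed
            }
            if (larger.contains(item)) {
                found++;
            }
            left--;
        }
        return found >= required ? found : NOT_ENOUGH_MATCHES;
    }

    public static void main(String[] args) {
        Set<String> one = new HashSet<>(Arrays.asList("1", "2", "4", "9"));
        Set<String> two = new HashSet<>(Arrays.asList("2", "3", "5", "7", "9"));
        // Assumption: required = ceil(fraction * size of the smaller set).
        int required = (int) Math.ceil(0.5 * Math.min(one.size(), two.size()));
        System.out.println("Number shared: " + countShared(one, two, required));
    }
}
```

Computing required once up front keeps the loop free of floating-point arithmetic; with the sample sets, raising required from 2 to 3 makes the method return -1.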