MinMaxPriorityQueue using Java streams

I am looking for a memory-efficient way in Java to find the top n elements of a huge collection. For instance, I have a word, a distance() method, and a collection of "all" words. I have implemented a class Pair that implements compareTo() so that pairs are sorted by their values.
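The Pair class itself is not shown in the question. A minimal sketch of what it might look like, assuming it holds a key and a comparable value and orders instances by value only (the field names, accessors, and the PairDemo class are hypothetical):

```java
/** Hypothetical demo: comparing two pairs by their values. */
final class PairDemo {
    public static void main(String[] args) {
        Pair<String, Double> near = new Pair<>("cat", 1.0);
        Pair<String, Double> far = new Pair<>("zebra", 4.0);
        // A negative result means "near" sorts before "far".
        System.out.println(near.compareTo(far) < 0);
    }
}

/** Assumed shape of the question's Pair: ordered by value, the key is ignored. */
final class Pair<K, V extends Comparable<? super V>> implements Comparable<Pair<K, V>> {
    private final K key;
    private final V value;

    Pair(K key, V value) {
        this.key = key;
        this.value = value;
    }

    K getKey() {
        return key;
    }

    V getValue() {
        return value;
    }

    @Override
    public int compareTo(Pair<K, V> other) {
        // Natural order is the order of the values (here: the distances).
        return value.compareTo(other.value);
    }

    @Override
    public String toString() {
        return "(" + key + ", " + value + ")";
    }
}
```

With this ordering, the pairs with the smallest distances sort first, which is what sorted().limit(n) relies on.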

Using streams, my naive solution looks like this:

double distance(String word1, String word2){
  ...
}

Collection<String> words = ...;
String word = "...";

words.stream()
  .map(w -> new Pair<String, Double>(w, distance(word, w)))
  .sorted()
  .limit(n);

To my understanding, this processes every element of words and stores all of them intermediately so that they can be sorted before limit() is applied. However, it would be more memory-efficient to use a collection that holds at most n elements: whenever adding a new element pushes it past that size, it evicts the greatest element (according to the elements' natural order), so it never grows larger than n (or n+1).

This is exactly what the Guava MinMaxPriorityQueue does. Thus, my current best solution to the above problem is this:

Queue<Pair<String, Double>> neighbours = MinMaxPriorityQueue.maximumSize(n).create();
words.stream()
  .forEach(w -> neighbours.add(new Pair<String, Double>(w, distance(word, w))));

The sorting of the top n elements remains to be done after converting the queue to a stream or list, but this is not an issue since n is relatively small.
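For illustration, the same bounded-collection idea can be sketched with only java.util.PriorityQueue. This is not how Guava implements MinMaxPriorityQueue; it is a simplified stand-in that uses a max-heap so the greatest retained element is always the one evicted. The class name BoundedSmallest and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper: retains only the `capacity` smallest elements added so far. */
final class BoundedSmallest<T extends Comparable<? super T>> {
    private final int capacity;
    // Max-heap: the head is the greatest retained element, i.e. the next one to evict.
    private final PriorityQueue<T> heap;

    BoundedSmallest(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.heap = new PriorityQueue<T>(capacity + 1, Comparator.<T>reverseOrder());
    }

    void add(T element) {
        heap.add(element);
        if (heap.size() > capacity) {
            heap.poll(); // evict the greatest; size briefly reaches capacity + 1
        }
    }

    /** Sorting happens only here, on at most `capacity` elements. */
    List<T> sortedAscending() {
        List<T> result = new ArrayList<>(heap);
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        BoundedSmallest<Integer> top = new BoundedSmallest<>(3);
        for (int v : new int[] {5, 1, 4, 2, 3}) {
            top.add(v);
        }
        System.out.println(top.sortedAscending());
    }
}
```

Each add costs O(log n) and memory stays bounded by n + 1 elements, whereas sorted() has to buffer the whole input before limit(n) can emit anything.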

My question is: is there a way to do the same using streams?

A heap-based structure will of course be more efficient than sorting the entire huge list. Luckily, the streams library is perfectly happy to let you use specialized collections when necessary:

// Requires: import static java.util.stream.Collectors.toCollection;
MinMaxPriorityQueue<Pair<String, Double>> topN = words.stream()
    .map(w -> new Pair<String, Double>(w, distance(word, w)))
    .collect(toCollection(
            () -> MinMaxPriorityQueue.maximumSize(n).create()
    ));

This is better than the .forEach solution because it is easy to parallelize and is more idiomatic Java 8.

Note that it should in principle be possible to replace the lambda () -> MinMaxPriorityQueue.maximumSize(n).create() with the method reference MinMaxPriorityQueue.maximumSize(n)::create but, for some reason, that does not compile under some conditions (see the comments on this answer).
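If Guava is not on the classpath, a similar result can be approximated with a custom collector built on java.util.PriorityQueue. The following is only a sketch under that assumption, not part of Guava or the JDK; TopNCollector and offerBounded are hypothetical names. Because the combiner merges two bounded heaps, it can also be used with a parallel stream.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.stream.Collector;
import java.util.stream.Stream;

/** Hypothetical collector that keeps the n smallest elements of a stream. */
final class TopNCollector {

    static <T extends Comparable<? super T>> Collector<T, ?, List<T>> smallest(int n) {
        return Collector.<T, PriorityQueue<T>, List<T>>of(
            // Max-heap per accumulation container: the head is the eviction candidate.
            () -> new PriorityQueue<T>(Comparator.<T>reverseOrder()),
            (heap, element) -> offerBounded(heap, element, n),
            // Combiner for parallel streams: fold the right heap into the left one.
            (left, right) -> {
                right.forEach(e -> offerBounded(left, e, n));
                return left;
            },
            // Finisher: sort the at most n survivors in ascending order.
            heap -> {
                List<T> sorted = new ArrayList<>(heap);
                sorted.sort(Comparator.<T>naturalOrder());
                return sorted;
            });
    }

    private static <T> void offerBounded(PriorityQueue<T> heap, T element, int n) {
        heap.add(element);
        if (heap.size() > n) {
            heap.poll(); // drop the greatest so at most n elements remain
        }
    }

    public static void main(String[] args) {
        List<Integer> top = Stream.of(5, 1, 4, 2, 3)
            .parallel()
            .collect(TopNCollector.smallest(3));
        System.out.println(top);
    }
}
```

In the setting of the question, the mapped Pair stream would be collected with TopNCollector.smallest(n), assuming Pair implements Comparable<Pair<String, Double>>.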
