
Performance of Stream.sorted().limit()

Java Streams have both sorted and limit methods, which respectively return a sorted version of a stream and a stream truncated to a specified number of items. When these operations are applied in succession, such as in:

stream.sorted().limit(qty).collect(Collectors.toList())

is the sort performed in a way that only orders qty items, or is the entire stream sorted? In other words, if qty is fixed, does this operation run in O(n)? The documentation doesn't specify the performance of these methods, either alone or in combination.

The reason I ask is that the obvious imperative implementation of these operations would be to sort and then limit, taking Θ(n * log(n)) time. But the two operations together can be performed in O(n * log(qty)), and a smart streaming framework could inspect the entire pipeline before executing it to optimize this special case.

Let me start with the general point that the Java API specification places few restrictions on how streams are implemented. So it is not very meaningful to ask about the performance of Java streams in general: it can vary significantly between implementations.

Also note that Stream is an interface. You can create your own class that implements Stream and gives sorted any performance or special behaviour you want. So asking about the performance of Stream as such makes little sense even within the context of one implementation. The OpenJDK implementation has many classes that implement the Stream interface.

Having said that, if we look at the OpenJDK implementation, sorting of streams ends up in the SortedOps class (see its source). There you will find that the sorting methods return extensions of stateful operations. For example:

private static final class OfInt extends IntPipeline.StatefulOp<Integer>

These operations check whether the upstream is already sorted, in which case they just pass it through to the downstream. They also have special handling for sized upstreams: they preallocate the array that they end up sorting, which is more efficient than the SpinedBuffer they use for streams of unknown size. But whenever the upstream is not already sorted, they accept all items, sort them, and only then send them to the accept method of the downstream sink.
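The buffering described above can be observed from outside the pipeline. The following is a minimal sketch, not part of the original answer: the class and method names are made up for illustration, and it assumes an OpenJDK-style implementation. It counts how many elements pass a peek() placed before sorted() when only two results are requested.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SortedBarrierDemo {
    // Hypothetical helper: returns how many elements sorted() pulled from
    // upstream even though limit(2) only needs two of them.
    static int elementsConsumedBySorted() {
        AtomicInteger seen = new AtomicInteger();
        List<Integer> smallest = Stream.of(5, 3, 9, 1, 7, 2, 8)
                .peek(x -> seen.incrementAndGet()) // runs once per element entering sorted()
                .sorted()
                .limit(2)
                .collect(Collectors.toList());
        System.out.println(smallest + " after consuming " + seen.get() + " elements");
        return seen.get();
    }

    public static void main(String[] args) {
        elementsConsumedBySorted();
    }
}
```

On OpenJDK the counter ends up equal to the full input size, which is consistent with sorted() acting as a barrier that collects everything before limit() sees the first item.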

So the conclusion is that the OpenJDK sorted implementation collects all items, then sorts, then sends them downstream. In some cases this wastes resources, because the downstream then discards some of the elements. You are free to implement your own specialised sort operation that is more efficient for such cases. Probably the most straightforward way is to implement a Collector that keeps a list of the n largest or smallest items in the stream. Your operation might then look something like:

.collect(new CollectNthLargest(4)).stream()

to replace

.sorted().limit(4)
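One way such a collector could be written is sketched below. This is an illustration under assumptions, not the answerer's code: SmallestK, smallest and offerBounded are hypothetical names, and the sketch keeps the k smallest elements by natural order in a bounded max-heap, which gives the O(n * log(qty)) bound mentioned in the question.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.stream.Collector;
import java.util.stream.Stream;

public class SmallestK {
    // Hypothetical helper: collects the k smallest elements, in ascending order.
    static <T extends Comparable<? super T>> Collector<T, ?, List<T>> smallest(int k) {
        Comparator<T> natural = Comparator.naturalOrder();
        return Collector.<T, PriorityQueue<T>, List<T>>of(
                // Max-heap: the head is the largest of the elements kept so far.
                () -> new PriorityQueue<T>(natural.reversed()),
                (heap, item) -> offerBounded(heap, item, k, natural),
                (left, right) -> {
                    // Merge partial results from parallel execution.
                    for (T item : right) {
                        offerBounded(left, item, k, natural);
                    }
                    return left;
                },
                heap -> {
                    List<T> result = new ArrayList<>(heap);
                    result.sort(natural); // only k elements are sorted here
                    return result;
                });
    }

    // Keeps at most k elements in the heap, evicting the largest when a smaller one arrives.
    private static <T> void offerBounded(PriorityQueue<T> heap, T item, int k, Comparator<T> order) {
        if (heap.size() < k) {
            heap.add(item);
        } else if (k > 0 && order.compare(item, heap.peek()) < 0) {
            heap.poll();
            heap.add(item);
        }
    }

    public static void main(String[] args) {
        List<Integer> result = Stream.of(5, 3, 9, 1, 7, 2, 8).collect(smallest(3));
        System.out.println(result);
    }
}
```

Because the collector returns a List, appending .stream() to the collect call continues the pipeline in the same way .sorted().limit(k) would.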

There's a special collector in my StreamEx library which performs this operation, MoreCollectors.least(qty):

List<?> result = stream.collect(MoreCollectors.least(qty));

It uses a PriorityQueue internally and works significantly faster for small qty on unsorted inputs. Note, however, that if the input is mostly sorted, sorted().limit(qty) may be faster, as TimSort is extremely fast on presorted data.

That's implementation-dependent and might also depend on whether the stream pipeline can "see through" any operations that sit between the sorted() and the limit().

Even if you ask about the OpenJDK implementation specifically, it is subject to change, since the javadocs make no guarantees about runtime behavior. But no, currently it does not implement a k-min selection algorithm.

You also have to keep in mind that sorted() doesn't work on infinite streams unless they already have the SORTED characteristic.
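A minimal sketch of the infinite-stream caveat, with made-up names (BoundBeforeSort, smallestOfFirst, window): the stream has to be bounded before sorted() is applied, because sorted() would otherwise keep buffering and never produce a result.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class BoundBeforeSort {
    // Hypothetical helper: sorts only the first `window` elements of an
    // infinite stream and returns the `qty` smallest of them.
    static List<Integer> smallestOfFirst(int window, int qty) {
        return Stream.iterate(1, i -> (i * 7) % 11) // infinite and unsorted
                .limit(window)                      // bound the stream first
                .sorted()                           // now sorts a finite buffer
                .limit(qty)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(smallestOfFirst(5, 3));
    }
}
```

Swapping the order to sorted().limit(window) on the same source would not terminate normally, since the sort has to see the last element before it can emit the first.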
