Performance of Stream.sorted().limit()
Java Streams provide both sorted and limit methods, which respectively return a sorted version of a stream and a stream truncated to a specified number of items. When these operations are applied in succession, such as in:
stream.sorted().limit(qty).collect(Collectors.toList())
is the sorting performed in a way that sorts only qty items, or is the entire list sorted? In other words, if qty is fixed, is this operation in O(n)? The documentation doesn't specify the performance of these methods, alone or in conjunction with each other.
The reason I ask is that the obvious imperative implementation of these operations would be to sort and then limit, taking time Θ(n * log(n)). But these operations together can be performed in O(n * log(qty)), and a smart streaming framework could view the entire stream pipeline before executing it to optimize this special case.
Let me start by making the general point that the Java API specification places few restrictions on how streams are implemented. So it's not very meaningful to ask about the performance of Java streams in general: it will vary significantly between implementations.
Also note that Stream is an interface. You can create your own class that implements Stream to have any performance or special behaviour on sorted that you want. So asking about the performance of Stream makes no sense even within the context of one implementation. The OpenJDK implementation has lots of classes that implement the Stream interface.
Having said that, if we look at the OpenJDK implementation, sorting of streams ends up in the SortedOps class (see source here), where you will find that the sorting methods end up returning extensions of stateful operations. For example:
private static final class OfInt extends IntPipeline.StatefulOp<Integer>
These operations check whether the upstream is already sorted, in which case they just pass elements through to the downstream. They also have special cases for sized upstreams, which preallocate the array that they end up sorting; this improves efficiency over the SpinedBuffer that they use for streams of unknown size. But whenever the upstream is not already sorted, they accept all items, then sort them, and then send them to the accept method of the downstream instance.
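One way to observe this behaviour is to count how many elements pass a peek() placed upstream of sorted().limit(qty). The following is only a rough probe, not taken from the OpenJDK sources; the class name SortedLimitProbe and the method pulledFromSource are made up for illustration. For an unsorted source, every element has to be pulled before anything reaches limit. For a source that already reports itself as sorted (a TreeSet here), current OpenJDK builds are expected to pull fewer elements, but that part is implementation-specific.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

public class SortedLimitProbe {

    // Hypothetical helper: returns how many source elements flowed past peek()
    // while evaluating source.stream().sorted().limit(qty).
    static int pulledFromSource(Collection<Integer> source, int qty) {
        AtomicInteger seen = new AtomicInteger();
        source.stream()
              .peek(x -> seen.incrementAndGet())
              .sorted()
              .limit(qty)
              .collect(Collectors.toList());
        return seen.get();
    }

    public static void main(String[] args) {
        List<Integer> unsorted = Arrays.asList(5, 3, 9, 1, 7, 2, 8);

        // Unsorted source: sorted() must buffer all 7 elements first.
        System.out.println(pulledFromSource(unsorted, 2)); // prints 7

        // Source already flagged as sorted: the count may be smaller,
        // depending on the stream implementation in use.
        System.out.println(pulledFromSource(new TreeSet<>(unsorted), 2));
    }
}
```

The second number is deliberately left without an expected value, since the Javadoc makes no promise about it.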
So the conclusion from this is that the OpenJDK sorted implementation collects all items, then sorts, then sends downstream. In some cases this wastes resources, because the downstream then discards some elements. You are free to implement your own specialised sort operation that is more efficient than this for special cases. Probably the most straightforward way is to implement a Collector that keeps a list of the n largest or smallest items in the stream. Your operation might then look something like:
.collect(new CollectNthLargest(4)).stream()
to replace:
.sorted().limit(4)
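A minimal sketch of such a collector, assuming a bounded max-heap is an acceptable way to track the smallest items. The names SmallestK, smallest and offerBounded are hypothetical and are not part of the JDK or of any library mentioned here. Each element costs at most O(log k) heap work, which gives the O(n * log(qty)) bound discussed in the question.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.stream.Collector;
import java.util.stream.Stream;

public class SmallestK {

    // Hypothetical collector: keeps the k smallest elements seen so far
    // and returns them in ascending order according to cmp.
    static <T> Collector<T, PriorityQueue<T>, List<T>> smallest(int k, Comparator<T> cmp) {
        return Collector.of(
            // Max-heap: the largest of the retained elements sits at the head.
            () -> new PriorityQueue<T>(cmp.reversed()),
            (heap, item) -> offerBounded(heap, item, k, cmp),
            (left, right) -> {
                // Merge partial results when the stream runs in parallel.
                for (T item : right) {
                    offerBounded(left, item, k, cmp);
                }
                return left;
            },
            heap -> {
                List<T> out = new ArrayList<>(heap);
                out.sort(cmp);
                return out;
            });
    }

    // Adds item only if the heap has room or item beats the current largest.
    private static <T> void offerBounded(PriorityQueue<T> heap, T item, int k, Comparator<T> cmp) {
        if (heap.size() < k) {
            heap.offer(item);
        } else if (k > 0 && cmp.compare(item, heap.peek()) < 0) {
            heap.poll();
            heap.offer(item);
        }
    }

    public static void main(String[] args) {
        List<Integer> result = Stream.of(5, 3, 9, 1, 7, 2, 8)
            .collect(SmallestK.<Integer>smallest(3, Comparator.naturalOrder()));
        System.out.println(result); // prints [1, 2, 3]
    }
}
```

Unlike sorted().limit(k), which uses a stable sort on ordered streams, this sketch makes no promise about which of several equal elements is retained.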
There's a special collector in my StreamEx library which performs this operation, MoreCollectors.least(qty):
List<?> result = stream.collect(MoreCollectors.least(qty));
It uses a PriorityQueue internally and works significantly faster with small qty on unsorted inputs. Note, however, that if the input is mostly sorted, then sorted().limit(qty) may work faster, as TimSort is incredibly fast for presorted data.
That's implementation-dependent and might also depend on whether the stream pipeline can "see through" potential operations between the sorted() and the limit().
Even if you were to ask about the OpenJDK implementation, it is subject to change, since the javadocs make no guarantees about the runtime behavior. But no, currently it does not implement a k-min selection algorithm.
You also have to keep in mind that sorted() doesn't work on infinite streams unless they already have the SORTED characteristic.