
Why filter() after flatMap() is "not completely" lazy in Java streams?

I have the following sample code:

System.out.println(
       "Result: " +
        Stream.of(1, 2, 3)
                .filter(i -> {
                    System.out.println(i);
                    return true;
                })
                .findFirst()
                .get()
);
System.out.println("-----------");
System.out.println(
       "Result: " +
        Stream.of(1, 2, 3)
                .flatMap(i -> Stream.of(i - 1, i, i + 1))
                .flatMap(i -> Stream.of(i - 1, i, i + 1))
                .filter(i -> {
                    System.out.println(i);
                    return true;
                })
                .findFirst()
                .get()
);

The output is as follows:

1
Result: 1
-----------
-1
0
1
0
1
2
1
2
3
Result: -1

From this I see that in the first case the stream really behaves lazily: we use findFirst(), so once we have the first element, our filtering lambda is not invoked. However, in the second case, which uses flatMaps, we see that even though the first element fulfilling the filter condition is found (it's just any first element, since the lambda always returns true), further contents of the stream are still fed through the filtering function.

I am trying to understand why it behaves like this rather than giving up after the first element is calculated, as in the first case. Any helpful information would be appreciated.

TL;DR: this has been addressed in JDK-8075939 and fixed in Java 10 (and backported to Java 8 in JDK-8225328).

When looking into the implementation (ReferencePipeline.java), we see the method [link]

@Override
final void forEachWithCancel(Spliterator<P_OUT> spliterator, Sink<P_OUT> sink) {
    do { } while (!sink.cancellationRequested() && spliterator.tryAdvance(sink));
}

which will be invoked for the findFirst operation. The special thing to take care about is sink.cancellationRequested(), which allows the loop to end on the first match. Compare this to [link]

@Override
public final <R> Stream<R> flatMap(Function<? super P_OUT, ? extends Stream<? extends R>> mapper) {
    Objects.requireNonNull(mapper);
    // We can do better than this, by polling cancellationRequested when stream is infinite
    return new StatelessOp<P_OUT, R>(this, StreamShape.REFERENCE,
                                 StreamOpFlag.NOT_SORTED | StreamOpFlag.NOT_DISTINCT | StreamOpFlag.NOT_SIZED) {
        @Override
        Sink<P_OUT> opWrapSink(int flags, Sink<R> sink) {
            return new Sink.ChainedReference<P_OUT, R>(sink) {
                @Override
                public void begin(long size) {
                    downstream.begin(-1);
                }

                @Override
                public void accept(P_OUT u) {
                    try (Stream<? extends R> result = mapper.apply(u)) {
                        // We can do better that this too; optimize for depth=0 case and just grab spliterator and forEach it
                        if (result != null)
                            result.sequential().forEach(downstream);
                    }
                }
            };
        }
    };
}

The method for advancing one item ends up calling forEach on the sub-stream without any possibility for earlier termination, and the comment at the beginning of the flatMap method even tells about this absent feature.
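To make the difference concrete, here is a small, self-contained sketch (my own illustration, not JDK code) of the pull-style approach that forEachWithCancel embodies: advance the sub-stream's spliterator one element at a time and stop as soon as the consumer signals it has seen enough, which terminates even on an infinite sub-stream.

import java.util.Spliterator;
import java.util.stream.Stream;

public class PullWithCancelSketch {
    public static void main(String[] args) {
        Stream<Integer> subStream = Stream.iterate(0, i -> i + 1); // infinite sub-stream
        Spliterator<Integer> s = subStream.spliterator();

        int[] first = new int[1];
        boolean[] done = { false };

        // Same shape as forEachWithCancel: keep advancing until cancellation
        // is requested (here: as soon as the first element has been seen).
        do { } while (!done[0] && s.tryAdvance(x -> { first[0] = x; done[0] = true; }));

        System.out.println("First: " + first[0]); // prints "First: 0" and terminates
    }
}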

Since this is more than just an optimization thing, as it implies that the code simply breaks when the sub-stream is infinite, I hope that the developers soon prove that they “can do better than this”…


To illustrate the implications: while Stream.iterate(0, i->i+1).findFirst() works as expected, Stream.of("").flatMap(x->Stream.iterate(0, i->i+1)).findFirst() will end up in an infinite loop.

Regarding the specification, most of it can be found in the chapter “Stream operations and pipelines” of the package specification:

Intermediate operations return a new stream. They are always lazy;

… Laziness also allows avoiding examining all the data when it is not necessary; for operations such as "find the first string longer than 1000 characters", it is only necessary to examine just enough strings to find one that has the desired characteristics without examining all of the strings available from the source. (This behavior becomes even more important when the input stream is infinite and not merely large.)

Further, some operations are deemed short-circuiting operations. An intermediate operation is short-circuiting if, when presented with infinite input, it may produce a finite stream as a result. A terminal operation is short-circuiting if, when presented with infinite input, it may terminate in finite time. Having a short-circuiting operation in the pipeline is a necessary, but not sufficient, condition for the processing of an infinite stream to terminate normally in finite time.

It's clear that a short-circuiting operation doesn't guarantee termination in finite time, e.g. when a filter doesn't match any item the processing can't complete, but an implementation which doesn't support any termination in finite time, by simply ignoring the short-circuiting nature of an operation, is far off the specification.
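As a concrete illustration of the quoted wording (a minimal example of my own, not taken from the spec), findFirst() is a short-circuiting terminal operation, so it can terminate on an infinite source as long as the predicate eventually matches:

import java.util.stream.Stream;

public class ShortCircuitSpecDemo {
    public static void main(String[] args) {
        // findFirst() is short-circuiting: the infinite source is only
        // examined until the predicate matches for the first time.
        int firstOver1000 = Stream.iterate(1, i -> i * 2)
                .filter(i -> i > 1000)
                .findFirst()
                .get();
        System.out.println(firstOver1000); // 1024

        // If the predicate never matched (e.g. i -> i < 0 on this source),
        // the same pipeline would never terminate: a short-circuiting
        // operation is necessary, but not sufficient, for finite-time termination.
    }
}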

The elements of the input stream are consumed lazily one by one. The first element, 1, is transformed by the two flatMaps into the stream -1, 0, 1, 0, 1, 2, 1, 2, 3, so that the entire stream corresponds to just the first input element. The nested streams are eagerly materialized by the pipeline, then flattened, then fed to the filter stage. This explains your output.
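For instance (a minimal demonstration of that expansion), flat-mapping just the first input element already yields those nine values, which is exactly the sequence printed before "Result: -1" above:

// The two flatMap stages expand the single element 1 into nine elements;
// this whole chunk is pushed through the filter at once.
Stream.of(1)
        .flatMap(i -> Stream.of(i - 1, i, i + 1))
        .flatMap(i -> Stream.of(i - 1, i, i + 1))
        .forEach(System.out::println); // -1 0 1 0 1 2 1 2 3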

The above does not stem from a fundamental limitation, but it would probably make things much more complicated to get full-blown laziness for nested streams. I suspect it would be an even greater challenge to make it performant.

For comparison, Clojure's lazy seqs get another layer of wrapping for each such level of nesting. Due to this design, the operations may even fail with StackOverflowError when nesting is exercised to the extreme.

With regard to breakage with infinite sub-streams, the behavior of flatMap becomes still more surprising when one throws in an intermediate (as opposed to terminal) short-circuiting operation.

While the following works as expected, printing out the infinite sequence of integers

Stream.of("x").flatMap(_x -> Stream.iterate(1, i -> i + 1)).forEach(System.out::println);

the following code prints out only the "1", but still does not terminate:

Stream.of("x").flatMap(_x -> Stream.iterate(1, i -> i + 1)).limit(1).forEach(System.out::println);

I cannot imagine a reading of the spec in which that were not a bug.

In my free StreamEx library I introduced short-circuiting collectors. When collecting a sequential stream with a short-circuiting collector (like MoreCollectors.first()), exactly one element is consumed from the source. Internally it's implemented in a quite dirty way: using a custom exception to break the control flow. Using my library, your sample could be rewritten in this way:

System.out.println(
        "Result: " +
                StreamEx.of(1, 2, 3)
                .flatMap(i -> Stream.of(i - 1, i, i + 1))
                .flatMap(i -> Stream.of(i - 1, i, i + 1))
                .filter(i -> {
                    System.out.println(i);
                    return true;
                })
                .collect(MoreCollectors.first())
                .get()
        );

The result is the following:

-1
Result: -1
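To make the "custom exception to break the control flow" remark above concrete, here is a simplified, hypothetical sketch of that trick (not the actual StreamEx internals): an exception thrown from the consumer propagates out of every nested forEach, so it short-circuits even the eagerly pushed flat-mapped elements.

import java.util.Optional;
import java.util.stream.Stream;

public class ExceptionShortCircuitSketch {
    // Hypothetical marker exception, used only to abort forEach early.
    private static final class Stop extends RuntimeException {
        final Object value;
        Stop(Object value) { super(null, null, false, false); this.value = value; }
    }

    // Simplified "first()" built on the exception trick: forEach offers no way
    // to stop, so we throw from the consumer as soon as one element arrives.
    @SuppressWarnings("unchecked")
    static <T> Optional<T> first(Stream<T> stream) {
        try {
            stream.forEach(t -> { throw new Stop(t); });
            return Optional.empty();
        } catch (Stop stop) {
            return Optional.of((T) stop.value);
        }
    }

    public static void main(String[] args) {
        Optional<Integer> result = first(
                Stream.of(1, 2, 3)
                        .flatMap(i -> Stream.of(i - 1, i, i + 1))
                        .flatMap(i -> Stream.of(i - 1, i, i + 1))
                        .filter(i -> { System.out.println(i); return true; }));
        System.out.println("Result: " + result.get()); // prints -1, then "Result: -1"
    }
}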

While JDK-8075939 has been fixed in Java 11 and backported to 10 and 8u222, there's still an edge case of flatMap() not being truly lazy when using Stream.iterator(): JDK-8267359, still present in Java 17.

This

Iterator<Integer> it =
    Stream.of("a", "b")
        .flatMap(s -> Stream
            .of(1, 2, 3, 4)
            .filter(i -> { System.out.println(i); return true; }))
        .iterator();

it.hasNext(); // This consumes the entire flatmapped stream
it.next();

Prints

1
2
3
4

While this:

Iterator<Integer> it =
    Stream.of("a", "b")
        .flatMap(s -> Stream
            .iterate(1, i -> i)
            .filter(i -> { System.out.println(i); return true; }))
        .iterator();

it.hasNext();
it.next();

Never terminates

I agree with other people that this is a bug, reported as JDK-8075939. Since it's still not fixed more than one year later, I would like to recommend AbacusUtil:

N.println("Result: " + Stream.of(1, 2, 3).peek(N::println).first().get());

N.println("-----------");

N.println("Result: " + Stream.of(1, 2, 3)
                        .flatMap(i -> Stream.of(i - 1, i, i + 1))
                        .flatMap(i -> Stream.of(i - 1, i, i + 1))
                        .peek(N::println).first().get());

// output:
// 1
// Result: 1
// -----------
// -1
// Result: -1

Disclosure: I'm the developer of AbacusUtil.

Unfortunately, .flatMap() is not lazy. However, a custom flatMap workaround is available here: Why .flatMap() is so inefficient (non lazy) in java 8 and java 9.
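For reference, a lazy flatMap along the lines of that workaround can be sketched with iterators (a simplified version of my own, not the code from the linked answer): the source is only advanced when the downstream actually pulls another element, so findFirst() over an infinite inner stream terminates.

import java.util.Collections;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Spliterators;
import java.util.function.Function;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class LazyFlatMapSketch {
    // Pull-based flatMap: each hasNext()/next() call advances the source
    // just far enough to produce one more element.
    static <T, R> Stream<R> lazyFlatMap(Stream<T> source,
                                        Function<? super T, ? extends Stream<? extends R>> mapper) {
        Iterator<T> outer = source.iterator();
        Iterator<R> flattened = new Iterator<R>() {
            private Iterator<? extends R> inner = Collections.<R>emptyIterator();

            @Override
            public boolean hasNext() {
                while (!inner.hasNext()) {
                    if (!outer.hasNext()) {
                        return false;
                    }
                    inner = mapper.apply(outer.next()).iterator();
                }
                return true;
            }

            @Override
            public R next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return inner.next();
            }
        };
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(flattened, 0), false);
    }

    public static void main(String[] args) {
        // Terminates immediately, unlike the built-in flatMap on affected JDKs.
        int first = lazyFlatMap(Stream.of("x"), s -> Stream.iterate(1, i -> i + 1))
                .findFirst()
                .get();
        System.out.println(first); // 1
    }
}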

Today I also stumbled upon this bug. The behavior is not so straightforward: a simple case like the one below works fine, but similar production code doesn't.

 stream(spliterator).map(o -> o).flatMap(Stream::of).flatMap(Stream::of).findAny()

For those who cannot wait another couple of years for the migration to JDK 10, there is an alternative truly lazy stream. It doesn't support parallel processing. It was intended for JavaScript translation, but it worked out for me, because the interface is the same.

StreamHelper is collection-based, but it is easy to adapt a Spliterator.

https://github.com/yaitskov/j4ts/blob/stream/src/main/java/javaemul/internal/stream/StreamHelper.java
