Show progress of Java 8 stream processing

I have a Stream processing a few million elements. The Map-Reduce algorithm behind it takes a few milliseconds per element, so the whole task takes about twenty minutes to complete.

Stream<MyData> myStream = readData();
MyResult result = myStream
    .map(row -> process(row))
    .peek(stat -> System.out.println("Hi, I processed another item"))
    .reduce(MyStat::aggregate);

I'd like a way to display overall progress instead of printing a line per element (which produces thousands of lines per second, takes time, and says nothing useful about overall progress). I would like to display something similar to:

 5% (08s)
10% (14s)
15% (20s)
...

What would be the best (and/or easiest) way to do this?

First of all, Streams are not meant for this kind of task (as opposed to a classic data structure). If you already know how many elements your stream will process, you could use the following approach, which is, I repeat, not what streams are designed for.

Stream<MyData> myStream = readData();
final AtomicInteger loader = new AtomicInteger();
// elementsCount must be known in advance; Math.max avoids a zero step (and a division by zero) for small inputs
int fivePercent = Math.max(1, elementsCount / 20);
MyResult result = myStream
    .map(row -> process(row))
    .peek(stat -> {
        // use the value returned by incrementAndGet() so both printed lines agree
        int done = loader.incrementAndGet();
        if (done % fivePercent == 0) {
            System.out.println(done + " of " + elementsCount + " elements processed");
            System.out.println((100L * done / elementsCount) + "%");
        }
    })
    .reduce(MyStat::aggregate);
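
The output format asked for in the question, such as "5% (08s)", also needs the elapsed time. The following is a minimal sketch of that idea, not part of the original answer: the class name `PercentProgress` and the helper `reporting` are made up for illustration, and it assumes the total element count is known and greater than zero.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;
import java.util.function.UnaryOperator;
import java.util.stream.IntStream;

class PercentProgress {

    /**
     * Hypothetical helper: returns a pass-through operator that sends a line
     * such as " 10% (14s)" to `sink` each time another `stepPercent` of
     * `total` elements has passed through. Assumes total > 0.
     */
    public static <T> UnaryOperator<T> reporting(long total, int stepPercent, Consumer<String> sink) {
        AtomicLong processed = new AtomicLong();
        long start = System.nanoTime();
        return item -> {
            long done = processed.incrementAndGet();
            long percent = done * 100 / total;
            long previous = (done - 1) * 100 / total;
            // Report only when this element crosses a multiple of stepPercent.
            if (percent / stepPercent > previous / stepPercent) {
                long seconds = (System.nanoTime() - start) / 1_000_000_000L;
                long shown = percent / stepPercent * stepPercent;
                sink.accept(String.format("%3d%% (%02ds)", shown, seconds));
            }
            return item;
        };
    }

    public static void main(String[] args) {
        int total = 1_000;
        int sum = IntStream.range(0, total).boxed()
            .map(reporting(total, 5, System.out::println))
            .reduce(0, Integer::sum);
        System.out.println("result " + sum);
    }
}
```

Passing the sink as a `Consumer<String>` keeps the counting separate from the printing, so the same operator can write to a logger or a UI instead of the console.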

As others have pointed out, this has some caveats. First of all, streams are not supposed to be used for something like this.

On a more technical level, one could further argue:

  • A stream can be infinite
  • Even if you know the number of elements, this number might be distorted by operations like filter or flatMap
  • For a parallel stream, tracking the progress introduces a synchronization point
  • If there is an expensive terminal operation (like the aggregation in your case), the reported progress might not sensibly reflect the computation time
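
The second point can be illustrated with a small sketch. The class `CounterPlacement` and its method are hypothetical names for this illustration only. A counter placed before a `filter` sees every source element, while a counter placed after it sees only the survivors, so only the first can be compared against the known source size.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

class CounterPlacement {

    /** Returns { elements seen before the filter, elements seen after the filter }. */
    public static long[] countBeforeAndAfterFilter(List<Integer> source) {
        AtomicLong before = new AtomicLong();
        AtomicLong after = new AtomicLong();
        source.stream()
            .peek(x -> before.incrementAndGet())   // counts every source element
            .filter(x -> x % 2 == 0)
            .peek(x -> after.incrementAndGet())    // counts only elements that passed the filter
            .forEach(x -> { });
        return new long[] { before.get(), after.get() };
    }

    public static void main(String[] args) {
        long[] counts = countBeforeAndAfterFilter(Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9));
        System.out.println("before filter: " + counts[0] + ", after filter: " + counts[1]);
    }
}
```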

However, keeping this in mind, one approach that might be reasonable for your application case is this:

You could create a Function<T,T> that is passed to a map of the stream. (At least, I'd prefer that over using peek on the stream, as suggested in another answer.) This function could keep track of the progress, using an AtomicLong to count the elements. To keep separate things separate, this progress can then simply be forwarded to a Consumer<Long> (a LongConsumer in the code below), which takes care of the presentation.

The "presentation" here refers to printing this progress to the console, normalized or as percentages, relative to a size that could be known wherever the consumer is created. But the consumer can also take care of printing only, for example, every 10th element, or of printing a message only if at least 5 seconds have passed since the previous one.

import java.util.Iterator;
import java.util.Locale;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;
import java.util.function.LongConsumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class StreamProgress
{
    public static void main(String[] args)
    {
        int size = 250;
        Stream<Integer> stream = readData(size);

        LongConsumer progressConsumer = progress -> 
        {
            // "Filter" the output here: Report only every 10th element
            if (progress % 10 == 0)
            {
                double relative = (double) progress / (size - 1);
                double percent = relative * 100;
                System.out.printf(Locale.ENGLISH,
                    "Progress %8d, relative %2.5f, percent %3.2f\n",
                    progress, relative, percent);
            }
        };

        Integer result = stream
            .map(element -> process(element))
            .map(progressMapper(progressConsumer))
            .reduce(0, (a, b) -> a + b);

        System.out.println("result " + result);
    }

    private static <T> Function<T, T> progressMapper(
        LongConsumer progressConsumer)
    {
        AtomicLong counter = new AtomicLong(0);
        return t -> 
        {
            long n = counter.getAndIncrement();
            progressConsumer.accept(n);
            return t;
        };

    }

    private static Integer process(Integer element)
    {
        return element * 2;
    }

    private static Stream<Integer> readData(int size)
    {
        Iterator<Integer> iterator = new Iterator<Integer>()
        {
            int n = 0;
            @Override
            public Integer next()
            {
                try
                {
                    Thread.sleep(10);
                }
                catch (InterruptedException e)
                {
                    e.printStackTrace();
                }
                return n++;
            }

            @Override
            public boolean hasNext()
            {
                return n < size;
            }
        };
        return StreamSupport.stream(
            Spliterators.spliteratorUnknownSize(
                iterator, Spliterator.ORDERED), false);
    }
}
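
The example above filters by element count. The time-based variant mentioned earlier (print only if a given interval has passed since the previous message) could look roughly like the following sketch. The names `ThrottledProgress` and `atMostEvery` are assumptions made for this illustration, and the `clock` parameter is a hypothetical hook so the behavior can be checked without sleeping.

```java
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

class ThrottledProgress {

    /**
     * Wraps `delegate` so that it is invoked at most once per `intervalMillis`.
     * `clock` supplies the current time in milliseconds; in real use this
     * would typically be System::currentTimeMillis.
     */
    public static LongConsumer atMostEvery(long intervalMillis, LongSupplier clock, LongConsumer delegate) {
        return new LongConsumer() {
            private long lastReport = clock.getAsLong();

            // synchronized so the wrapper also behaves sensibly with a parallel stream
            @Override
            public synchronized void accept(long progress) {
                long now = clock.getAsLong();
                if (now - lastReport >= intervalMillis) {
                    lastReport = now;
                    delegate.accept(progress);
                }
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        LongConsumer printer = atMostEvery(100, System::currentTimeMillis,
            n -> System.out.println("processed " + n + " elements so far"));
        // Stand-in for a slow stream: one element roughly every 10 ms.
        for (long n = 1; n <= 50; n++) {
            Thread.sleep(10);
            printer.accept(n);
        }
    }
}
```

The resulting `LongConsumer` can be handed to `progressMapper` in place of the count-based `progressConsumer`.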

Whether this is possible depends heavily on the type of source behind the stream. If you have a collection and want to apply some operations to it, you can do it, because you know the size of the collection and can keep a count of the processed elements. But there is a caveat in this case as well: if you do parallel computations in the stream, this becomes more difficult.
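
For the collection case with a parallel stream, one possible sketch is to let the worker threads only increment a `LongAdder` and have a separate timer thread read it periodically. This is an illustration under assumptions, not a method from the answers here: the class `PolledProgress`, the method `sumWithProgress`, and the doubling step that stands in for the real processing are all made up.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class PolledProgress {

    /** Sums the doubled elements of `data` in parallel while a timer thread prints progress. */
    public static long sumWithProgress(List<Integer> data, long reportEveryMillis) {
        long total = data.size();               // known up front because the source is a collection
        LongAdder processed = new LongAdder();  // designed for frequent updates from many threads

        ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();
        reporter.scheduleAtFixedRate(() -> {
            long done = processed.sum();        // a snapshot; may lag slightly behind the workers
            System.out.printf("%d of %d (%d%%)%n", done, total, done * 100 / Math.max(1, total));
        }, reportEveryMillis, reportEveryMillis, TimeUnit.MILLISECONDS);

        try {
            return data.parallelStream()
                .mapToLong(x -> x * 2L)              // stand-in for the real per-element work
                .peek(x -> processed.increment())
                .sum();
        } finally {
            reporter.shutdownNow();             // stop the timer thread once the stream is done
        }
    }

    public static void main(String[] args) {
        List<Integer> data = IntStream.range(0, 2_000_000).boxed().collect(Collectors.toList());
        System.out.println("result " + sumWithProgress(data, 100));
    }
}
```

With work this cheap the stream may finish before the first report is due; the reports become visible once each element takes noticeable time.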

When you are streaming data from outside the application, it is very difficult to model the progress, because you don't know when the stream will end.
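
When the total is unknown, a percentage cannot be computed, but a running count and an average throughput can still be shown. The following sketch assumes exactly that fallback; `OpenEndedProgress`, `describe` and `counting` are hypothetical names, and `Stream.generate(...).limit(...)` only stands in for an external source that eventually ends.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.UnaryOperator;
import java.util.stream.Stream;

class OpenEndedProgress {

    /** Formats a count and the average throughput, e.g. "2000 elements, 500.0 per second". */
    public static String describe(long processed, long elapsedNanos) {
        double seconds = Math.max(elapsedNanos, 1) / 1e9;   // avoid dividing by zero
        return String.format(Locale.ENGLISH, "%d elements, %.1f per second",
            processed, processed / seconds);
    }

    /** Pass-through operator that prints one line every `every` elements. */
    public static <T> UnaryOperator<T> counting(long every) {
        AtomicLong processed = new AtomicLong();
        long start = System.nanoTime();
        return item -> {
            long done = processed.incrementAndGet();
            if (done % every == 0) {
                System.out.println(describe(done, System.nanoTime() - start));
            }
            return item;
        };
    }

    public static void main(String[] args) {
        int totalLength = Stream.generate(() -> "row")
            .limit(10_000)
            .map(counting(2_000))
            .mapToInt(String::length)
            .sum();
        System.out.println("done, total length " + totalLength);
    }
}
```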
