
Java 8 streams conditional processing

I'm interested in separating a stream into two or more substreams, and processing the elements in different ways. For example, a (large) text file might contain lines of type A and lines of type B, in which case I'd like to do something like:

Files.lines(path)
    .filter(line -> isTypeA(line))
    .forEachTrue(line -> processTypeA(line))
    .forEachFalse(line -> processTypeB(line))

The previous is my attempt at abstracting the situation. In reality I have a very large text file where each line is tested against a regex; if the line passes, then it is processed, whereas if it is rejected, then I want to update a counter. This further processing of rejected strings is why I don't simply use filter.

Is there any reasonable way to do this with streams, or will I have to fall back to loops? (I would like this to run in parallel as well, so streams are my first choice.)

Java 8 streams weren't designed to support this kind of operation. From the JDK documentation:

A stream should be operated on (invoking an intermediate or terminal stream operation) only once. This rules out, for example, "forked" streams, where the same source feeds two or more pipelines, or multiple traversals of the same stream.

If you can store it in memory, you can use Collectors.partitioningBy if you have just two types, and work with a Map<Boolean, List<String>>. Otherwise use Collectors.groupingBy.
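As a hedged sketch of the partitioningBy approach (the PartitionDemo class and its "A:"-prefix isTypeA test are illustrative assumptions, not from the question):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PartitionDemo {
    // hypothetical type test: lines starting with "A:" are type A
    static boolean isTypeA(String line) {
        return line.startsWith("A:");
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("A:one", "B:two", "A:three");
        // partitioningBy splits the stream into two lists keyed by the predicate result
        Map<Boolean, List<String>> parts = lines.stream()
                .collect(Collectors.partitioningBy(PartitionDemo::isTypeA));
        System.out.println(parts.get(true));   // type-A lines
        System.out.println(parts.get(false));  // type-B lines
    }
}
```

Both keys are always present in the resulting map, even when one partition is empty, which is one reason to prefer partitioningBy over groupingBy for a boolean split.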

Simply test each element, and act accordingly.

lines.forEach(line -> {
    if (isTypeA(line)) processTypeA(line);
    else processTypeB(line);
});

This behavior could be hidden in a helper method:

public static <T> Consumer<T> branch(Predicate<? super T> test, 
                                     Consumer<? super T> t, 
                                     Consumer<? super T> f) {
    return o -> {
        if (test.test(o)) t.accept(o);
        else f.accept(o);
    };
}

Then the usage would look like this:

lines.forEach(branch(this::isTypeA, this::processTypeA, this::processTypeB));

Tangential note

The Files.lines() method does not close the underlying file, so you must use it like this:

try (Stream<String> lines = Files.lines(path, encoding)) {
  lines.forEach(...);
}

Variables of Stream type throw up a bit of a red flag for me, so I prefer to manage a BufferedReader directly:

try (BufferedReader lines = Files.newBufferedReader(path, encoding)) {
    lines.lines().forEach(...);
}

While side effects in behavioral parameters are discouraged, they are not forbidden, as long as there's no interference, so the simplest, though not cleanest, solution is to count right in the filter:

AtomicInteger rejected = new AtomicInteger();
Files.lines(path)
    .filter(line -> {
        boolean accepted = isTypeA(line);
        if (!accepted) rejected.incrementAndGet();
        return accepted;
    })
    // chain processing of matched lines

As long as you are processing all items, the result will be consistent. Only if you are using a short-circuiting terminal operation (in a parallel stream) will the result become unpredictable.

Updating an atomic variable may not be the most efficient solution, but in the context of processing lines from a file, the overhead will likely be negligible.
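If the counter does become a contention point in a parallel stream, java.util.concurrent.LongAdder is a lower-contention drop-in for AtomicInteger here. A minimal sketch, with a hypothetical isTypeA and in-memory data standing in for the file:

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Stream;

public class RejectCountDemo {
    // hypothetical type test standing in for the question's regex
    static boolean isTypeA(String line) {
        return line.startsWith("A:");
    }

    public static void main(String[] args) {
        LongAdder rejected = new LongAdder();
        long accepted = Stream.of("A:x", "B:y", "A:z", "B:w")
                .parallel()
                .filter(line -> {
                    boolean ok = isTypeA(line);
                    if (!ok) rejected.increment();  // side effect: count rejected lines
                    return ok;
                })
                .count();
        System.out.println(accepted + " accepted, " + rejected.sum() + " rejected");
    }
}
```

LongAdder spreads updates across internal cells under contention, so parallel increments scale better than a single compare-and-swap loop on one AtomicInteger.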

If you want a clean, parallel-friendly solution, one general approach is to implement a Collector which can combine the processing of two collect operations based on a condition. This requires that you are able to express the downstream operation as a collector, but most stream operations can be expressed as collectors (and the trend is going towards the possibility of expressing all operations that way, i.e. Java 9 will add the currently missing filtering and flatMapping).

You'll need a pair type to hold the two results, so assuming a sketch like

class Pair<A,B> {
    final A a;
    final B b;
    Pair(A a, B b) {
        this.a=a;
        this.b=b;
    }
}

the combining collector implementation will look like

public static <T, A1, A2, R1, R2> Collector<T, ?, Pair<R1,R2>> conditional(
        Predicate<? super T> predicate,
        Collector<T, A1, R1> whenTrue, Collector<T, A2, R2> whenFalse) {
    Supplier<A1> s1=whenTrue.supplier();
    Supplier<A2> s2=whenFalse.supplier();
    BiConsumer<A1, T> a1=whenTrue.accumulator();
    BiConsumer<A2, T> a2=whenFalse.accumulator();
    BinaryOperator<A1> c1=whenTrue.combiner();
    BinaryOperator<A2> c2=whenFalse.combiner();
    Function<A1,R1> f1=whenTrue.finisher();
    Function<A2,R2> f2=whenFalse.finisher();
    return Collector.of(
        ()->new Pair<>(s1.get(), s2.get()),
        (p,t)->{
            if(predicate.test(t)) a1.accept(p.a, t); else a2.accept(p.b, t);
        },
        (p1,p2)->new Pair<>(c1.apply(p1.a, p2.a), c2.apply(p1.b, p2.b)),
        p -> new Pair<>(f1.apply(p.a), f2.apply(p.b)));
}

and it can be used, for example, for collecting the matching items into a list and counting the non-matching items, like this:

Pair<List<String>, Long> p = Files.lines(path)
  .collect(conditional(line -> isTypeA(line), Collectors.toList(), Collectors.counting()));
List<String> matching=p.a;
long nonMatching=p.b;

The collector is parallel friendly and allows arbitrarily complex delegate collectors, but note that with the current implementation, the stream returned by Files.lines might not perform so well with parallel processing; compare "Reader#lines() parallelizes badly due to nonconfigurable batch size policy in its spliterator". Improvements are scheduled for the Java 9 release.
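For comparison, on newer JDKs the same pairing can be had without a custom collector: Java 12's Collectors.teeing feeds every element to two downstream collectors and merges their results, and Java 9's Collectors.filtering restricts each branch to its own type. A sketch (isTypeA and the sample data are assumptions):

```java
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TeeingDemo {
    // hypothetical type test standing in for the question's regex
    static boolean isTypeA(String line) {
        return line.startsWith("A:");
    }

    public static void main(String[] args) {
        // teeing passes every element to both branches;
        // filtering keeps only the elements each branch cares about
        String result = Stream.of("A:one", "B:two", "A:three")
                .collect(Collectors.teeing(
                        Collectors.filtering(TeeingDemo::isTypeA, Collectors.toList()),
                        Collectors.filtering(l -> !isTypeA(l), Collectors.counting()),
                        (matching, nonMatching) -> matching + " / " + nonMatching));
        System.out.println(result);
    }
}
```

This mirrors the Pair-based conditional collector above: the first branch collects matching lines into a list, the second counts the rejects.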

The way I'd deal with this is not to split this up at all, but rather to write

Files.lines(path)
   .map(line -> {
      if (condition(line)) {
        return doThingA(line);
      } else {
        return doThingB(line);
      }
   })...

Details vary depending on exactly what you want to do and how you plan to do it.

Well, you can simply do

Counter counter = new Counter();
Files.lines(path)
    .forEach(line -> {
        if (isTypeA(line)) {
            processTypeA(line);
        }
        else {
            counter.increment();
        }
    });

Not very functional-style, but it does it in a similar way to your example. Of course, if run in parallel, both Counter.increment() and processTypeA() have to be thread-safe.
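The Counter class in the snippet above is hypothetical; a minimal thread-safe version backed by AtomicLong might look like this (isTypeA and the sample data are likewise assumed):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

public class CounterDemo {
    // hypothetical Counter from the answer, made thread-safe with AtomicLong
    static class Counter {
        private final AtomicLong value = new AtomicLong();
        void increment() { value.incrementAndGet(); }
        long get() { return value.get(); }
    }

    static boolean isTypeA(String line) {
        return line.startsWith("A:");
    }

    public static void main(String[] args) {
        Counter counter = new Counter();
        Stream.of("A:x", "B:y", "B:z")
                .parallel()
                .forEach(line -> {
                    if (!isTypeA(line)) counter.increment();  // count rejected lines
                });
        System.out.println(counter.get());
    }
}
```

AtomicLong makes the increment safe from forEach on a parallel stream, which invokes the consumer from multiple threads.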

Here's an approach (which ignores the cautions about forcing conditional processing into a stream) that wraps a predicate and a consumer into a single predicate-with-side-effect:

public static class StreamProc {

    public static <T> Predicate<T> process( Predicate<T> condition, Consumer<T> operation ) {
        Predicate<T> p = t -> { operation.accept(t); return false; };
        return (t) -> condition.test(t) ? p.test(t) : true;
    }

}

Then filter the stream:

someStream
    .filter( StreamProc.process( cond1, op1 ) )
    .filter( StreamProc.process( cond2, op2 ) )
    ...
    .collect( ... )

Elements remaining in the stream have not yet been processed.

For example, a typical filesystem traversal using external iteration looks like

File[] files = dir.listFiles();
for ( File f : files ) {
    if ( f.isDirectory() ) {
        this.processDir( f );
    } else if ( f.isFile() ) {
        this.processFile( f );
    } else {
        this.processErr( f );
    }
}

With streams and internal iteration this becomes

Arrays.stream( dir.listFiles() )
    .filter( StreamProc.process( f -> f.isDirectory(), this::processDir ) )
    .filter( StreamProc.process( f -> f.isFile(), this::processFile ) )
    .forEach( this::processErr );

I would like Stream to implement the process method directly. Then we could have

Arrays.stream( dir.listFiles() )
    .process( f -> f.isDirectory(), this::processDir )
    .process( f -> f.isFile(), this::processFile )
    .forEach( this::processErr );

Thoughts?

It seems that in reality you do want to process each line, but process it differently based on some condition (type).

I think a more or less functional way to implement this would be:

public static void main(String[] args) {
    Arrays.stream(new int[] {1,2,3,4}).map(i -> processor(i).get()).forEach(System.out::println);
}

static Supplier<Integer> processor(int i) {
    return tellType(i) ? () -> processTypeA(i) : () -> processTypeB(i);
}

static boolean tellType(int i) {
    return i % 2 == 0;
}

static int processTypeA(int i) {
    return i * 100;
}

static int processTypeB(int i) {
    return i * 10;
}

@tom

What about this:

Arrays.stream( dir.listFiles() )
    .peek( f -> { if (f.isDirectory()) { processDir(f); } } )
    .peek( f -> { if (f.isFile())      { processFile(f); } } )
    .forEach( this::processErr );
