Using Java streams in parallel with collect(supplier, accumulator, combiner) not giving expected results

I'm trying to find the number of words in a given string. Below is a sequential algorithm for it, which works fine.

public int getWordcount() {

        boolean lastSpace = true;
        int result = 0;

        for(char c : str.toCharArray()){
            if(Character.isWhitespace(c)){
                lastSpace = true;
            }else{
                if(lastSpace){
                    lastSpace = false;
                    ++result;
                }
            }
        }

        return result;

    }

But when I tried to parallelize this with the Stream.collect(supplier, accumulator, combiner) method, I get wordCount = 0. I am using an immutable class (WordCountState) just to maintain the state of the word count.

Code:

import java.util.stream.IntStream;
import java.util.stream.Stream;

public class WordCounter {
    private final String str = "Java8 parallelism  helps    if you know how to use it properly.";

    public int getWordCountInParallel() {
        Stream<Character> charStream = IntStream.range(0, str.length())
                                                .mapToObj(i -> str.charAt(i));

        WordCountState finalState = charStream.parallel()                                             
                                              .collect(WordCountState::new,
                                                        WordCountState::accumulate,
                                                        WordCountState::combine);

        return finalState.getCounter();
    }
}

public class WordCountState {
    private final boolean lastSpace;
    private final int counter;
    private static int numberOfInstances = 0;

    public WordCountState(){
        this.lastSpace = true;
        this.counter = 0;
        //numberOfInstances++;
    }

    public WordCountState(boolean lastSpace, int counter){
        this.lastSpace = lastSpace;
        this.counter = counter;
        //numberOfInstances++;
    }

    //accumulator
    public WordCountState accumulate(Character c) {


        if(Character.isWhitespace(c)){
            return lastSpace ? this : new WordCountState(true, counter);
        }else{
            return lastSpace ? new WordCountState(false, counter + 1) : this;
        }   
    }

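    //combiner
    public WordCountState combine(WordCountState wordCountState) {
        //System.out.println("Returning new obj with count : " + (counter + wordCountState.getCounter()));
        return new WordCountState(this.isLastSpace(),
                                    (counter + wordCountState.getCounter()));
    }

    // getters used above (omitted in the original post)
    public boolean isLastSpace() {
        return lastSpace;
    }

    public int getCounter() {
        return counter;
    }
}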

I've observed the following issues with the above code:

1. The number of WordCountState objects created is greater than the number of characters in the string.
2. The result is always 0.
3. As per the accumulator/consumer documentation, shouldn't the accumulator return void? Even though my accumulator method returns an object, the compiler doesn't complain.

Any clue where I might have gone off track?

UPDATE: Used the solution below:

public int getWordCountInParallel() {
        Stream<Character> charStream = IntStream.range(0, str.length())
                                                .mapToObj(i -> str.charAt(i));


        WordCountState finalState = charStream.parallel()
                                              .reduce(new WordCountState(),
                                                        WordCountState::accumulate,
                                                        WordCountState::combine);

        return finalState.getCounter();
    }

You can always invoke a method and ignore its return value, so it's logical to allow the same when using method references. Therefore, creating a method reference to a non-void method where a consumer is required is no problem, as long as the parameters match.

What you have created with your immutable WordCountState class is a reduction operation, i.e. it would support a use case like
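To make the point about non-void method references concrete, here is a minimal sketch (the class and method names are hypothetical, not from the question). List.add returns boolean, yet it can be used where a BiConsumer is expected; the returned value is silently dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

class VoidCompatibleDemo {
    // Adds one value through a BiConsumer and reports the resulting list size.
    static int addViaConsumer(String value) {
        List<String> list = new ArrayList<>();
        // List.add(E) returns boolean, but BiConsumer.accept returns void:
        // the method reference is accepted and the boolean is discarded.
        BiConsumer<List<String>, String> adder = List::add;
        adder.accept(list, value);
        return list.size();
    }

    public static void main(String[] args) {
        System.out.println(addViaConsumer("x"));
    }
}
```

This is harmless for List.add because the list itself is mutated. It becomes a problem only when, as in the question, the return value is the sole carrier of the new state.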

Stream<Character> charStream = IntStream.range(0, str.length())
                                        .mapToObj(i -> str.charAt(i));

WordCountState finalState = charStream.parallel()
        .map(ch -> new WordCountState().accumulate(ch))
        .reduce(new WordCountState(), WordCountState::combine);

whereas the collect method supports mutable reduction, where a container instance (which may be identical to the result) gets modified.

There is still a logical error in your solution: each WordCountState instance starts by assuming that it is preceded by a space character, without knowing the actual situation, and the combiner makes no attempt to fix this.

A way to fix and simplify this, still using reduction, would be:

public int getWordCountInParallel() {
    return str.codePoints().parallel()
        .mapToObj(WordCountState::new)
        .reduce(WordCountState::new)
        .map(WordCountState::getResult).orElse(0);
}


public class WordCountState {
    private final boolean firstSpace, lastSpace;
    private final int counter;

    public WordCountState(int character){
        firstSpace = lastSpace = Character.isWhitespace(character);
        this.counter = 0;
    }

    public WordCountState(WordCountState a, WordCountState b) {
        this.firstSpace = a.firstSpace;
        this.lastSpace = b.lastSpace;
        this.counter = a.counter + b.counter + (a.lastSpace && !b.firstSpace? 1: 0);
    }
    public int getResult() {
        return counter+(firstSpace? 0: 1);
    }
}

If you are worried about the number of WordCountState instances, note how many Character instances this solution does not create, compared to your initial approach.

However, this task is indeed suitable for mutable reduction, if you rewrite your WordCountState as a mutable result container:

public int getWordCountInParallel() {
    return str.codePoints().parallel()
        .collect(WordCountState::new, WordCountState::accumulate, WordCountState::combine)
        .getResult();
}


public class WordCountState {
    private boolean firstSpace, lastSpace=true, initial=true;
    private int counter;

    public void accumulate(int character) {
        boolean white=Character.isWhitespace(character);
        if(lastSpace && !white) counter++;
        lastSpace=white;
        if(initial) {
            firstSpace=white;
            initial=false;
        }
    }
    public void combine(WordCountState b) {
        if(initial) {
            this.initial=b.initial;
            this.counter=b.counter;
            this.firstSpace=b.firstSpace;
            this.lastSpace=b.lastSpace;
        }
        else if(!b.initial) {
            this.counter += b.counter;
            if(!lastSpace && !b.firstSpace) counter--;
            this.lastSpace = b.lastSpace;
        }
    }
    public int getResult() {
        return counter;
    }
}

Note how consistently using int to represent Unicode characters allows you to use the codePoints() stream of a CharSequence, which is not only simpler but also handles characters outside the Basic Multilingual Plane and is potentially more efficient, as it doesn't need boxing to Character instances.

In stream().collect(supplier, accumulator, combiner), both the accumulator and the combiner are BiConsumers, so they do return void. The problem is that this:
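As a small illustration of the difference between char units and code points (the class and method names are hypothetical), consider a string containing U+1D11E, which lies outside the Basic Multilingual Plane and is therefore stored as a surrogate pair:

```java
class CodePointDemo {
    // Number of UTF-16 char units, as seen by chars() or charAt(i).
    static long charUnits(String s) {
        return s.chars().count();
    }

    // Number of Unicode code points, as seen by codePoints().
    static long codePointCount(String s) {
        return s.codePoints().count();
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E, one character, two char units
        System.out.println(charUnits(clef));      // 2
        System.out.println(codePointCount(clef)); // 1
    }
}
```

Both streams are IntStreams, so neither boxes its elements unless you call mapToObj or boxed() yourself.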

  collect(WordCountState::new,
          WordCountState::accumulate,
          WordCountState::combine)

in your case actually means (shown for the accumulator only, but the same goes for the combiner):

     (wordCounter, character) -> {
              // the returned state is assigned to a local and then thrown away
              WordCountState state = wordCounter.accumulate(character);
              return;
     }

Admittedly, this is not trivial to see. Let's say we have two methods:

public void accumulate(Character c) {
    if (!Character.isWhitespace(c)) {
        counter++;
    }
}

public WordCountState accumulate2(Character c) {
    if (Character.isWhitespace(c)) {
        return lastSpace ? this : new WordCountState(true, counter);
    } else {
        return lastSpace ? new WordCountState(false, counter + 1) : this;
    }
}

For both of them the code below compiles just fine. This works for a method reference (and for an expression lambda whose body is just the method call), but not for a block lambda that explicitly returns the value.

BiConsumer<WordCountState, Character> cons = WordCountState::accumulate;

BiConsumer<WordCountState, Character> cons2 = WordCountState::accumulate2;

You can picture it slightly differently, for example via a class that implements BiConsumer:

 BiConsumer<WordCountState, Character> clazz = new BiConsumer<WordCountState, Character>() {
        @Override
        public void accept(WordCountState state, Character character) {
            WordCountState newState = state.accumulate2(character);
            return;
        }
    };

As such, your combine and accumulate methods need to change to:

public void combine(WordCountState wordCountState) {
    counter = counter + wordCountState.getCounter();
}


public void accumulate(Character c) {
    if (!Character.isWhitespace(c)) {
        counter++;
    }
}

First of all, would it not be easier to just use something like input.split("\\s+").length to get the word count?

In case this is an exercise in streams and collectors, let's discuss your implementation. You already pointed out the biggest mistake yourself: your accumulator and combiner should not return new instances. The signature of collect tells you that it expects BiConsumers, which do not return anything. Because you create a new object in the accumulator, you never increase the count of the WordCountState objects your collector actually uses. And by creating a new object in the combiner, you discard any progress you would have made. This is also why you create more objects than there are characters in your input: one per character, and then some for the return values.

See this adapted implementation:
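One caveat with the split approach: leading whitespace produces an empty first token, and an empty input still yields one (empty) token. A minimal sketch that guards against both, assuming that trimming the input first is acceptable (the helper name is hypothetical):

```java
class SplitWordCount {
    // Counts whitespace-separated words; trims first so that leading
    // whitespace does not produce an empty first token.
    static int countWords(String input) {
        String trimmed = input.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println("  a b".split("\\s+").length); // 3, because of the empty first token
        System.out.println(countWords("  a b"));          // 2
    }
}
```

Note that trim() and the \s character class do not cover exactly the same characters as Character.isWhitespace, so results can differ from the loop in the question for unusual whitespace.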
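The effect of the discarded return values can be reproduced with a much smaller class. The following sketch uses a hypothetical immutable Counter (not the WordCountState from the question) that simply counts elements: with collect the container created by the supplier is never modified, while reduce threads the returned instances through.

```java
import java.util.stream.Stream;

class DiscardedResultDemo {
    // Hypothetical immutable counter, written in the same style as the question's class.
    static final class Counter {
        final int value;
        Counter() { this(0); }
        Counter(int value) { this.value = value; }
        Counter plusOne(Character c) { return new Counter(value + 1); }
        Counter merge(Counter other) { return new Counter(value + other.value); }
    }

    static Stream<Character> chars(String s) {
        return s.chars().mapToObj(i -> (char) i);
    }

    // collect ignores what plusOne and merge return, so the fresh Counter stays at 0.
    static int viaCollect(String s) {
        Counter result = chars(s).collect(Counter::new, Counter::plusOne, Counter::merge);
        return result.value;
    }

    // reduce uses the returned instances, so the count is carried along.
    static int viaReduce(String s) {
        Counter result = chars(s).reduce(new Counter(), Counter::plusOne, Counter::merge);
        return result.value;
    }

    public static void main(String[] args) {
        System.out.println(viaCollect("abc")); // 0
        System.out.println(viaReduce("abc"));  // 3
    }
}
```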

public static class WordCountState
{
    private boolean lastSpace = true;
    private int     counter   = 0;

    public void accumulate(Character character)
    {
        if (!Character.isWhitespace(character))
        {
            if (lastSpace)
            {
                counter++;
            }
            lastSpace = false;
        }
        else
        {
            lastSpace = true;
        }
    }

    public void combine(WordCountState wordCountState)
    {
        counter += wordCountState.counter;
    }
}

Here, we do not create new objects in every step, but change the state of the ones we have. I think you tried to create new objects because your ternary operators forced you to return something and/or you couldn't change the instance fields as they are final. They do not need to be final, though, and you can easily change them.

Running this adapted implementation sequentially now works fine, as we look at the chars one by one and end up with 11 words.

In parallel, though, it fails. It seems to create a new WordCountState for every char, but does not count all of them, and ends up at 29 (at least for me). This shows a basic flaw in your algorithm: splitting at every character doesn't work in parallel. Imagine the input "abc abc", which should result in 2. If you do it in parallel and do not specify how to split the input, you might end up with these chunks: "ab", "c a", "bc", which would add up to 4.

The problem is that by parallelizing between characters (i.e. in the middle of words), you make your separate WordCountState instances dependent on each other (because each would need to know which one comes before it and whether that one ended with a whitespace char). This defeats the parallelism and results in errors.

Aside from all that, it might be easier to implement the Collector interface instead of providing the three methods:
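The miscount can be reproduced without any threads by simulating the chunks by hand. This is only a sketch of the effect; the chunk boundaries are an assumption, since the real split points are chosen by the stream implementation, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class ChunkCountDemo {
    // Counts word starts in one chunk, assuming (possibly wrongly)
    // that the chunk is preceded by whitespace.
    static int countAssumingLeadingSpace(String chunk) {
        boolean lastSpace = true;
        int result = 0;
        for (char c : chunk.toCharArray()) {
            if (Character.isWhitespace(c)) {
                lastSpace = true;
            } else if (lastSpace) {
                lastSpace = false;
                result++;
            }
        }
        return result;
    }

    // Adds the per-chunk counts, like a combiner that only sums counters.
    static int naiveSum(List<String> chunks) {
        return chunks.stream().mapToInt(ChunkCountDemo::countAssumingLeadingSpace).sum();
    }

    public static void main(String[] args) {
        System.out.println(countAssumingLeadingSpace("abc abc"));      // 2
        System.out.println(naiveSum(Arrays.asList("ab", "c a", "bc"))); // 4
    }
}
```

Every chunk that starts in the middle of a word contributes one word too many, which is why the combiner needs to know how each side of a boundary begins and ends.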

public static class WordCountCollector
    implements Collector<Character, SimpleEntry<AtomicInteger, Boolean>, Integer>
{
    @Override
    public Supplier<SimpleEntry<AtomicInteger, Boolean>> supplier()
    {
        return () -> new SimpleEntry<>(new AtomicInteger(0), true);
    }

    @Override
    public BiConsumer<SimpleEntry<AtomicInteger, Boolean>, Character> accumulator()
    {
        return (count, character) -> {
            if (!Character.isWhitespace(character))
            {
                if (count.getValue())
                {
                    String before = count.getKey().get() + " -> ";
                    count.getKey().incrementAndGet();
                    System.out.println(before + count.getKey().get());
                }
                count.setValue(false);
            }
            else
            {
                count.setValue(true);
            }
        };
    }

    @Override
    public BinaryOperator<SimpleEntry<AtomicInteger, Boolean>> combiner()
    {
        return (c1, c2) -> new SimpleEntry<>(new AtomicInteger(c1.getKey().get() + c2.getKey().get()), false);
    }

    @Override
    public Function<SimpleEntry<AtomicInteger, Boolean>, Integer> finisher()
    {
        return count -> count.getKey().get();
    }

    @Override
    public Set<java.util.stream.Collector.Characteristics> characteristics()
    {
        return new HashSet<>(Arrays.asList(Characteristics.CONCURRENT, Characteristics.UNORDERED));
    }
}

We use a pair (SimpleEntry) to keep the count and the knowledge about the last space. This way, we do not need to implement the state in the collector itself or write a parameter object for it. You can use this collector like this:

return charStream.parallel().collect(new WordCountCollector());

This collector parallelizes better than the initial implementation, but the results still vary (mostly between 14 and 16) because of the mentioned weaknesses in your approach.
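For comparison, here is a sketch of a collector that should give stable results in parallel. It is not the collector above: it is built with Collector.of, declares no characteristics (as far as I understand, CONCURRENT and UNORDERED are not appropriate for an order-dependent, non-thread-safe container), and its state records whether a chunk starts and ends with whitespace so the combiner can correct words that span a boundary. All names are hypothetical.

```java
import java.util.stream.Collector;

class BoundaryAwareWordCount {
    // Mutable container: remembers how its chunk starts and ends.
    static final class State {
        boolean empty = true, firstSpace, lastSpace = true;
        int counter;

        void accumulate(int codePoint) {
            boolean white = Character.isWhitespace(codePoint);
            if (lastSpace && !white) counter++;
            lastSpace = white;
            if (empty) {
                firstSpace = white;
                empty = false;
            }
        }

        State combine(State right) {
            if (right.empty) return this;
            if (empty) return right;
            counter += right.counter;
            // A word spanning the boundary was counted on both sides.
            if (!lastSpace && !right.firstSpace) counter--;
            lastSpace = right.lastSpace;
            return this;
        }
    }

    static Collector<Integer, State, Integer> wordCount() {
        return Collector.of(State::new, State::accumulate, State::combine, s -> s.counter);
    }

    static int count(String text) {
        return text.codePoints().parallel().boxed().collect(wordCount());
    }

    public static void main(String[] args) {
        System.out.println(count("Java8 parallelism  helps    if you know how to use it properly."));
    }
}
```

The boxed() call is only there because Collector works on object streams; the three-argument collect on IntStream avoids the boxing if that matters.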
