Why does list.size change when running a Java parallel stream?

Question

Consider the following code:

static void statefullParallelLambdaSet() {
    Set<Integer> s = new HashSet<>(
        Arrays.asList(1, 2, 3, 4, 5, 6)
    );
    
    List<Integer> list = new ArrayList<>();
    int sum = s.parallelStream().mapToInt(e -> {    // pipeline start
        if (list.size() <= 3) {     // list.size() changes while the pipeline operation is executing.
            list.add(e);            // mapToInt's lambda expression depends on this value, so it's stateful.
            return e;
        }
        else return 0;
    }).sum();   // terminal operation

    System.out.println(sum);
}

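The comment in the code above says that list.size() changes while the pipeline operation is executing, but I don't understand why.

Since the stream runs in parallel, list.add(e) is executed by several threads at once. Is it correct to assume that this is why the value changes on every run?

And the reason the value can also change when the stream runs sequentially is that the source is a Set, which has no order, so the elements that get picked differ from run to run...

Am I right?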

Answer 1

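This happens because of a so-called race condition. Even on a CPU with many hardware threads, your application is not the only process running. A core can evaluate an instruction for one of your threads, then be switched away to do some work for the operating system, and by the time it comes back another thread of your application, running on a core that was not taken away from its job, has already moved past that point.

You can read about race conditions in the following reference: https://link.springer.com/referenceworkentry/10.1007/978-0-387-09766-4_36

What you should do is guard the memory you are mutating with a lock. In Java, have a look at java.util.concurrent.locks: https://www.baeldung.com/java-concurrent-locks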
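As a rough sketch of what that locking could look like for the code in the question: the class name LockedSideEffectDemo, the Result holder and the method sumWithLock are made up for illustration and are not part of the original question. The check and the add are performed under one ReentrantLock, so no two threads can interleave between them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

class LockedSideEffectDemo {

    // Hypothetical holder so the caller can inspect both the sum and the list.
    static final class Result {
        final int sum;
        final List<Integer> list;

        Result(int sum, List<Integer> list) {
            this.sum = sum;
            this.list = list;
        }
    }

    static Result sumWithLock(Set<Integer> source) {
        List<Integer> list = new ArrayList<>();
        ReentrantLock lock = new ReentrantLock();

        int sum = source.parallelStream().mapToInt(e -> {
            lock.lock();                 // only one thread at a time past this point
            try {
                if (list.size() <= 3) {  // check and add happen atomically under the lock
                    list.add(e);
                    return e;
                }
                return 0;
            } finally {
                lock.unlock();           // always release, even if the body throws
            }
        }).sum();

        return new Result(sum, list);
    }

    public static void main(String[] args) {
        Result r = sumWithLock(new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6)));
        // Which four elements are kept can still vary between runs,
        // but the sum always matches the elements in the list.
        System.out.println(r.list + " -> " + r.sum);
    }
}
```

With the lock in place the list always ends up with exactly four elements and the sum is consistent with them, but the mapper is still stateful and every thread has to wait for the lock.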

Answer 2

Note that the problem itself is somewhat contrived, since parallelizing this task is unlikely to bring a significant performance gain.

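Explanation of the problem

Your code accumulates the result through side effects, which the Stream API documentation discourages.

You have run into the very first bullet point listed in that documentation:

... there are no guarantees as to:

the visibility of those side-effects to other threads;

ArrayList is not a thread-safe collection, so there is no guarantee that every thread observes the same state of the list.

Also, note that the map() operation (in all its flavors) is not intended for performing side effects. According to the documentation, its function should be stateless:

mapper - a non-interfering, stateless function to apply to each element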
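To illustrate what "stateless" means here, the following minimal sketch (the class and method names are invented for this example) uses a mapper whose result depends only on its argument. Such a pipeline gives the same answer whether it runs sequentially or in parallel.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class StatelessMapperDemo {

    // The mapper reads nothing but its own argument, so it is stateless.
    static int doubledSum(Set<Integer> source, boolean parallel) {
        return (parallel ? source.parallelStream() : source.stream())
            .mapToInt(e -> e * 2)
            .sum();
    }

    public static void main(String[] args) {
        Set<Integer> s = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6));
        System.out.println(doubledSum(s, false)); // 42
        System.out.println(doubledSum(s, true));  // 42
    }
}
```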

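In this case, the correct way to take previously processed stream elements into account is to define a Collector.

For that, we need to define a mutable container that holds the list.

In short, a Collector can be implemented as concurrent (optimized for a multithreaded environment, so that all threads update the same mutable container) or as non-concurrent (each thread creates and populates its own instance of the mutable container, and the results produced by the threads are then merged).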
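One way to see this distinction in the JDK itself is to inspect the characteristics that built-in collectors report. The sketch below uses a made-up class name; Collectors.toList() and Collectors.toConcurrentMap() are standard JDK collectors, chosen here only as examples of a non-concurrent and a concurrent collector.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collector;
import java.util.stream.Collectors;

class CollectorCharacteristicsDemo {

    // A non-concurrent collector: each thread fills its own list, lists are merged afterwards.
    static Set<Collector.Characteristics> listCollector() {
        return Collectors.toList().characteristics();
    }

    // A concurrent collector: all threads write into one shared ConcurrentMap.
    static Set<Collector.Characteristics> concurrentMapCollector() {
        Collector<Integer, ?, ConcurrentMap<Integer, Integer>> collector =
            Collectors.toConcurrentMap(i -> i, i -> i);
        return collector.characteristics();
    }

    public static void main(String[] args) {
        System.out.println("toList:          " + listCollector());
        System.out.println("toConcurrentMap: " + concurrentMapCollector());
    }
}
```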

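To implement a concurrent collector, we need to provide a thread-safe mutable container and specify the characteristic CONCURRENT. If you look at the implementations of the List interface, the only thread-safe options the JDK provides are CopyOnWriteArrayList and the legacy Vector.

CopyOnWriteArrayList would be a terrible choice, because under the hood it copies its whole backing array every time an element is added, which is a recipe for an OutOfMemoryError. This collection is not suited to frequent updates.

If we use a synchronized List instead, it will not buy us anything in terms of performance, because threads cannot operate on the list at the same time. While one thread is adding an element, the others are blocked. In fact, it will be slower than processing the data sequentially, because synchronization has a cost.

For that reason, the locking suggested in the other answer only lets you obtain a correct result; you will not benefit from parallel execution.

What we can do instead is create a non-concurrent collector based on a plain ArrayList, that is, a collector that uses a non-thread-safe container. It can still be used with a parallel stream: each thread works on its own separate container, so there is no locking and there are no concurrency-related issues.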
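The idea of per-thread containers can be seen in isolation with the three-argument Stream.collect(supplier, accumulator, combiner). This is only a small sketch with an invented class name, separate from the custom collector discussed in this answer: every worker gets its own ArrayList from the supplier, and the partial lists are combined with addAll.

```java
import java.util.ArrayList;
import java.util.List;

class PerThreadContainersDemo {

    static ArrayList<Integer> collectToList(List<Integer> source) {
        return source.parallelStream().collect(
            ArrayList::new,     // supplier: a fresh, thread-confined container per worker
            ArrayList::add,     // accumulator: only ever touches the worker's own container
            ArrayList::addAll   // combiner: merges two finished containers
        );
    }

    public static void main(String[] args) {
        // The source is an ordered List, so the merged result keeps the encounter order.
        System.out.println(collectToList(List.of(1, 2, 3, 4, 5, 6))); // [1, 2, 3, 4, 5, 6]
    }
}
```

No lock is involved, yet the plain ArrayList is never touched by two threads at the same time.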

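Non-concurrent collector

First, we need to define a custom accumulation type that encapsulates the ArrayList and the sum of the consumed elements.

To create the collector, we use the static method Collector.of().

Collector: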

public static Collector<Integer, ?, IntSumContainer> toIntSumContainer(int limit) {
    
    return Collector.of(
        () -> new IntSumContainer(limit),
        IntSumContainer::accept,
        IntSumContainer::merge
    );
}

Custom accumulation type:

public class IntSumContainer implements IntConsumer {
    private int sum;
    private List<Integer> list = new ArrayList<>();
    private final int limit;

    public IntSumContainer(int limit) {
        this.limit = limit;
    }

    @Override
    public void accept(int value) {
        if (list.size() < limit) {
            list.add(value);
            sum += value;
        }
    }
    
    public IntSumContainer merge(IntSumContainer other) {
        // Both containers are already complete and no other thread touches them here,
        // so performing the side effect via forEach is safe in this case.
        other.list.stream().limit(limit - list.size()).forEach(this::accept);
        return this;
    }
    
    public List<Integer> getList() {
        return list;
    }
    
    public int getSum() {
        return sum;
    }
}

Usage example:

List<Integer> source = List.of(1, 2, 3, 4, 5, 6);

IntSumContainer result = source.parallelStream()
    .collect(toIntSumContainer(3));

List<Integer> list = result.getList();
int sum = result.getSum();

System.out.println(list);
System.out.println(sum);

Output:

[1, 2, 3]
6

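Concurrent collector

Since you use a HashSet as the stream source, which produces an unordered stream, it presumably does not matter which elements end up in the resulting collection and contribute to the sum. And since you start from a Set, you might as well get a Set as the result.

In that case, we can use the concurrent hash set that the JDK provides in the form of a key view of ConcurrentHashMap, obtained through the static method ConcurrentHashMap.newKeySet(). ConcurrentHashMap does not lock the whole table: retrievals do not block, and updates contend only on individual bins.

To accumulate the sum concurrently we can use LongAdder, which performs better than AtomicLong when updates are frequent and contended (which is the case here).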
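Before they are combined into a container class, the two building blocks can be tried on their own. The following sketch (class and method names are made up for illustration) updates a LongAdder and a ConcurrentHashMap.newKeySet() set from a parallel IntStream. Side effects in forEach are acceptable here only because both targets are thread-safe.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.IntStream;

class ConcurrentBuildingBlocksDemo {

    // Sums 1..n from many threads; LongAdder spreads contended updates over internal cells.
    static long parallelSum(int n) {
        LongAdder adder = new LongAdder();
        IntStream.rangeClosed(1, n).parallel().forEach(adder::add);
        return adder.sum();
    }

    // Collects the distinct remainders of 1..n from many threads into a concurrent set.
    static Set<Integer> parallelRemainders(int n, int modulus) {
        Set<Integer> set = ConcurrentHashMap.newKeySet();
        IntStream.rangeClosed(1, n).parallel().forEach(i -> set.add(i % modulus));
        return set;
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(1000));                   // 500500
        System.out.println(parallelRemainders(1000, 10).size()); // 10
    }
}
```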

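As in the previous example, the custom accumulation type encapsulates a Set and the sum of the consumed elements.

When defining the collector, to make it concurrent we need to specify the characteristic CONCURRENT. UNORDERED is also appropriate, since we have stated that order does not matter.

Collector: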

public static Collector<Integer, ?, ConcurrentIntSumContainer> toConcurrentIntSumContainer(int limit) {
    
    return Collector.of(
        () -> new ConcurrentIntSumContainer(limit),
        ConcurrentIntSumContainer::accept,
        (left, right) -> { throw new AssertionError("merge function is not expected to be called by a concurrent collector"); },
        Collector.Characteristics.UNORDERED, Collector.Characteristics.CONCURRENT
    );
}

Custom accumulation type:

public class ConcurrentIntSumContainer implements IntConsumer {
    private LongAdder sum = new LongAdder();
    private Set<Integer> set = ConcurrentHashMap.newKeySet();
    private final int limit;
    
    public ConcurrentIntSumContainer(int limit) {
        this.limit = limit;
    }
    
    @Override
    public void accept(int value) {
        if (set.size() < limit && set.add(value)) {
            sum.add(value);
        }
    }
    
    public Set<Integer> getSet() {
        return new HashSet<>(set); // because a general purpose set is faster than concurrent set
    }
    
    public long getSum() {
        return sum.sum();
    }
}
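One caveat about accept() above: the size check and the add are two separate steps, so under contention several threads may pass the check at the same moment and the set can end up with more than limit elements. The following is a hedged sketch of one possible way to enforce the bound, not part of the original answer: the class BoundedConcurrentIntSumContainer, the freeSlots counter and the fill helper are all hypothetical names. A slot is reserved atomically with compareAndSet before the element is added, and given back if the element turns out to be a duplicate.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

class BoundedConcurrentIntSumContainer implements IntConsumer {
    private final LongAdder sum = new LongAdder();
    private final Set<Integer> set = ConcurrentHashMap.newKeySet();
    private final AtomicInteger freeSlots; // hypothetical: number of elements still allowed in

    BoundedConcurrentIntSumContainer(int limit) {
        this.freeSlots = new AtomicInteger(limit);
    }

    @Override
    public void accept(int value) {
        if (!reserveSlot()) {
            return;                      // limit reached, drop the element
        }
        if (set.add(value)) {
            sum.add(value);
        } else {
            freeSlots.incrementAndGet(); // duplicate: return the reserved slot
        }
    }

    // Atomically decrements freeSlots unless it is already exhausted.
    private boolean reserveSlot() {
        int current;
        do {
            current = freeSlots.get();
            if (current <= 0) {
                return false;
            }
        } while (!freeSlots.compareAndSet(current, current - 1));
        return true;
    }

    Set<Integer> getSet() {
        return set;
    }

    long getSum() {
        return sum.sum();
    }

    // Helper for the demo: feeds 1..n into a fresh container from a parallel stream.
    static BoundedConcurrentIntSumContainer fill(int n, int limit) {
        BoundedConcurrentIntSumContainer container = new BoundedConcurrentIntSumContainer(limit);
        IntStream.rangeClosed(1, n).parallel().forEach(container);
        return container;
    }

    public static void main(String[] args) {
        BoundedConcurrentIntSumContainer c = fill(1000, 3);
        // Which three elements are kept varies between runs; the size never exceeds the limit.
        System.out.println(c.getSet() + " -> " + c.getSum());
    }
}
```

With this variant the set never grows beyond the limit. When duplicates are present it may occasionally hold fewer elements than the limit, because a slot can be temporarily reserved by a thread that then finds its element already in the set.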

Usage example:

List<Integer> source = List.of(1, 2, 3, 4, 5, 6);

ConcurrentIntSumContainer result1 = source.parallelStream()
    .collect(toConcurrentIntSumContainer(3));
    
Set<Integer> set = result1.getSet();
long sum = result1.getSum();
    
System.out.println(set);
System.out.println(sum);

Output (one possible result; it can differ between runs):

[1, 4, 5]
10
