Java 8 Stream with batch processing

Question

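I have a large file that contains a list of items.

I would like to create a batch of items and make an HTTP request with that batch (all of the items are needed as parameters in the HTTP request). I can do this very easily with a for loop, but as a Java 8 lover I want to try writing it with Java 8's Stream framework (and reap the benefits of lazy processing).

Example: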

List<String> batch = new ArrayList<>(BATCH_SIZE);
for (int i = 0; i < data.size(); i++) {
  batch.add(data.get(i));
  if (batch.size() == BATCH_SIZE) {
    process(batch);
    batch.clear(); // start the next batch; assumes process() does not keep a reference to the list
  }
}

if (batch.size() > 0) process(batch);

I want to do something along the lines of lazyFileStream.group(500).map(processBatch).collect(toList())

What would be the best way to do this?

Answer 1

For completeness, here is a Guava solution.

Iterators.partition(stream.iterator(), batchSize).forEachRemaining(this::process);

In the question the collection is already available, so a stream is not needed and it can be written as:

Iterables.partition(data, batchSize).forEach(this::process);

Answer 2

A pure Java 8 implementation is also possible:

int BATCH = 500;
IntStream.range(0, (data.size()+BATCH-1)/BATCH)
         .mapToObj(i -> data.subList(i*BATCH, Math.min(data.size(), (i+1)*BATCH)))
         .forEach(batch -> process(batch));

Note that, unlike jOOλ, this works well in parallel (provided that data is a random-access list).
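To make the index arithmetic concrete, here is a minimal self-contained sketch of the same subList technique. The class name SubListBatches and the helper batchesOf are illustrative names that do not appear in the answer, and the batches are collected into a list instead of being passed to process.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class SubListBatches {

    // Hypothetical helper: returns views over `data`, each at most `batchSize` long.
    static <T> List<List<T>> batchesOf(List<T> data, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        // Ceiling division gives the number of batches, including a short last one.
        int batchCount = (data.size() + batchSize - 1) / batchSize;
        return IntStream.range(0, batchCount)
                .mapToObj(i -> data.subList(i * batchSize,
                        Math.min(data.size(), (i + 1) * batchSize)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        System.out.println(batchesOf(data, 3)); // prints [[1, 2, 3], [4, 5, 6], [7]]
    }
}
```

Because subList returns views rather than copies, this stays cheap for an ArrayList, but the batches become invalid if the backing list is structurally modified afterwards.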

Answer 3

Pure Java 8 solution:

We can create a custom collector to do this elegantly. It takes a batch size and a Consumer that processes each batch:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.function.*;
import java.util.stream.Collector;

import static java.util.Objects.requireNonNull;


/**
 * Collects elements in the stream and calls the supplied batch processor
 * after the configured batch size is reached.
 *
 * In case of a parallel stream, the batch processor may be called with
 * elements less than the batch size.
 *
 * The elements are not kept in memory, and the final result will be an
 * empty list.
 *
 * @param <T> Type of the elements being collected
 */
class BatchCollector<T> implements Collector<T, List<T>, List<T>> {

    private final int batchSize;
    private final Consumer<List<T>> batchProcessor;


    /**
     * Constructs the batch collector
     *
     * @param batchSize the batch size after which the batchProcessor should be called
     * @param batchProcessor the batch processor which accepts batches of records to process
     */
    BatchCollector(int batchSize, Consumer<List<T>> batchProcessor) {
        batchProcessor = requireNonNull(batchProcessor);

        this.batchSize = batchSize;
        this.batchProcessor = batchProcessor;
    }

    public Supplier<List<T>> supplier() {
        return ArrayList::new;
    }

    public BiConsumer<List<T>, T> accumulator() {
        return (ts, t) -> {
            ts.add(t);
            if (ts.size() >= batchSize) {
                batchProcessor.accept(ts);
                ts.clear();
            }
        };
    }

    public BinaryOperator<List<T>> combiner() {
        return (ts, ots) -> {
            // process each parallel list without checking for batch size
            // avoids adding all elements of one to another
            // can be modified if a strict batching mode is required
            if (!ts.isEmpty()) batchProcessor.accept(ts);
            if (!ots.isEmpty()) batchProcessor.accept(ots);
            return Collections.emptyList();
        };
    }

    public Function<List<T>, List<T>> finisher() {
        return ts -> {
            // flush the final partial batch; skip the call when nothing is left
            if (!ts.isEmpty()) batchProcessor.accept(ts);
            return Collections.emptyList();
        };
    }

    public Set<Characteristics> characteristics() {
        return Collections.emptySet();
    }
}

(Optional) Then create a helper utility class:

import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collector;

public class StreamUtils {

    /**
     * Creates a new batch collector
     * @param batchSize the batch size after which the batchProcessor should be called
     * @param batchProcessor the batch processor which accepts batches of records to process
     * @param <T> the type of elements being processed
     * @return a batch collector instance
     */
    public static <T> Collector<T, List<T>, List<T>> batchCollector(int batchSize, Consumer<List<T>> batchProcessor) {
        return new BatchCollector<T>(batchSize, batchProcessor);
    }
}

Example usage:

List<Integer> input = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
List<Integer> output = new ArrayList<>();

int batchSize = 3;
Consumer<List<Integer>> batchProcessor = xs -> output.addAll(xs);

input.stream()
     .collect(StreamUtils.batchCollector(batchSize, batchProcessor));

I have posted my code on GitHub as well, if anyone wants to take a look:

Link to GitHub
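The comment in combiner() mentions that a strict batching mode is possible. As a rough sketch of what that could look like (not the author's code), the variant below uses Collector.of, merges partial buffers in the combiner so that only the very last batch can be short, and hands each batch over as a copy. The names BatchCollectorSketch and batching are assumptions made for this illustration, and the result type is Void instead of an empty list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collector;

final class BatchCollectorSketch {

    // Hypothetical factory; a strict-batching variation on the collector above.
    static <T> Collector<T, List<T>, Void> batching(int batchSize, Consumer<List<T>> processor) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        return Collector.<T, List<T>, Void>of(
                ArrayList::new,
                (buffer, item) -> {
                    buffer.add(item);
                    if (buffer.size() >= batchSize) {
                        processor.accept(new ArrayList<>(buffer)); // hand over a copy
                        buffer.clear();
                    }
                },
                (left, right) -> {
                    // Both buffers hold fewer than batchSize items,
                    // so merging them can complete at most one batch.
                    left.addAll(right);
                    if (left.size() >= batchSize) {
                        processor.accept(new ArrayList<>(left.subList(0, batchSize)));
                        left.subList(0, batchSize).clear();
                    }
                    return left;
                },
                buffer -> {
                    if (!buffer.isEmpty()) {
                        processor.accept(new ArrayList<>(buffer)); // final short batch
                    }
                    return null;
                });
    }

    public static void main(String[] args) {
        List<List<Integer>> seen = new ArrayList<>();
        Consumer<List<Integer>> recorder = seen::add;
        Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).stream().collect(batching(3, recorder));
        System.out.println(seen); // prints [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
    }
}
```

On a parallel stream the processor may be invoked from several threads at once, so it would need to be thread-safe; the recorder in main is only suitable for the sequential case shown.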

Answer 4

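I wrote a custom Spliterator for scenarios like this. It fills lists of a given size from the input stream. The advantage of this approach is that it processes lazily and works together with the other stream functions.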

public static <T> Stream<List<T>> batches(Stream<T> stream, int batchSize) {
    return batchSize <= 0
        ? Stream.of(stream.collect(Collectors.toList()))
        : StreamSupport.stream(new BatchSpliterator<>(stream.spliterator(), batchSize), stream.isParallel());
}

private static class BatchSpliterator<E> implements Spliterator<List<E>> {

    private final Spliterator<E> base;
    private final int batchSize;

    public BatchSpliterator(Spliterator<E> base, int batchSize) {
        this.base = base;
        this.batchSize = batchSize;
    }

    @Override
    public boolean tryAdvance(Consumer<? super List<E>> action) {
        final List<E> batch = new ArrayList<>(batchSize);
        for (int i=0; i < batchSize && base.tryAdvance(batch::add); i++)
            ;
        if (batch.isEmpty())
            return false;
        action.accept(batch);
        return true;
    }

    @Override
    public Spliterator<List<E>> trySplit() {
        if (base.estimateSize() <= batchSize)
            return null;
        final Spliterator<E> splitBase = this.base.trySplit();
        return splitBase == null ? null
                : new BatchSpliterator<>(splitBase, batchSize);
    }

    @Override
    public long estimateSize() {
        final double baseSize = base.estimateSize();
        return baseSize == 0 ? 0
                : (long) Math.ceil(baseSize / (double) batchSize);
    }

    @Override
    public int characteristics() {
        return base.characteristics();
    }

}
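To illustrate the lazy behaviour, here is a shorter, sequential-only sketch of the same idea built on Spliterators.AbstractSpliterator. It is a simplification, not the answer's implementation: it does not delegate trySplit to the base spliterator and always reports only ORDERED. The class name LazyBatches is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class LazyBatches {

    // Hypothetical helper: wraps a stream into a sequential stream of batches.
    static <T> Stream<List<T>> batches(Stream<T> source, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        Spliterator<T> base = source.spliterator();
        Spliterator<List<T>> batching =
                new Spliterators.AbstractSpliterator<List<T>>(Long.MAX_VALUE, Spliterator.ORDERED) {
                    @Override
                    public boolean tryAdvance(Consumer<? super List<T>> action) {
                        List<T> batch = new ArrayList<>(batchSize);
                        while (batch.size() < batchSize && base.tryAdvance(batch::add)) {
                            // keep pulling until the batch is full or the source ends
                        }
                        if (batch.isEmpty()) {
                            return false;
                        }
                        action.accept(batch);
                        return true;
                    }
                };
        // Closing the batched stream also closes the source stream.
        return StreamSupport.stream(batching, false).onClose(source::close);
    }

    public static void main(String[] args) {
        // The source is infinite, so this only terminates because batches are built on demand.
        List<List<Integer>> firstTwo = batches(Stream.iterate(1, i -> i + 1), 3)
                .limit(2)
                .collect(Collectors.toList());
        System.out.println(firstTwo); // prints [[1, 2, 3], [4, 5, 6]]
    }
}
```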

Answer 5

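We had a similar problem to solve. We wanted to take a stream that was larger than system memory (iterating through all objects in a database) and randomise the order as well as possible; we decided that buffering 10,000 items and shuffling them would be good enough.

The target was a function that takes in a stream.

Among the solutions proposed here, there seem to be several options:

Use various non-Java 8 additional libraries
Start with something that is not a stream, e.g. a random-access list
Have a stream that can be split easily in a spliterator

Our first instinct was to use a custom collector, but that meant dropping out of streaming. The custom collector solution above is very good and we nearly used it.

Here is a solution that cheats by using the fact that a Stream can give you an Iterator, which you can use as an escape hatch to do something extra that streams do not support. The Iterator is then converted back into a stream using another bit of Java 8 StreamSupport sorcery.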

/**
 * An iterator which returns batches of items taken from another iterator
 */
public class BatchingIterator<T> implements Iterator<List<T>> {
    /**
     * Given a stream, convert it to a stream of batches no greater than the
     * batchSize.
     * @param originalStream to convert
     * @param batchSize maximum size of a batch
     * @param <T> type of items in the stream
     * @return a stream of batches taken sequentially from the original stream
     */
    public static <T> Stream<List<T>> batchedStreamOf(Stream<T> originalStream, int batchSize) {
        return asStream(new BatchingIterator<>(originalStream.iterator(), batchSize));
    }

    private static <T> Stream<T> asStream(Iterator<T> iterator) {
        return StreamSupport.stream(
            Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED),
            false);
    }

    private int batchSize;
    private List<T> currentBatch;
    private Iterator<T> sourceIterator;

    public BatchingIterator(Iterator<T> sourceIterator, int batchSize) {
        this.batchSize = batchSize;
        this.sourceIterator = sourceIterator;
    }

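    @Override
    public boolean hasNext() {
        // Build a new batch only after the previous one was handed out,
        // so repeated hasNext() calls do not skip elements.
        if (currentBatch == null) {
            prepareNextBatch();
        }
        return !currentBatch.isEmpty();
    }

    @Override
    public List<T> next() {
        if (!hasNext()) {
            throw new java.util.NoSuchElementException();
        }
        List<T> batch = currentBatch;
        currentBatch = null;
        return batch;
    }

    private void prepareNextBatch() {
        currentBatch = new ArrayList<>(batchSize);
        while (sourceIterator.hasNext() && currentBatch.size() < batchSize) {
            currentBatch.add(sourceIterator.next());
        }
    }
}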

A simple example of using it looks like this:

@Test
public void getsBatches() {
    BatchingIterator.batchedStreamOf(Stream.of("A","B","C","D","E","F"), 3)
        .forEach(System.out::println);
}

The above prints:

[A, B, C]
[D, E, F]

For our use case, we wanted to shuffle the batches and then keep them as a stream. It looked like this:

@Test
public void howScramblingCouldBeDone() {
    BatchingIterator.batchedStreamOf(Stream.of("A","B","C","D","E","F"), 3)
        // the lambda in the map expression sucks a bit because Collections.shuffle acts on the list, rather than returning a shuffled one
        .map(list -> {
            Collections.shuffle(list); return list; })
        .flatMap(List::stream)
        .forEach(System.out::println);
}

This outputs something like the following (it is randomised, so it differs every time):

A
C
B
E
D
F

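The secret sauce here is that there is always a stream, so you can either operate on a stream of batches, or do something to each batch and then flatMap it back into a stream. Even better, all of the above only runs when the final forEach, collect or other terminal expression pulls the data through the stream.

It turns out that iterator is a special kind of terminal operation on a stream: it does not cause the whole stream to run and be loaded into memory. Thanks to the Java 8 team for a brilliant design!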
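The claim that iterator() pulls elements on demand can be checked with a small experiment. The sketch below (class and field names are made up for this illustration) counts how many elements an infinite pipeline actually produces when only three are requested through the iterator.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

final class IteratorLazinessDemo {

    // Counts how many elements the pipeline has actually produced.
    static final AtomicInteger pulled = new AtomicInteger();

    static int[] firstThree() {
        pulled.set(0);
        Iterator<Integer> it = Stream.iterate(1, i -> i + 1) // infinite source
                .peek(i -> pulled.incrementAndGet())
                .iterator();
        return new int[] { it.next(), it.next(), it.next() };
    }

    public static void main(String[] args) {
        int[] values = firstThree();
        System.out.println(Arrays.toString(values)); // prints [1, 2, 3]
        System.out.println("elements pulled from the source: " + pulled.get());
    }
}
```

The program terminates even though the source is infinite, and the counter stays small, which is what makes the iterator usable as a lazy bridge for batching in a simple pipeline like this one.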

Answer 6

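Note! This solution reads the whole file before running the forEach.

You could do it with jOOλ, a library that extends Java 8 streams for single-threaded, sequential stream use cases: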

Seq.seq(lazyFileStream)              // Seq<String>
   .zipWithIndex()                   // Seq<Tuple2<String, Long>>
   .groupBy(tuple -> tuple.v2 / 500) // Map<Long, List<Tuple2<String, Long>>>
   .forEach((index, batch) -> {
       process(batch);
   });

Behind the scenes, zipWithIndex() is just:

static <T> Seq<Tuple2<T, Long>> zipWithIndex(Stream<T> stream) {
    final Iterator<T> it = stream.iterator();

    class ZipWithIndex implements Iterator<Tuple2<T, Long>> {
        long index;

        @Override
        public boolean hasNext() {
            return it.hasNext();
        }

        @Override
        public Tuple2<T, Long> next() {
            return tuple(it.next(), index++);
        }
    }

    return seq(new ZipWithIndex());
}

... whereas groupBy() is API convenience for:

default <K> Map<K, List<T>> groupBy(Function<? super T, ? extends K> classifier) {
    return collect(Collectors.groupingBy(classifier));
}

(Disclaimer: I work for the company behind jOOλ)
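For readers without jOOλ on the classpath, the same index-then-group idea can be approximated with the JDK alone. This is only a sketch under stated assumptions: GroupByIndexSketch and groupByIndex are invented names, and the counter-based classifier is stateful, so it is only meaningful on a sequential stream. Like the jOOλ version, it consumes the whole stream before any batch is available.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class GroupByIndexSketch {

    // Hypothetical helper: groups elements by (position / batchSize).
    // Sequential streams only, because the counter relies on encounter order.
    static <T> Map<Long, List<T>> groupByIndex(Stream<T> stream, int batchSize) {
        AtomicLong index = new AtomicLong();
        return stream.sequential().collect(Collectors.groupingBy(
                item -> index.getAndIncrement() / batchSize,
                TreeMap::new,           // keeps the batches in index order
                Collectors.toList()));
    }

    public static void main(String[] args) {
        Map<Long, List<String>> groups = groupByIndex(Stream.of("a", "b", "c", "d", "e"), 2);
        System.out.println(groups); // prints {0=[a, b], 1=[c, d], 2=[e]}
    }
}
```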

Answer 7

You can also use RxJava:

Observable.from(data).buffer(BATCH_SIZE).forEach((batch) -> process(batch));

or

Observable.from(lazyFileStream).buffer(500).map((batch) -> process(batch)).toList();

or

Observable.from(lazyFileStream).buffer(500).map(MyClass::process).toList();

Answer 8

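You could also take a look at cyclops-react; I am the author of this library. It implements the jOOλ interface (and, by extension, JDK 8 Streams), but unlike JDK 8 Parallel Streams it focuses on asynchronous operations (such as potentially blocking async I/O calls). JDK Parallel Streams, by contrast, focus on data parallelism for CPU-bound operations. It works by managing aggregates of Future-based tasks under the hood, while presenting a standard extended Stream API to end users.

This sample code may help you get started: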

LazyFutureStream.parallelCommonBuilder()
                .react(data)
                .grouped(BATCH_SIZE)                  
                .map(this::process)
                .run();

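There is a tutorial on batching here

and a more general tutorial here

To use your own thread pool (which is probably more appropriate for blocking I/O), you could start processing with: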

     LazyReact reactor = new LazyReact(40);

     reactor.react(data)
            .grouped(BATCH_SIZE)                  
            .map(this::process)
            .run();

Answer 9

A pure Java 8 example that also works with parallel streams.

How to use:

Stream<Integer> integerStream = IntStream.range(0, 45).parallel().boxed();
CsStreamUtil.processInBatch(integerStream, 10, batch -> System.out.println("Batch: " + batch));

The method declaration and implementation:

public static <ElementType> void processInBatch(Stream<ElementType> stream, int batchSize, Consumer<Collection<ElementType>> batchProcessor)
{
    List<ElementType> newBatch = new ArrayList<>(batchSize);

    stream.forEach(element -> {
        List<ElementType> fullBatch;

        synchronized (newBatch)
        {
            if (newBatch.size() < batchSize)
            {
                newBatch.add(element);
                return;
            }
            else
            {
                fullBatch = new ArrayList<>(newBatch);
                newBatch.clear();
                newBatch.add(element);
            }
        }

        batchProcessor.accept(fullBatch);
    });

    if (newBatch.size() > 0)
        batchProcessor.accept(new ArrayList<>(newBatch));
}

Answer 10

In all fairness, take a look at the elegant Vavr solution:

Stream.ofAll(data).grouped(BATCH_SIZE).forEach(this::process);

Answer 11

A simple example using Spliterator

    // read file into stream, try-with-resources
    try (Stream<String> stream = Files.lines(Paths.get(fileName))) {
        //skip header
        Spliterator<String> split = stream.skip(1).spliterator();
        Chunker<String> chunker = new Chunker<String>();
        while(true) {              
            boolean more = split.tryAdvance(chunker::doSomething);
            if (!more) {
                break;
            }
        }           
    } catch (IOException e) {
        e.printStackTrace();
    }
}

static class Chunker<T> {
    int ct = 0;
    public void doSomething(T line) {
        System.out.println(ct++ + " " + line.toString());
        if (ct % 100 == 0) {
            System.out.println("====================chunk=====================");               
        }           
    }       
}

Bruce's answer is more comprehensive, but I was looking for something quick and dirty to process a bunch of files.
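The Chunker above only prints a separator every 100 lines. If the chunks themselves are needed, one possible extension is sketched below. ChunkingConsumer and its flush method are hypothetical names for this illustration; the consumer collects items into lists and passes each full list to a handler, and flush has to be called once at the end to emit a short final chunk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.stream.Stream;

final class ChunkingConsumer<T> implements Consumer<T> {

    private final int chunkSize;
    private final Consumer<List<T>> chunkHandler;
    private List<T> current = new ArrayList<>();

    ChunkingConsumer(int chunkSize, Consumer<List<T>> chunkHandler) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
        this.chunkHandler = chunkHandler;
    }

    @Override
    public void accept(T item) {
        current.add(item);
        if (current.size() == chunkSize) {
            flush();
        }
    }

    // Call once after the source is exhausted to emit a short last chunk.
    void flush() {
        if (!current.isEmpty()) {
            chunkHandler.accept(current);
            current = new ArrayList<>(); // the handler keeps the old list
        }
    }

    public static void main(String[] args) {
        List<List<String>> chunks = new ArrayList<>();
        ChunkingConsumer<String> chunker = new ChunkingConsumer<String>(2, chunks::add);
        // Stand-in for Files.lines(...): skip the header, then feed every line to the chunker.
        Spliterator<String> split = Stream.of("header", "a", "b", "c").skip(1).spliterator();
        split.forEachRemaining(chunker);
        chunker.flush();
        System.out.println(chunks); // prints [[a, b], [c]]
    }
}
```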

Answer 12

Here is a lazily evaluated pure Java solution.

public static <T> Stream<List<T>> partition(Stream<T> stream, int batchSize){
    List<List<T>> currentBatch = new ArrayList<List<T>>(); //just to make it mutable 
    currentBatch.add(new ArrayList<T>(batchSize));
    return Stream.concat(stream
      .sequential()                   
      .map(new Function<T, List<T>>(){
          public List<T> apply(T t){
              currentBatch.get(0).add(t);
              return currentBatch.get(0).size() == batchSize ? currentBatch.set(0,new ArrayList<>(batchSize)): null;
            }
      }), Stream.generate(()->currentBatch.get(0).isEmpty()?null:currentBatch.get(0))
                .limit(1)
    ).filter(Objects::nonNull);
}

Answer 13

You can use apache.commons:

ListUtils.partition(ListOfLines, 500).stream()
                .map(partition -> processBatch(partition))
                .collect(Collectors.toList());

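The partitioning itself is done eagerly, but once the list is partitioned you get the benefits of working with streams (e.g. parallel streams, adding filters, etc.). Other answers suggest more elaborate solutions, but sometimes readability and maintainability are more important (and sometimes they are not :-) ).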

Answer 14

This can be done easily with Reactor:

Flux.fromStream(fileReader.lines().onClose(() -> safeClose(fileReader)))
            .map(line -> someProcessingOfSingleLine(line))
            .buffer(BUFFER_SIZE)
            .subscribe(apiService::makeHttpRequest);

Answer 15

With Java 8 and com.google.common.collect.Lists, you can do something like this:

public class BatchProcessingUtil {
    public static <T,U> List<U> process(List<T> data, int batchSize, Function<List<T>, List<U>> processFunction) {
        List<List<T>> batches = Lists.partition(data, batchSize);
        return batches.stream()
                .map(processFunction) // Send each batch to the process function
                .flatMap(Collection::stream) // flat results to gather them in 1 stream
                .collect(Collectors.toList());
    }
}

Here T is the type of the items in the input list and U is the type of the items in the output list.

You can use it like this:

List<String> userKeys = [... list of user keys]
List<Users> users = BatchProcessingUtil.process(
    userKeys,
    10, // Batch Size
    partialKeys -> service.getUsers(partialKeys)
);
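If Guava is not available, the same batch, map and flatten shape can be sketched with the JDK alone by replacing Lists.partition with subList views. The class name JdkBatchProcessing and the "user-" lookup in main are placeholders standing in for a real service call.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class JdkBatchProcessing {

    // Hypothetical JDK-only variant: partitions via subList views instead of Guava.
    static <T, U> List<U> process(List<T> data, int batchSize,
                                  Function<List<T>, List<U>> processFunction) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batchCount = (data.size() + batchSize - 1) / batchSize;
        return IntStream.range(0, batchCount)
                .mapToObj(i -> data.subList(i * batchSize,
                        Math.min(data.size(), (i + 1) * batchSize)))
                .map(processFunction)        // one call per batch
                .flatMap(Collection::stream) // flatten the per-batch results
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("k1", "k2", "k3", "k4", "k5");
        // Stand-in for a remote lookup that accepts a batch of keys.
        List<String> users = process(keys, 2,
                batch -> batch.stream().map(k -> "user-" + k).collect(Collectors.toList()));
        System.out.println(users); // prints [user-k1, user-k2, user-k3, user-k4, user-k5]
    }
}
```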

15 answers

Answer 1
138 2015-07-26 22:02:23

Answer 2
69 2015-06-07 14:48:14

Answer 3
42 2016-08-20 21:50:57

Answer 4
19 2017-01-19 17:45:01

Answer 5
15 2017-03-01 12:22:48

Answer 6
13 accepted 2015-06-05 09:09:05

Answer 7
10 2015-07-03 15:19:31

Answer 8
8 2015-07-03 13:42:10

Answer 9
3 2018-08-20 16:53:31

Answer 10
3 2020-05-04 16:40:59

Answer 11
1 2017-09-21 14:50:25

Answer 12
1 2018-10-11 03:03:34

Answer 13
1 2019-06-16 20:01:43

Answer 14
1 2020-03-24 18:29:59

Answer 15
0 2019-04-17 22:07:58
