
Convert `BufferedReader` to `Stream<String>` in a parallel way

Is there a way to obtain a Stream<String> stream from a BufferedReader reader such that each string in stream represents one line of reader, with the additional condition that stream is provided directly (before reader has read everything)? I want to process the data of stream in parallel with getting it from reader, to save time.

Edit: I want to process the data in parallel with reading. I don't want to process different lines in parallel; they should be processed in order.

Let's make an example of how I want to save time. Say our reader will present 100 lines to us. It takes 2 ms to read one line and 1 ms to process it. If I first read all the lines and then process them, it will take me 300 ms. What I want to do instead: as soon as a line is read, I want to process it while the next line is read in parallel. The total time will then be 201 ms.

What I don't like about BufferedReader.lines(): as far as I understand, reading starts only when I want to process the lines. Assume I already have my reader but have to do precomputations before being able to process the first line. Say they cost 30 ms. In the above example the total time would then be 231 ms or 301 ms using reader.lines() (can you tell me which of those times is correct?). But it would be possible to get the job done in 201 ms, since the precomputations can be done in parallel with reading the first 15 lines.

You can use reader.lines().parallel(). This way your input will be split into chunks, and further stream operations will be performed on the chunks in parallel. If the further operations take significant time, you might get a performance improvement.
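A minimal sketch of this suggestion (class name and input are illustrative; toUpperCase stands in for an expensive per-line operation). Note that collect preserves the encounter order of the lines even on a parallel stream:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelLinesDemo {
    // Processes each line in parallel; collect() preserves encounter order.
    static List<String> processParallel(BufferedReader reader) {
        return reader.lines()
                     .parallel()
                     .map(String::toUpperCase) // stand-in for expensive work
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        BufferedReader reader = new BufferedReader(new StringReader("alpha\nbeta\ngamma"));
        System.out.println(processParallel(reader)); // [ALPHA, BETA, GAMMA]
    }
}
```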

In your case the default heuristic will not work as you want, and I guess there's no ready-made solution that will let you use single-line batches. You can write a custom spliterator which splits after each line. Look into the java.util.Spliterators.AbstractSpliterator implementation. Probably the easiest solution would be to write something similar, but limit batch sizes to one element in trySplit and read a single line in the tryAdvance method.
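A minimal sketch of such a spliterator, covering only the sequential tryAdvance part (class name and demo input are illustrative; a full solution would additionally override trySplit to hand out one-element batches, as described above):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Advances one line at a time instead of using larger batches.
class LineSpliterator extends Spliterators.AbstractSpliterator<String> {
    private final BufferedReader reader;

    LineSpliterator(BufferedReader reader) {
        super(Long.MAX_VALUE, ORDERED | NONNULL);
        this.reader = reader;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        try {
            String line = reader.readLine();
            if (line == null) return false; // end of input
            action.accept(line);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

public class LineSpliteratorDemo {
    public static void main(String[] args) {
        BufferedReader reader = new BufferedReader(new StringReader("a\nb\nc"));
        Stream<String> lines = StreamSupport.stream(new LineSpliterator(reader), false);
        lines.forEach(System.out::println);
    }
}
```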

To do what you want, you would typically have one thread that reads lines and adds them to a blocking queue, and another thread that takes lines from this blocking queue and processes them.
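A sketch of that producer/consumer setup, using an ArrayBlockingQueue and a sentinel value to signal end of input (the class name, sentinel, and the toUpperCase stand-in processing are illustrative):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedReader {
    private static final String EOF = "\u0000EOF"; // sentinel marking end of input

    // Reads lines on a background thread while processing them in order here.
    static List<String> readAndProcess(BufferedReader reader) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        Thread producer = new Thread(() -> {
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    queue.put(line); // blocks when the queue is full
                }
                queue.put(EOF);
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        producer.start();

        List<String> processed = new ArrayList<>();
        String line;
        while (!(line = queue.take()).equals(EOF)) {
            processed.add(line.toUpperCase()); // stand-in for real processing
        }
        producer.join();
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        BufferedReader reader = new BufferedReader(new StringReader("one\ntwo\nthree"));
        System.out.println(readAndProcess(reader)); // [ONE, TWO, THREE]
    }
}
```

The bounded queue gives natural back-pressure: if processing is slower than reading, the producer blocks instead of buffering the whole file in memory.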

You are looking in the wrong place. You are thinking that a stream of lines will read lines from the file, but that's not how it works. You can't tell the underlying system to read a line, as no one knows what a line is before reading it.

A BufferedReader has its name because of its character buffer. This buffer has a default capacity of 8192. Whenever a new line is requested, the buffer is scanned for a line-terminator sequence and that part is returned. When the buffer does not hold enough characters to find a complete line, the entire buffer is filled.

Now, filling the buffer may lead to requests to read bytes from the underlying InputStream to fill the buffer of the character decoder. How many bytes will be requested, and how many will actually be read, depends on the buffer size of the character decoder, on how many bytes of the actual encoding map to one character, and on whether the underlying InputStream has its own buffer and how big it is.

The actually expensive operation is reading bytes from the underlying stream, and there is no trivial mapping from line-read requests to these read operations. Requesting the first line may cause reading of, say, one 16 KiB chunk from the underlying file, and the subsequent one hundred requests might be served from the filled buffer and cause no I/O at all. Nothing you do with the Stream API can change anything about that. The only thing you could parallelize is the search for newline characters in the buffer, which is too trivial to benefit from parallel execution.
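This buffering behavior can be observed by counting bulk reads on the underlying stream. In this sketch (class and method names are illustrative), a hundred readLine() calls typically trigger only a couple of reads from the source:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

public class BufferFillDemo {
    // Returns {linesRead, underlyingReads}: how many readLine() calls were
    // served by how many bulk read() calls on the underlying stream.
    static int[] countReads(byte[] data) throws IOException {
        AtomicInteger reads = new AtomicInteger();
        FilterInputStream counting = new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public int read(byte[] b, int off, int len) throws IOException {
                reads.incrementAndGet(); // count every request hitting the source
                return super.read(b, off, len);
            }
        };
        BufferedReader reader = new BufferedReader(new InputStreamReader(counting, StandardCharsets.UTF_8));
        int lines = 0;
        while (reader.readLine() != null) lines++;
        return new int[] {lines, reads.get()};
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) sb.append("line ").append(i).append('\n');
        int[] result = countReads(sb.toString().getBytes(StandardCharsets.UTF_8));
        System.out.println(result[0] + " lines, " + result[1] + " underlying reads");
    }
}
```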

You could reduce the buffer sizes of all involved parties to roughly get your intended parallel reading of one line while processing the previous line; however, that parallel execution will never compensate for the performance degradation caused by the small buffer sizes.
