
Java: stream processing of data which comes in chunks

Suppose a network connection or some other process fetches the data we need in chunks. Each chunk is an array of bytes. The data is just a plain text file consisting of many lines, and we want to process it line by line. Is this possible?

A straightforward way to do this is to wait until all the data has arrived, adding each chunk to a ByteBuffer in the meantime, or simply merging the chunks into one big byte array with System.arraycopy. After that we can create one big String and read it line by line, or create a ByteArrayInputStream, wrap it in an InputStreamReader, and read it with some Reader.
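For reference, the buffer-everything approach described above might look roughly like this. It is only a sketch: the class and method names are made up for illustration, a ByteArrayOutputStream stands in for the manual System.arraycopy merging, and UTF-8 is assumed as the text encoding.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: merges all chunks first, then splits into lines.
class BufferAllThenRead {

    static List<String> readAllLines(List<byte[]> chunks) throws IOException {
        // Collect every chunk into one growing byte buffer.
        ByteArrayOutputStream merged = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            merged.write(chunk, 0, chunk.length);
        }

        // Decode the merged bytes (UTF-8 is an assumption) and read line by line.
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(merged.toByteArray()), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> chunks = Arrays.asList(
                "first\nsec".getBytes(StandardCharsets.UTF_8),
                "ond\nthird".getBytes(StandardCharsets.UTF_8));
        for (String line : readAllLines(chunks)) {
            System.out.println(line);
        }
    }
}
```

Nothing is processed until the last chunk is in memory, which is exactly the limitation the rest of the question is about.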

OK, but can we do it in a true streaming fashion, processing each chunk as it arrives? There is no guarantee that a chunk contains a whole number of lines. A chunk can end in the middle of a line, and this has to be handled: in that case we should wait for the next chunk.

Is there a way to do this without waiting for the end of the file?

This isn't all that different from just reading from a BufferedReader. The difference is that a BufferedReader doesn't buffer more data in the background while the current chunk is being processed; it waits until its buffer is empty and you call one of its read() methods. But if that's OK, wire a BufferedReader to your input and keep things simple.
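A minimal sketch of the BufferedReader option, assuming the chunked source can be exposed as an InputStream (a socket stream, for example). The class name, the forEachLine helper and the UTF-8 encoding are assumptions for illustration. In the demo a SequenceInputStream over two byte arrays simulates a line that is split across two chunks.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical helper: hands each complete line to a callback as soon as it is read.
class StreamLineReader {

    static void forEachLine(InputStream in, Consumer<String> handler) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            // readLine() blocks until a full line (or end of stream) is available,
            // so a line split across two chunks is joined transparently.
            while ((line = reader.readLine()) != null) {
                handler.accept(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Two "chunks"; the second line is cut in the middle.
        InputStream in = new SequenceInputStream(
                new ByteArrayInputStream("first\nsec".getBytes(StandardCharsets.UTF_8)),
                new ByteArrayInputStream("ond\nthird\n".getBytes(StandardCharsets.UTF_8)));
        forEachLine(in, System.out::println);
    }
}
```

The handler runs on the reading thread, so no new data is fetched while a line is being processed.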

If you need to read in parallel, look into PipedInputStream/PipedOutputStream. They're paired: one thread writes the data it reads from the source to the PipedOutputStream, and another thread reads it from the PipedInputStream.
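The piped-stream idea could be sketched as follows. The class name, the readLines helper, the 8192-byte pipe size and the UTF-8 encoding are assumptions; in a real program the producer thread would write chunks as they arrive from the network rather than iterate over a prepared list.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo: one thread feeds chunks into a pipe, another reads lines from it.
class PipedLineReader {

    static List<String> readLines(List<byte[]> chunks) throws IOException, InterruptedException {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out, 8192); // pipe size is an arbitrary choice

        // Producer: stands in for the network thread that receives chunks.
        Thread producer = new Thread(() -> {
            // Closing the output side signals end of stream to the reader.
            try (PipedOutputStream o = out) {
                for (byte[] chunk : chunks) {
                    o.write(chunk);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        producer.start();

        // Consumer: reads complete lines while the producer keeps writing.
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        producer.join();
        return lines;
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> chunks = Arrays.asList(
                "first\nsec".getBytes(StandardCharsets.UTF_8),
                "ond\nthird\n".getBytes(StandardCharsets.UTF_8));
        for (String line : readLines(chunks)) {
            System.out.println(line);
        }
    }
}
```

The two ends of a pipe should live on different threads; using both from a single thread can deadlock once the pipe's internal buffer fills up.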

Or you can use non-blocking IO, but that involves saving the processing context so you can resume it later.
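For line splitting, the "processing context" is essentially the unfinished tail of the previous chunk. The sketch below shows one way to keep it between calls; it does not use java.nio itself, and the LineAssembler class, its accept/finish methods and the UTF-8 assumption are all hypothetical. Splitting on the byte 0x0A before decoding is safe for UTF-8, because that byte never occurs inside a multi-byte character.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical helper: call accept() for every chunk as it arrives, finish() at end of data.
class LineAssembler {
    // Saved context: bytes of the current, not yet terminated line.
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
    private final Consumer<String> handler;

    LineAssembler(Consumer<String> handler) {
        this.handler = handler;
    }

    void accept(byte[] chunk) {
        int start = 0;
        for (int i = 0; i < chunk.length; i++) {
            if (chunk[i] == '\n') {
                // Line complete: append its last part and hand it over.
                pending.write(chunk, start, i - start);
                emit();
                start = i + 1;
            }
        }
        // Keep the unfinished tail until the next chunk arrives.
        pending.write(chunk, start, chunk.length - start);
    }

    void finish() {
        // A last line without a trailing newline is still a line.
        if (pending.size() > 0) {
            emit();
        }
    }

    private void emit() {
        byte[] bytes = pending.toByteArray();
        pending.reset();
        int len = bytes.length;
        if (len > 0 && bytes[len - 1] == '\r') {
            len--; // tolerate Windows-style \r\n line endings
        }
        handler.accept(new String(bytes, 0, len, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        LineAssembler assembler = new LineAssembler(System.out::println);
        assembler.accept("first\nsec".getBytes(StandardCharsets.UTF_8));
        assembler.accept("ond\nthird".getBytes(StandardCharsets.UTF_8));
        assembler.finish();
    }
}
```

Because accept() never blocks, it can be called from whatever callback or selector loop delivers the chunks.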


 