
Java IO: Reading a file that is still being written

I am creating a program which needs to read from a file that is still being written.

The main question is this: if the reads and writes are performed using InputStream and OutputStream classes running on separate threads, what are the catches and edge cases I will need to be aware of in order to prevent data corruption?

In case anyone is wondering whether I have considered other, non-InputStream-based approaches, the answer is yes, I have, but unfortunately that's not possible in this project since the program uses libraries that only work with InputStream and OutputStream.

Also, several readers have asked why this complication is necessary. Why not perform the reading after the file has been written completely?

The reason is efficiency. The program will perform the following steps:

  1. Download a series of byte chunks of 1.5MB each. The program will receive thousands of such chunks, totalling up to 30GB. Also, chunks are downloaded concurrently in order to maximize bandwidth, so they may arrive out of order.
  2. Send each chunk for processing as soon as it has arrived. Note that chunks will be sent for processing in order: if chunk m arrives before chunk m-1 does, it will be buffered on disk until chunk m-1 arrives and is sent for processing (see the sketch below).
  3. Perform processing of the chunks, starting from chunk 0 up to chunk n, until every chunk has been processed.
  4. Send the processed results back.

If we were to wait for the whole file to be transferred, it would introduce a huge delay in what is supposed to be a real-time system.
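As an illustration of the buffering described in step 2, here is a minimal sketch of an in-order hand-off, assuming chunks are released for processing strictly by index (the class and method names are illustrative, and a real implementation would buffer to disk rather than in memory):

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: chunks may arrive in any order, but are released
// for processing strictly by index.
class ChunkReorderer {
    private final Map<Integer, byte[]> pending = new HashMap<>();
    private int next = 0; // index of the next chunk to hand off

    // Called by a download thread whenever a chunk finishes arriving.
    synchronized void onChunkArrived(int index, byte[] data) {
        pending.put(index, data);
        byte[] chunk;
        // Release every consecutive chunk that is now available.
        while ((chunk = pending.remove(next)) != null) {
            process(next, chunk);
            next++;
        }
    }

    private void process(int index, byte[] data) {
        System.out.println("processing chunk " + index + " (" + data.length + " bytes)");
    }
}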

Use a RandomAccessFile. Via getChannel (or similar) one could use a ByteBuffer.
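A minimal sketch of that approach, assuming the reader polls the growing file for newly appended bytes (the buffer size, poll interval, and missing stop condition are all illustrative):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical sketch: read newly appended bytes from a file that is
// still being written, polling when no new data is available yet.
static void tail(String path) throws IOException, InterruptedException {
    try (RandomAccessFile raf = new RandomAccessFile(path, "r");
         FileChannel channel = raf.getChannel()) {
        ByteBuffer buf = ByteBuffer.allocate(8192);
        long position = 0;
        while (true) { // a real version needs some end-of-data signal to stop
            int read = channel.read(buf, position); // read from an explicit offset
            if (read > 0) {
                position += read;
                buf.flip();
                // ... hand the buffer off for processing here ...
                buf.clear();
            } else {
                Thread.sleep(100); // writer hasn't produced more bytes yet
            }
        }
    }
}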

You will not be able to "insert" or "delete" middle parts of the file. For such a purpose your original approach would be fine, but using two files.

For concurrency: to keep things in sync you could maintain a single object model of the file and make changes there. Only the pending changes need to be kept in memory; other hierarchical data could be reread and reparsed as needed.

You should use PipedInputStream and PipedOutputStream:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

static Thread newCopyThread(InputStream is, OutputStream os) {
    // Copies everything from is to os until end-of-stream is reached.
    Thread t = new Thread() {
        @Override
        public void run() {
            byte[] buffer = new byte[2048];
            try {
                int size;
                while ((size = is.read(buffer)) >= 0) {
                    os.write(buffer, 0, size);
                }
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                // Close both ends even if the copy failed.
                try { is.close(); } catch (IOException ignored) { }
                try { os.close(); } catch (IOException ignored) { }
            }
        }
    };
    return t;
}

public static void main(String[] args) throws IOException, InterruptedException {
    // Producer copies "abcdefg" into the pipe; consumer copies the pipe to stdout.
    ByteArrayInputStream bi = new ByteArrayInputStream("abcdefg".getBytes());
    PipedInputStream is = new PipedInputStream();
    PipedOutputStream os = new PipedOutputStream(is);
    Thread p = newCopyThread(bi, os);
    Thread c = newCopyThread(is, System.out);
    p.start();
    c.start();
    p.join();
    c.join();
}
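Note that reads from a PipedInputStream block until data is available, and writes to the PipedOutputStream block once the pipe's internal buffer (1024 bytes by default) is full, so producer and consumer naturally throttle each other.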

So your problem (as you've now clarified it) is that you can't start processing until chunk #1 has arrived, and you need to buffer every chunk #N (N > 1) until you can process it.

I would write each chunk to its own file and create a custom InputStream that will read every chunk in order. While downloading, the chunk file would be named something like chunk.1.downloading, and when the whole chunk has been loaded it would be renamed to chunk.1.
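A minimal sketch of that rename-on-completion scheme (the method name is illustrative, and ATOMIC_MOVE support depends on the file system):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: persist a completed chunk, then rename it so that
// readers only ever see fully written "chunk.N" files.
static void publishChunk(int n, byte[] chunkBytes) throws IOException {
    Path tmp = Paths.get("chunk." + n + ".downloading");
    Files.write(tmp, chunkBytes);
    Files.move(tmp, Paths.get("chunk." + n), StandardCopyOption.ATOMIC_MOVE);
}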

The custom InputStream will check whether file chunk.N exists (where N = 1...X). If not, it will block. Each time a chunk has been downloaded completely, the InputStream is notified; it will then check whether the downloaded chunk is the next one to be processed. If yes, it reads as normal; otherwise it blocks again.
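A minimal sketch of such a custom InputStream, assuming the highest chunk number is known up front and substituting polling for the notification described above (names are illustrative):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: reads chunk.1, chunk.2, ... in order, waiting until
// the next chunk file has been renamed into place before continuing.
class ChunkedInputStream extends InputStream {
    private final int lastChunk; // highest chunk number, assumed known up front
    private int current = 1;     // chunks are numbered chunk.1 ... chunk.X here
    private InputStream in;

    ChunkedInputStream(int lastChunk) { this.lastChunk = lastChunk; }

    @Override
    public int read() throws IOException {
        while (true) {
            if (in == null) {
                if (current > lastChunk) return -1; // all chunks consumed
                Path next = Paths.get("chunk." + current);
                while (!Files.exists(next)) {       // block until the chunk arrives
                    try { Thread.sleep(100); }
                    catch (InterruptedException e) { throw new IOException(e); }
                }
                in = new FileInputStream(next.toFile());
            }
            int b = in.read();
            if (b >= 0) return b;
            in.close(); // current chunk exhausted, advance to the next one
            in = null;
            current++;
        }
    }
}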
