
Java compressor not reading file completely

We have an issue unzipping bz2 files in Java, whereby the input stream thinks it's finished after reading ~3% of the file.

We would welcome any suggestions for how to decompress and read large bz2 files which have to be processed line by line.

Here are the details of what we have done so far:

For example, one bz2 file is 2.09 GB in size and 24.9 GB uncompressed.

The code below reads only 343,800 of the ~10 million lines the file actually contains.

Modifying the code to decompress the bz2 into a text file (FileInputStream straight into the CompressorInputStream) results in a file of ~190 MB, irrespective of the size of the bz2 file. I have tried setting a buffer size of 2048 bytes, but this has no effect on the outcome.

We have executed the code on 64-bit Windows and on Linux/CentOS, with the same outcome on both.

Could the buffered reader come to an empty, "null" line and cause the code to exit the while-loop?

import org.apache.commons.compress.compressors.*;
import java.io.*;

...

// Auto-detect the compression format (bz2) from the stream's magic bytes
CompressorInputStream is = new CompressorStreamFactory()
    .createCompressorInputStream(
        new BufferedInputStream(
            new FileInputStream(filePath)));

BufferedReader br = new BufferedReader(new InputStreamReader(is));
int lineNumber = 0;
String line;

while ((line = br.readLine()) != null) {
    this.processLine(line, ++lineNumber);
}

Even this variant, which copies raw bytes until the stream itself reports end-of-file, stops at exactly the same point:

byte[] buffer = new byte[1024];
int len;

// read() returns -1 only at end of stream
while ((len = is.read(buffer)) != -1) {
    out.write(buffer, 0, len);
}
out.flush();

There is nothing obviously wrong with your code; it should work. Note that readLine() returns null only at end of stream - a blank line comes back as the empty string "", so it cannot terminate your while-loop. This means the problem must be elsewhere.

Try to enable logging (i.e. print the lines as you process them). Make sure there are no gaps in the input (maybe write the lines to a new file and diff them against a known-good extraction). Use bzip2 --test to make sure the input file isn't corrupt. Check whether it always fails at the same line (maybe the input contains odd characters or binary data?).

The issue lies with the bz2 files themselves: they were created by a version of Hadoop which writes bad block headers inside the files.

Current Java decompressors stumble over this, while other tools either ignore it or handle it gracefully.

We will look for a solution/workaround.
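One thing worth trying: Hadoop often writes bz2 output as several bzip2 streams concatenated back to back, and by default Commons Compress stops after the first stream - which would explain reading only a few percent of the file. BZip2CompressorInputStream takes a second constructor argument, decompressConcatenated, that makes it keep reading subsequent streams. A minimal sketch (the class name MultiStreamBz2 and the countLines helper are illustrative, not from the original post):

```java
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

import java.io.*;

public class MultiStreamBz2 {

    // Counts lines across ALL concatenated bzip2 streams in the input.
    // The second constructor argument (decompressConcatenated = true)
    // tells the decoder to continue into the next stream instead of
    // stopping at the first end-of-stream marker.
    static long countLines(InputStream raw) throws IOException {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new BZip2CompressorInputStream(
                    new BufferedInputStream(raw), true)))) {
            long n = 0;
            while (br.readLine() != null) {
                n++;
            }
            return n;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countLines(new FileInputStream(args[0])));
    }
}
```

If the file really is multi-stream, the default (false) stops at the first stream boundary while true reads the whole file; whether this also tolerates the bad block headers described above would need to be verified against the actual Hadoop output.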
