Java - Read BZ2 file and uncompress/parse on the fly

I have a fairly large BZ2 file with several text files in it. Is it possible to use Java to pick out certain files inside the BZ2 file and uncompress/parse the data on the fly? Let's say a 300 MB BZ2 file contains 1 GB of text. Ideally, I'd like my Java program to read, say, 1 MB of the BZ2 file, uncompress it on the fly, act on it, and then keep reading the BZ2 file for more data. Is that possible?

Thanks

The commons-compress library from Apache is pretty good. Here's their examples page: http://commons.apache.org/proper/commons-compress/examples.html

Here's the Maven snippet (1.10 was the latest version when this was written; check Maven Central for the current release):

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-compress</artifactId>
    <version>1.10</version>
</dependency>

And here's my util method:

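import java.io.*;
import java.nio.charset.StandardCharsets;
import org.apache.commons.compress.compressors.*;

public static BufferedReader getBufferedReaderForCompressedFile(String fileIn) throws FileNotFoundException, CompressorException {
    FileInputStream fin = new FileInputStream(fileIn);
    // Format auto-detection needs mark/reset support, hence the BufferedInputStream
    BufferedInputStream bis = new BufferedInputStream(fin);
    CompressorInputStream input = new CompressorStreamFactory().createCompressorInputStream(bis);
    // Assumes the text is UTF-8; without an explicit charset the platform default is used
    return new BufferedReader(new InputStreamReader(input, StandardCharsets.UTF_8));
}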
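
For what it's worth, here is a rough sketch of how a reader like that can be consumed one line at a time, so only the current line is held in memory. The JDK has no bzip2 codec, so to keep the example runnable without extra jars it uses java.util.zip.GZIPInputStream as a stand-in; with commons-compress you would wrap the bzip2 stream instead. The names LineStreamDemo, forEachLine and gzip are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class LineStreamDemo {

    /** Feeds each line of an already-decompressing stream to the handler; returns the line count. */
    static long forEachLine(InputStream decompressed, Consumer<String> handler) throws IOException {
        long count = 0;
        // try-with-resources closes the reader and the wrapped stream
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(decompressed, StandardCharsets.UTF_8))) { // assumes UTF-8 text
            String line;
            while ((line = reader.readLine()) != null) {
                handler.accept(line);
                count++;
            }
        }
        return count;
    }

    /** Demo helper (hypothetical): gzip-compresses a string in memory. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("first\nsecond\nthird\n");
        // GZIP is only a stand-in here; any decompressing InputStream works the same way
        long n = forEachLine(new GZIPInputStream(new ByteArrayInputStream(compressed)),
                line -> System.out.println("got: " + line));
        System.out.println(n + " lines");
    }
}
```

The decompressor only pulls more compressed bytes from the file when readLine() needs them, which is the "read a bit, act on it, keep reading" behaviour asked about in the question.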

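The Ant project contains a bzip2 library, which has an org.apache.tools.bzip2.CBZip2InputStream class. You can use this class to decompress the bzip2 file on the fly; it simply extends the standard Java InputStream class. One thing to check in the Javadoc for your Ant version: as far as I know, the constructor expects the caller to have already consumed the two-byte "BZ" magic at the start of the file.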

You can use org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream from Apache commons-compress:

InputStream inputStream = new BZip2CompressorInputStream(new FileInputStream(xmlBz2File), true); // true = decompressConcatenated: keep reading past the first bzip2 stream (large files are often several concatenated streams, e.g. when written by pbzip2)

and then org.apache.commons.compress.utils.IOUtils:

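    int step = 1024 * 32;
    byte[] buffer = new byte[step];
    long pos = 0; // total number of decompressed bytes read so far
    int actualLength;
    // Always fill the buffer from offset 0; readFully returns 0 once the stream is exhausted
    while ((actualLength = IOUtils.readFully(inputStream, buffer, 0, step)) > 0) {
        pos += actualLength;
        // Caution: a multi-byte UTF-8 character may be split across two chunks
        String str = new String(buffer, 0, actualLength, StandardCharsets.UTF_8);
        // do whatever you want with str
    }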
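
One caveat with turning fixed-size byte chunks into Strings: a multi-byte UTF-8 character can straddle two chunks and come out as replacement characters. A possible way around that, sketched below, is to let an InputStreamReader do the decoding and read char chunks instead of byte chunks. This is plain JDK code, not part of commons-compress; ChunkedTextReader and readChunks are hypothetical names, and a ByteArrayInputStream stands in for the BZip2CompressorInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

final class ChunkedTextReader {

    /** Decodes the stream as UTF-8 and hands it to the handler in chunks of at most chunkChars chars. */
    static long readChunks(InputStream in, int chunkChars, Consumer<String> handler) throws IOException {
        long totalChars = 0;
        char[] buffer = new char[chunkChars]; // chunkChars must be > 0
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int n;
            // read() returns -1 at end of stream; the reader's decoder holds back
            // incomplete byte sequences until the remaining bytes arrive
            while ((n = reader.read(buffer, 0, buffer.length)) != -1) {
                totalChars += n;
                handler.accept(new String(buffer, 0, n));
            }
        }
        return totalChars;
    }

    public static void main(String[] args) throws IOException {
        // \u00e9 and \u00f6 each take two bytes in UTF-8
        byte[] utf8 = "h\u00e9llo w\u00f6rld".getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        readChunks(new ByteArrayInputStream(utf8), 3, sb::append);
        System.out.println(sb);
    }
}
```

Characters outside the Basic Multilingual Plane can still have their two surrogate chars land in different chunks, so this fits best when chunks are appended or scanned in order rather than interpreted in isolation.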

But it may be hard to deal with back pressure this way (the consumer may be faster than the producer, or vice versa), so I tried using Akka Streams with BZip2CompressorInputStream.
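
If pulling in Akka Streams is more than you need, a bounded java.util.concurrent.BlockingQueue gives a simple form of back pressure with only the JDK: the reading thread blocks in put() when the queue is full, and the processing thread blocks in take() when it is empty. To be clear, this is a swapped-in technique and not the Akka approach, and the names BoundedPipeline, run and EOF are invented for the sketch. In real use the BufferedReader would wrap the decompressed stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

final class BoundedPipeline {

    // Sentinel marking end of input; compared by identity, so it cannot collide with a real line.
    private static final String EOF = new String("EOF");

    /** Reads lines on a producer thread, processes them on the calling thread; returns lines processed. */
    static long run(BufferedReader source, int capacity, Consumer<String> handler)
            throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);

        Thread producer = new Thread(() -> {
            try {
                String line;
                while ((line = source.readLine()) != null) {
                    queue.put(line); // blocks while the queue is full
                }
            } catch (IOException e) {
                e.printStackTrace(); // a real program should pass this on to the consumer
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                try {
                    queue.put(EOF); // signal the end so the consumer can stop
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        producer.setDaemon(true); // do not keep the JVM alive if the consumer bails out early
        producer.start();

        long processed = 0;
        // take() blocks while the queue is empty
        for (String item = queue.take(); item != EOF; item = queue.take()) {
            handler.accept(item);
            processed++;
        }
        producer.join();
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        BufferedReader reader = new BufferedReader(new StringReader("a\nb\nc\n"));
        long n = run(reader, 2, line -> System.out.println("processing " + line));
        System.out.println(n + " lines processed");
    }
}
```

The queue capacity caps how far the reader can run ahead of the processing code, which keeps memory use bounded. Error handling is deliberately minimal here.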
