Decompressing bz2 files on Windows

I am trying to decompress a bz2 file with the code snippet below, which is suggested in various places:

import bz2

bz2_data = bz2.BZ2File(DATA_FILE + ".bz2").read()
open(DATA_FILE, 'wb').write(bz2_data)

However, I am getting a much smaller file than I expect.

When I extract the file with the 7z GUI, I get a file of 248 MB. With the code above, however, the file I get is only 879 KB.

When I read the extracted XML file, I can see that the rest of the file is missing, as expected.

I am running Anaconda on a Windows machine, and as far as I understand, bz2 reaches an EOF before the file actually ends.

By the way, I have already tried this and this; neither helped.

If this is a multi-stream file, then Python's bz2 module (before 3.3) doesn't support it:

Note: This class does not support input files containing multiple streams (such as those produced by the pbzip2 tool). When reading such an input file, only the first stream will be accessible. If you require support for multi-stream files, consider using the third-party bz2file module (available from PyPI). This module provides a backport of Python 3.3's BZ2File class, which does support multi-stream files.

As an alternative, the third-party bz2file module is a drop-in replacement and should work.
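If installing bz2file is not an option, the standard library's low-level bz2.BZ2Decompressor can be used to walk through the streams by hand. The following is only a minimal sketch, not the bz2file implementation: decompress_all_streams is a hypothetical helper name, the payload is built in memory for illustration, and it assumes the input contains nothing but valid bz2 streams back to back (trailing garbage would raise an error).

```python
import bz2


def decompress_all_streams(raw):
    """Decompress every concatenated bz2 stream in `raw` (bytes).

    Hypothetical helper for illustration; assumes `raw` holds only
    valid bz2 streams, one after another.
    """
    chunks = []
    while raw:
        decompressor = bz2.BZ2Decompressor()
        chunks.append(decompressor.decompress(raw))
        # Bytes found after the end of this stream belong to the next one.
        raw = decompressor.unused_data
    return b"".join(chunks)


# Build a two-stream payload, similar in shape to what pbzip2 produces.
multi = bz2.compress(b"first part, ") + bz2.compress(b"second part")

# A single decompressor stops at the end of the first stream,
# which is the truncation described in the question.
single = bz2.BZ2Decompressor().decompress(multi)

everything = decompress_all_streams(multi)
```

This version holds the whole file in memory, so for a file of a few hundred megabytes a chunked, file-based approach is likely the better fit.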

If it is a multi-stream file, you have to set the mode to "r", or it will fail silently (e.g. output the compressed data as is).

This should do what you want:

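from bz2 import BZ2File

# The output file must be opened for binary writing ("wb").
with open(out_file_path, "wb") as out_file, BZ2File(bz2_file_path, "r") as bz2_file:
    # Read in 100 KB chunks until read() returns b"" at end of file.
    for data in iter(lambda: bz2_file.read(100 * 1024), b""):
        out_file.write(data)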

From the documentation:

If mode is 'r', the input file may be the concatenation of multiple compressed streams.

https://docs.python.org/3/library/bz2.html#bz2.BZ2File
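One way to check this behaviour on your own machine is to build a small two-stream file and decompress it. The sketch below assumes Python 3.3 or later; decompress_file, the sample file names, and the 1 MB chunk size are illustrative choices, not part of the bz2 API.

```python
import bz2
import os
import shutil
import tempfile


def decompress_file(bz2_path, out_path, chunk_size=1024 * 1024):
    """Stream-decompress bz2_path into out_path; return the output size in bytes."""
    with bz2.BZ2File(bz2_path, "r") as src, open(out_path, "wb") as dst:
        # copyfileobj reads and writes in chunks, so memory use stays bounded.
        shutil.copyfileobj(src, dst, chunk_size)
    return os.path.getsize(out_path)


# Create a throwaway two-stream archive (hypothetical sample data).
workdir = tempfile.mkdtemp()
bz2_path = os.path.join(workdir, "sample.xml.bz2")
out_path = os.path.join(workdir, "sample.xml")
with open(bz2_path, "wb") as f:
    f.write(bz2.compress(b"<a>1</a>") + bz2.compress(b"<b>2</b>"))

size = decompress_file(bz2_path, out_path)
with open(out_path, "rb") as f:
    result = f.read()

shutil.rmtree(workdir)  # clean up the temporary files
```

If the returned size matches what 7z reports for the same archive, both streams were read; a much smaller size suggests that only the first stream was decompressed.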
