Limit on bz2 file decompression using Python?

I have numerous files compressed in the bz2 format, and I am trying to decompress them into a temporary directory using Python so that I can analyze them. There are hundreds of thousands of files, so decompressing them manually isn't feasible, and I wrote the following script.

My issue is that whenever I do this, the output files are at most 900 kb, even though manual decompression yields about 6 MB per file. I am not sure whether this is a flaw in my code (in how I save the data as a string and then copy it to the file) or a problem with something else. I have tried this with different files, and I know that it works for files smaller than 900 kb. Has anyone else had a similar problem and found a solution?

My code is below:

import numpy as np
import bz2
import os
import glob

def unzip_f(filepath):
    '''
    Input a filepath specifying a group of Himawari .bz2 files with common names
    Outputs the path of all the temporary files that have been uncompressed

    '''


    cpath = os.getcwd() #get current path
    filenames_ = []  #list to add filenames to for future use

    for zipped_file in glob.glob(filepath):  #loop over the files that meet the name criteria
        with bz2.BZ2File(zipped_file,'rb') as zipfile:   #Read in the bz2 files
            newfilepath = cpath +'/temp/'+zipped_file[-47:-4]     #create a temporary file
            with open(newfilepath, "wb") as tmpfile: #open the temporary file
                for i,line in enumerate(zipfile.readlines()):
                    tmpfile.write(line) #write the data from the compressed file to the temporary file



            filenames_.append(newfilepath)
    return filenames_


path_='test/HS_H08_20180930_0710_B13_FLDK_R20_S*bz2'
unzip_f(path_)   

It returns the correct file paths, but the file sizes are wrong, capped at 900 kb.

It turns out this issue is due to the files being multi-stream, which does not work in Python 2.7. There is more information in the links mentioned by jasonharper. Below is a solution that just uses the Unix command to decompress the bz2 files and then moves them to the temporary directory I want. It is not as pretty, but it works.

import os
import glob
import shutil
import subprocess

def unzip_f(filepath):
    '''
    Input a filepath specifying a group of Himawari .bz2 files with common names
    Outputs the path of all the temporary files that have been uncompressed

    '''


    cpath = os.getcwd() #get current path
    filenames_ = []  #list to add filenames to for future use

    for zipped_file in glob.glob(filepath):  #loop over the files that meet the name criteria
        newfilepath = cpath +'/temp/'   #temporary directory (must already exist)
        newfilename = newfilepath + zipped_file[-47:-4]

        subprocess.check_call(['bzip2', '-kd', zipped_file])  #wait for bzip2 to finish; os.popen would return immediately
        shutil.move(zipped_file[:-4], newfilepath)  #bzip2 writes its output next to the source file

        filenames_.append(newfilename)
    return filenames_



path_='test/HS_H08_20180930_0710_B13_FLDK_R20_S0*bz2'

unzip_f(path_)   
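For readers on Python 3.3 or later, where the standard bz2 module reads multi-stream files, the same loop can probably be written without the external command. The following is only a sketch under that assumption: the function name unzip_to_dir and the out_dir parameter are hypothetical, and it derives the output name by dropping the final extension rather than slicing a fixed number of characters.

```python
import bz2
import glob
import os
import shutil


def unzip_to_dir(pattern, out_dir):
    """Decompress every .bz2 file matching `pattern` into `out_dir`.

    Returns the paths of the decompressed files. The names used here
    are illustrative and not part of the original script.
    """
    os.makedirs(out_dir, exist_ok=True)
    out_paths = []
    for zipped_file in sorted(glob.glob(pattern)):
        # Drop the directory and the final extension (".bz2") for the output name.
        name = os.path.splitext(os.path.basename(zipped_file))[0]
        out_path = os.path.join(out_dir, name)
        # The data is binary, so copy it in chunks instead of reading "lines".
        with bz2.open(zipped_file, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        out_paths.append(out_path)
    return out_paths
```

It would be called as unzip_to_dir('test/HS_H08_20180930_0710_B13_FLDK_R20_S*bz2', 'temp'). Copying in chunks also avoids holding a whole 6 MB file in memory as a list of lines.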

This is a known limitation in Python 2, where the BZ2File class doesn't support multiple streams. It can be easily resolved by using bz2file (https://pypi.org/project/bz2file/), which is a backport of the Python 3 implementation and can be used as a drop-in replacement.

After running pip install bz2file, you can just replace bz2 with it: import bz2file as bz2, and everything should just work :)
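One way to write that import so the same script also runs where bz2file is not installed is a guarded import. This is a minimal sketch, not the package's documented recipe; read_bz2 is a hypothetical helper name, and the fallback only handles multi-stream files on Python 3.3 or later.

```python
try:
    import bz2file as bz2  # backport with multi-stream support (pip install bz2file)
except ImportError:
    import bz2  # standard library; multi-stream aware only on Python 3.3+


def read_bz2(path):
    """Return the fully decompressed contents of a .bz2 file as bytes."""
    with bz2.BZ2File(path, "rb") as f:
        return f.read()
```

The rest of the code keeps referring to bz2.BZ2File, which is what makes the swap a drop-in change.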

The original Python bug report: https://bugs.python.org/issue1625
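To see what "multiple streams" means in practice, the sketch below builds a two-stream payload in memory and decompresses it two ways under Python 3.3 or later. A single BZ2Decompressor stops at the end of the first stream, which mirrors the truncation described in the question, while bz2.decompress continues through every stream. The payload is made up for illustration. The 900 kb ceiling would be consistent with each stream holding one maximum-size bzip2 block, though nothing in the thread confirms how the files were produced.

```python
import bz2

# Two independently compressed chunks, concatenated: a two-stream payload.
first = b"A" * 1000
second = b"B" * 1000
multi_stream = bz2.compress(first) + bz2.compress(second)

# A single BZ2Decompressor stops at the end of the first stream and
# leaves the remaining compressed bytes in `unused_data`.
decomp = bz2.BZ2Decompressor()
only_first = decomp.decompress(multi_stream)
leftover = decomp.unused_data

# bz2.decompress (Python 3.3+) keeps going through every stream.
everything = bz2.decompress(multi_stream)

print(len(only_first), len(leftover) > 0, len(everything))
```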
