
Possible to decompress bz2 in python to a file instead of memory

I've worked with decompressing and reading files on the fly in memory with the bz2 library. However, I've read through the documentation and can't seem to find a way to simply decompress a file and write the decompressed data to a brand new file on the file system, without holding it all in memory. Sure, you could read line by line using BZ2Decompressor and then write each line to a file, but that would be insanely slow (we're decompressing massive files, 50 GB+). Is there some method or library I have overlooked that achieves the same functionality as the terminal command bzip2 -d myfile.ext.bz2 in Python, without a hacky solution involving a subprocess to call that terminal command?
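For context, a manual bz2.BZ2Decompressor loop does not have to go line by line; it can feed fixed-size compressed chunks through the decompressor. The sketch below is only illustrative (the function name and chunk size are made up) and handles multi-stream archives by starting a new decompressor when one stream ends:

import bz2

def manual_decompress(src, dst, chunk_size=1024 * 1024):
    # Feed compressed chunks in, write decompressed bytes out; memory use stays bounded.
    decomp = bz2.BZ2Decompressor()
    with open(src, 'rb') as fin, open(dst, 'wb') as fout:
        for chunk in iter(lambda: fin.read(chunk_size), b''):
            while chunk:
                fout.write(decomp.decompress(chunk))
                if decomp.eof:                    # current bz2 stream finished
                    chunk = decomp.unused_data    # compressed bytes after the stream end
                    decomp = bz2.BZ2Decompressor()
                else:
                    chunk = b''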

An example of why bz2 is so slow:

Decompressing that file via bzip2 -d: 104 seconds

Analytics on an already-decompressed file (just involves reading line by line): 183 seconds

with open(file_src) as x:
    for l in x:
        ...  # per-line analytics go here

Decompressing the file and running the analytics on the fly: over 600 seconds (this should take at most 104 + 183 seconds)

if file_src.endswith(".bz2"):
    bz_file = bz2.BZ2File(file_src)
    for l in bz_file:
        ...  # same per-line analytics

You could use the bz2.BZ2File object, which provides a transparent file-like handle.

(Edit: you seem to use that already, but don't use readlines() on the file, whether it's opened in binary or text mode, because in your case the block size isn't big enough, which explains why it's slow.)

Then use shutil.copyfileobj to copy to the write handle of your output file (you can adjust the block size if you can afford the memory):

import bz2, shutil

with bz2.BZ2File("file.bz2") as fr, open("output.bin","wb") as fw:
    shutil.copyfileobj(fr,fw)

Even if the file is big, it doesn't take more memory than the block size. Adjust the block size like this:

shutil.copyfileobj(fr,fw,length = 1000000)  # read by 1MB chunks
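For reference, here is roughly the loop that shutil.copyfileobj runs internally, written out explicitly. This is only a sketch using the same file names as above; the chunk size is an arbitrary illustrative value:

import bz2

CHUNK_SIZE = 1024 * 1024  # 1 MB per read; at most this much decompressed data is held at once

with bz2.BZ2File("file.bz2") as fr, open("output.bin", "wb") as fw:
    while True:
        chunk = fr.read(CHUNK_SIZE)
        if not chunk:  # an empty bytes object signals the end of the decompressed stream
            break
        fw.write(chunk)

Writing it out this way also makes it easy to hook in progress reporting or per-chunk processing between the read and the write.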

For smaller files that you can hold in memory before saving to a file, you can use bz2.open to decompress the file and save it as a new, uncompressed file.

import bz2

# read and decompress the data
with bz2.open('compressed_file.bz2', 'rb') as f:
    uncompressed_content = f.read()

# write the decompressed data to a new file
with open('new_uncompressed_file.dat', 'wb') as f:
    f.write(uncompressed_content)
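If you want one helper that picks between the two approaches shown above, you could branch on the size of the compressed file. The decompress_bz2 name and the 100 MB threshold below are illustrative assumptions (note that the check is on the compressed size; the decompressed data can be several times larger):

import bz2
import os
import shutil

def decompress_bz2(src, dst, in_memory_limit=100 * 1024 * 1024):
    # Decompress src (a .bz2 file) into dst, streaming when the input looks large.
    if os.path.getsize(src) <= in_memory_limit:
        # small file: decompress entirely in memory, then write once
        with bz2.open(src, 'rb') as f:
            data = f.read()
        with open(dst, 'wb') as f:
            f.write(data)
    else:
        # large file: stream through a bounded buffer
        with bz2.BZ2File(src) as fr, open(dst, 'wb') as fw:
            shutil.copyfileobj(fr, fw, length=1024 * 1024)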
