How to *properly* compress and decompress a text file using bz2 and python
So I've had this system that scrapes and compresses files for a while now using bz2 compression. The way it does so is using the following block of code I found on SO a few months back:
Let's assume for the purposes of this post that the filename is always file.XXXX, where XXXX is the relevant extension. We start with .txt
### How to compress a text file
import bz2

filepath_compressed = "file.tar.bz2"
with open("file.txt", 'rb') as data:
    tarbz2contents = bz2.compress(data.read(), 9)
with bz2.BZ2File(filepath_compressed, 'wb') as f_comp:
    f_comp.write(tarbz2contents)
Now, to decompress it, I've always got it to work using a decompression tool called Keka, which decompresses the .tar.bz2 file to .tar. I then run that through Keka again to get an "extensionless" file, add a .txt extension to it on my Mac, and then it works.
Now, to decompress programmatically, I've tried a few things. I've tried the stuff from this post and the code from this post. I've tried using BZ2Decompressor and BZ2File and everything. I just seem to be missing something and I'm not sure what it is.
Here is what I have so far, and I'd like to know what is wrong with this code:
import bz2, tarfile, shutil
# Decompress to tar
with bz2.BZ2File("file.tar.bz2") as fr, open("file.tar", "wb") as fw:
    shutil.copyfileobj(fr, fw)
# Decompress from tar to txt
with tarfile.open("file.tar", "r:") as tar:
    tar.extractall("file_out.txt")
This code crashes with a "tarfile.ReadError: truncated header" error. I think the first context manager outputs a binary text file, and I tried decoding that, but that failed too. What am I missing here? I feel like a noob.
If you would like a minimal runnable piece of code to replicate this, add the following to make a dummy file:
lines = ["Line 1","Line 2", "Line 3"]
with open("file.txt", "w") as f:
    for line in lines:
        f.write(line+"\n")
The thing that you're making is not a .tar.bz2 file, but rather a .bz2.bz2 file. You are compressing twice with bzip2 (the second time with no effect), and there is no tar file generation anywhere to be seen.
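To make that concrete, here is a minimal sketch of three things: reading back a file that was already written by the question's compression code (two bzip2 layers, no tar), writing a plain single-layer .bz2, and writing a real .tar.bz2 if an archive is what you actually want. The function names and the file paths in the usage note are hypothetical, chosen only for illustration. Note also that tarfile's extractall() takes a destination directory, not an output filename.

```python
import bz2
import os
import tarfile


def read_double_bz2(path):
    """Recover the original bytes from a file that was bzip2-compressed twice."""
    # bz2.open strips the outer layer; bz2.decompress strips the inner one.
    with bz2.open(path, "rb") as f:
        return bz2.decompress(f.read())


def compress_bz2(src_path, dst_path):
    """Compress one file exactly once, giving a plain .bz2 file."""
    with open(src_path, "rb") as src, bz2.open(dst_path, "wb", compresslevel=9) as dst:
        dst.write(src.read())


def decompress_bz2(src_path, dst_path):
    """Reverse compress_bz2: a single bzip2 layer, no tar involved."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(src.read())


def make_tar_bz2(src_path, archive_path):
    """Build a real tar archive and bzip2-compress it in one step."""
    with tarfile.open(archive_path, "w:bz2") as tar:
        # arcname keeps only the file name inside the archive, not the full path.
        tar.add(src_path, arcname=os.path.basename(src_path))


def extract_tar_bz2(archive_path, out_dir):
    """Extract a .tar.bz2 archive into a directory (not into a single file)."""
    os.makedirs(out_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:bz2") as tar:
        tar.extractall(out_dir)
```

For the files your system has already produced, something like `open("file.txt", "wb").write(read_double_bz2("file.tar.bz2"))` should give the text back. For new files, `compress_bz2("file.txt", "file.txt.bz2")` is probably all you need for a single text file; the tar variant only earns its keep when you want to bundle several files or preserve file metadata.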