How do the compression codecs work in Python?

I'm querying a database and archiving the results using Python, and I'm trying to compress the data as I write it to the log files. I'm having some problems with it, though.

My code looks like this:

import codecs

log_file = codecs.open(archive_file, 'w', 'bz2')
for id, f1, f2, f3 in cursor:
    log_file.write('%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3))

However, my output file has a size of 1,409,780 bytes. Running bunzip2 on that file produces a file of 943,634 bytes, and running bzip2 on that yields a size of 217,275 bytes. In other words, the uncompressed file is significantly smaller than the file compressed using Python's bzip codec. Is there a way to fix this, other than running bzip2 on the command line?

I tried Python's gzip codec (changing the line to codecs.open(archive_file, 'a+', 'zip')) to see if it fixed the problem. I still get large files, but I also get a gzip: archive_file: not in gzip format error when I try to uncompress the file. What's going on there?


EDIT: I originally had the file opened in append mode, not write mode. While this may or may not be a problem, the question still holds if the file is opened in 'w' mode.

As other posters have noted, the issue is that the codecs library doesn't use an incremental encoder to encode the data; instead it encodes every snippet of data fed to the write method as a compressed block. This is horribly inefficient, and just a terrible design decision for a library designed to work with streams.

The ironic thing is that there's a perfectly reasonable incremental bz2 encoder already built into Python. It's not difficult to create a "file-like" class which does the correct thing automatically.

import bz2

class BZ2StreamEncoder(object):
    """File-like wrapper that feeds writes through an incremental bz2 compressor."""
    def __init__(self, filename, mode):
        self.log_file = open(filename, mode)
        self.encoder = bz2.BZ2Compressor()

    def write(self, data):
        # The compressor buffers input internally and emits complete
        # compressed blocks as they fill, so small writes stay efficient.
        self.log_file.write(self.encoder.compress(data))

    def flush(self):
        # Note: BZ2Compressor.flush() finalizes the stream, so this
        # should only be called once, when writing is finished.
        self.log_file.write(self.encoder.flush())
        self.log_file.flush()

    def close(self):
        self.flush()
        self.log_file.close()

log_file = BZ2StreamEncoder(archive_file, 'ab')

A caveat: In this example, I've opened the file in append mode; appending multiple compressed streams to a single file works perfectly well with bunzip2, but Python itself can't handle it (although there is a patch for it). If you need to read the compressed files you create back into Python, stick to a single stream per file.

The problem seems to be that output is being written on every write(). This causes each line to be compressed in its own bzip block.

I would try building a much larger string (or list of strings, if you are worried about performance) in memory before writing it out to the file. A good size to shoot for would be 900K (or more), as that is the block size that bzip2 uses.
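As a rough sketch (reusing the log_file and cursor from the question; the exact threshold is a guess tied to bzip2's nominal block size), the buffering might look like this:

BLOCK_SIZE = 900 * 1024  # roughly bzip2's block size

buffer = []
buffered_len = 0
for id, f1, f2, f3 in cursor:
    line = '%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3)
    buffer.append(line)
    buffered_len += len(line)
    if buffered_len >= BLOCK_SIZE:
        log_file.write(''.join(buffer))  # one large write per compressed block
        buffer = []
        buffered_len = 0
log_file.write(''.join(buffer))  # write out whatever is left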

The problem is due to your use of append mode, which results in files that contain multiple compressed blocks of data. Look at this example:

>>> import codecs
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
...     f.write("ABCD")

On my system, this produces a file 12 bytes in size. Let's see what it contains:

>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     f.read()
'ABCD'

Okay, now let's do another write in append mode:

>>> with codecs.open("myfile.zip", "a+", "zip") as f:
...     f.write("EFGH")

The file is now 24 bytes in size, and its contents are:

>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     f.read()
'ABCD'

What's happening here is that unzip expects a single zipped stream. You'll have to check the specs to see what the official behavior is with multiple concatenated streams, but in my experience they process the first one and ignore the rest of the data. That's what Python does.

I expect that bunzip2 is doing the same thing. So in reality your file is compressed, and is much smaller than the data it contains. But when you run it through bunzip2, you're getting back only the first set of records you wrote to it; the rest is discarded.
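If you do end up with several concatenated bz2 streams in one file, you can still recover all of them in Python by looping over bz2.BZ2Decompressor instances. A minimal sketch (the function name and path handling are made up for illustration):

import bz2

def read_all_streams(path):
    # Decompress a file holding one or more back-to-back bz2 streams.
    with open(path, 'rb') as f:
        data = f.read()
    chunks = []
    while data:
        decomp = bz2.BZ2Decompressor()
        chunks.append(decomp.decompress(data))
        data = decomp.unused_data  # whatever follows the end of this stream
    return b''.join(chunks)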

I'm not sure how different this is from the codecs way of doing it, but if you use GzipFile from the gzip module you can incrementally append to the file. However, it's not going to compress very well unless you are writing large amounts of data at a time (maybe > 1 KB). This is just the nature of the compression algorithms. If the data you are writing isn't super important (i.e. you can deal with losing it if your process dies), then you could write a buffered GzipFile class wrapping the imported class that writes out larger chunks of data.
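A minimal sketch of such a buffered wrapper (the class name and the 64 KB threshold are arbitrary choices, and it assumes Python 3's bytes-oriented gzip API):

import gzip

class BufferedGzipWriter(object):
    def __init__(self, filename, bufsize=64 * 1024):
        self.gz = gzip.open(filename, 'ab')  # append mode starts a new gzip member
        self.buffer = []
        self.size = 0
        self.bufsize = bufsize

    def write(self, data):
        # Accumulate small writes; compress only once a large chunk is ready.
        self.buffer.append(data)
        self.size += len(data)
        if self.size >= self.bufsize:
            self.flush()

    def flush(self):
        self.gz.write(b''.join(self.buffer))
        self.buffer = []
        self.size = 0

    def close(self):
        self.flush()
        self.gz.close()

Anything still sitting in the buffer when the process dies is lost, which is the trade-off mentioned above.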
