
Using bzip2 low-level routines to compress chunks of data

The Overview

I am using the low-level calls in the libbzip2 library, BZ2_bzCompressInit(), BZ2_bzCompress() and BZ2_bzCompressEnd(), to compress chunks of data to standard output.

I am migrating working code away from the higher-level calls because I have a stream of incoming bytes that I want to compress in sets of discrete chunks. (A discrete chunk is a set of bytes that contains a group of tokens of interest; my input is logically divided into groups of these chunks.)

A complete group of chunks might contain, say, 500 chunks, which I want to compress into one bzip2 stream and write to standard output.

Within a set, using the pseudocode I outline below, if my example buffer can hold 101 chunks at a time, I would open a new stream and compress the 500 chunks in runs of 101, 101, 101, 101, and one final run of 96 chunks that closes the stream.

The Problem

The issue is that my bz_stream structure instance, which keeps track of the number of compressed bytes in a single pass of the BZ2_bzCompress() routine, seems to report more compressed bytes than the total number of bytes in the final compressed file.

For example, the compressed output could be a file with a true size of 1234 bytes, while the number of reported compressed bytes (which I track while debugging) is higher than 1234 bytes (say 2345 bytes).

My rough pseudocode is in two parts.

The first part is a rough sketch of what I do to compress a subset of chunks (when I know that another subset is coming after this one):

bz_stream bzStream;
unsigned char bzBuffer[BZIP2_BUFFER_MAX_LENGTH] = {0};
unsigned long bzBytesWritten = 0UL;
unsigned long long cumulativeBytesWritten = 0ULL;
unsigned char myBuffer[UNCOMPRESSED_MAX_LENGTH] = {0};
size_t myBufferLength = 0;

/* initialize bzStream */
bzStream.next_in = NULL;
bzStream.avail_in = 0U;
bzStream.avail_out = 0U;
bzStream.bzalloc = NULL;
bzStream.bzfree = NULL;
bzStream.opaque = NULL;
int bzError = BZ2_bzCompressInit(&bzStream, 9, 0, 0); 

/* bzError checking... */

do
{
    /* read some bytes into myBuffer... */

    /* compress bytes in myBuffer */
    bzStream.next_in = myBuffer;
    bzStream.avail_in = myBufferLength;
    bzStream.next_out = bzBuffer;
    bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
    do 
    {
        bzStream.next_out = bzBuffer;
        bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
        bzError = BZ2_bzCompress(&bzStream, BZ_RUN);

        /* error checking... */

        bzBytesWritten = ((unsigned long) bzStream.total_out_hi32 << 32) + bzStream.total_out_lo32;
        cumulativeBytesWritten += bzBytesWritten;

        /* write compressed data in bzBuffer to standard output */
        fwrite(bzBuffer, 1, bzBytesWritten, stdout);
        fflush(stdout);
    } 
    while (bzError == BZ_OK);
} 
while (/* while there is a non-final myBuffer full of discrete chunks left to compress... */);

Now we wrap up the output:

/* read the final batch of bytes into myBuffer (with a total byte size of `myBufferLength`)... */

/* compress remaining myBufferLength bytes in myBuffer */
bzStream.next_in = myBuffer;
bzStream.avail_in = myBufferLength;
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
do 
{
    bzStream.next_out = bzBuffer;
    bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
    bzError = BZ2_bzCompress(&bzStream, (bzStream.avail_in) ? BZ_RUN : BZ_FINISH);

    /* bzError error checking... */

    /* increment cumulativeBytesWritten by `bz_stream` struct `total_out_*` members */
    bzBytesWritten = ((unsigned long) bzStream.total_out_hi32 << 32) + bzStream.total_out_lo32;
    cumulativeBytesWritten += bzBytesWritten;

    /* write compressed data in bzBuffer to standard output */
    fwrite(bzBuffer, 1, bzBytesWritten, stdout);
    fflush(stdout);
} 
while (bzError != BZ_STREAM_END);

/* close stream */
bzError = BZ2_bzCompressEnd(&bzStream);

/* bzError checking... */

The Questions

  • Am I calculating cumulativeBytesWritten (or, specifically, bzBytesWritten) incorrectly, and how would I fix that?

I have been tracking these values in a debug build, and I do not seem to be "double counting" the bzBytesWritten value. This value is computed once and used once to increment cumulativeBytesWritten after each successful BZ2_bzCompress() pass.

  • Alternatively, am I misunderstanding the correct use of the bz_stream state flags?

For example, does the following compress the data and keep the bzip2 stream open, so long as I keep sending some bytes?

bzError = BZ2_bzCompress(&bzStream, BZ_RUN);

Likewise, can the following statement compress data so long as at least some bytes are available through the bzStream.next_in pointer (BZ_RUN), and then wrap up the stream when no more bytes are available (BZ_FINISH)?

bzError = BZ2_bzCompress(&bzStream, (bzStream.avail_in) ? BZ_RUN : BZ_FINISH);

  • Or, am I not using these low-level calls correctly at all? Should I go back to using the higher-level calls to continuously append a grouping of compressed chunks of data to one main file?

There's probably a simple solution to this, but I've been banging my head on the table for a couple of days while debugging what could be wrong, and I'm not making much progress. Thank you for any advice.

In answer to my own question, it appears I was miscalculating the number of bytes written. I should not use the total_out_* members, which hold the running total for the whole stream rather than the output of a single call. The following correction works properly:

bzBytesWritten = sizeof(bzBuffer) - bzStream.avail_out;
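To illustrate why adding the total_out_* value after every call overcounts, here is a minimal sketch that needs no bzip2 at all. It assumes only that the library's counter is a running total for the stream; the function names and the per-call sizes used with them are hypothetical, chosen for illustration.

```c
#include <stdio.h>

/* Sum of the per-call byte counts: what actually reaches the output file. */
static unsigned long sum_per_call(const unsigned long *perCall, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += perCall[i];
    return total;
}

/* Reproduces the overcount: a running total (standing in for total_out_*)
 * is added to the cumulative counter after every call. */
static unsigned long sum_running_totals(const unsigned long *perCall, size_t n)
{
    unsigned long running = 0;   /* plays the role of total_out_lo32 */
    unsigned long reported = 0;  /* plays the role of cumulativeBytesWritten */
    for (size_t i = 0; i < n; i++) {
        running += perCall[i];
        reported += running;
    }
    return reported;
}
```

With hypothetical per-call sizes of 400, 300 and 534 bytes, sum_per_call() returns 1234 while sum_running_totals() returns 2334 (400 + 700 + 1234), which is the same kind of gap described in the question.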

The rest of the calculations follow from this per-call count.
