How do I extract all the data from a bzip2 archive with C?

Question

I have a concatenated file made up of some number of bzip2 archives. I also know the sizes of the individual bzip2 chunks in that file.

I would like to decompress the bzip2 stream in an individual bzip2 data chunk and write the output to standard output.

First I use fseek to move the file position to the first byte of the desired archive, and then read a chunk of the known size from the file with a BZ2_bzRead call:

int headerSize = 1234;
int firstChunkSize = 123456;
FILE *fp = fopen("pathToConcatenatedFile", "r+b");
char *bzBuf = malloc(sizeof(char) * firstChunkSize);
int bzError, bzNBuf;
BZFILE *bzFp = BZ2_bzReadOpen(&bzError, *fp, 0, 0, NULL, 0);

# move cursor past header of known size, to the first bzip2 "chunk"
fseek(*fp, headerSize, SEEK_SET); 

while (bzError != BZ_STREAM_END) {
    # read the first chunk of known size, decompress it
    bzNBuf = BZ2_bzRead(&bzError, bzFp, bzBuf, firstChunkSize);
    fprintf(stdout, bzBuf);
}

BZ2_bzReadClose(&bzError, bzFp);
free(bzBuf);
fclose(fp);

The problem is that when I compare the output of the fprintf statement with the output from running bzip2 on the command line, I get two different answers.

Specifically, I get less output from this code than from running bzip2 on the command line.

More specifically, the output from this code is a subset of the output from the command-line process, and I am missing the tail end of the bzip2 chunk of interest.

I have verified through another technique that the command-line bzip2 gives the correct answer, so some problem in my C code is causing output at the end of the chunk to go missing. I just don't know what that problem is.

If you are familiar with bzip2 or libbzip2, can you provide any advice on what I am doing wrong in the code sample above? Thank you for your advice.

Answer 1

This is my source code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <bzlib.h>

int
bunzip_one(FILE *f) {
  int bzError;
  BZFILE *bzf;
  char buf[4096];

  bzf = BZ2_bzReadOpen(&bzError, f, 0, 0, NULL, 0);
  if (bzError != BZ_OK) {
    fprintf(stderr, "E: BZ2_bzReadOpen: %d\n", bzError);
    return -1;
  }

  while (bzError == BZ_OK) {
    int nread = BZ2_bzRead(&bzError, bzf, buf, sizeof buf);
    if (bzError == BZ_OK || bzError == BZ_STREAM_END) {
      size_t nwritten = fwrite(buf, 1, nread, stdout);
      if (nwritten != (size_t) nread) {
        fprintf(stderr, "E: short write\n");
        return -1;
      }
    }
  }

  if (bzError != BZ_STREAM_END) {
    fprintf(stderr, "E: bzip error after read: %d\n", bzError);
    return -1;
  }

  BZ2_bzReadClose(&bzError, bzf);
  return 0;
}

int
bunzip_many(const char *fname) {
  FILE *f;

  f = fopen(fname, "rb");
  if (f == NULL) {
    perror(fname);
    return -1;
  }

  fseek(f, 0, SEEK_SET);
  if (bunzip_one(f) == -1)
    return -1;

  fseek(f, 42, SEEK_SET); /* hello.bz2 is 42 bytes long in my case */
  if (bunzip_one(f) == -1)
    return -1;

  fclose(f);
  return 0;
}

int
main(int argc, char **argv) {
  if (argc < 2) {
    fprintf(stderr, "usage: bunz <fname>\n");
    return EXIT_FAILURE;
  }
  return bunzip_many(argv[1]) != 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

I cared very much about proper error checking. For example, I made sure that bzError was BZ_OK or BZ_STREAM_END before trying to access the buffer. The documentation clearly says that for other values of bzError the returned number is undefined.
It shouldn't frighten you that about 50 percent of the code is concerned with error handling. That's how it should be. Expect errors everywhere.
The code still has some bugs. In case of errors it doesn't release the resources (f, bzf) properly.
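One more detail worth spelling out: the code above writes with fwrite and the byte count returned by BZ2_bzRead, instead of passing the buffer to fprintf as a format string as the question does. Decompressed data is not NUL-terminated and may contain NUL or '%' bytes, so string-oriented output can't be relied on to emit exactly the bytes that were read. A minimal sketch of a counted write (the helper name write_exact is made up for illustration):

```c
#include <stdio.h>

/* Hypothetical helper: write exactly n bytes of buf to out.
 * Nothing here treats buf as a C string, so embedded NUL bytes
 * and '%' characters are written unchanged.
 * Returns 0 on success, -1 on a short write. */
int write_exact(FILE *out, const char *buf, size_t n) {
  return fwrite(buf, 1, n, out) == n ? 0 : -1;
}
```

In the read loop this would be called as write_exact(stdout, buf, (size_t) nread), and only after checking that bzError is BZ_OK or BZ_STREAM_END.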

And these are the commands I used for testing:

$ echo hello > hello
$ echo world > world
$ bzip2 hello
$ bzip2 world
$ cat hello.bz2 world.bz2 > helloworld.bz2
$ gcc -W -Wall -Os -o bunz bunz.c -lbz2
$ ls -l *.bz2
-rw-r--r-- 1 roland None 42 Oct 12 09:26 hello.bz2
-rw-r--r-- 1 roland None 86 Oct 12 09:36 helloworld.bz2
-rw-r--r-- 1 roland None 44 Oct 12 09:26 world.bz2
$ ./bunz.exe helloworld.bz2 
hello
world
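The hard-coded 42 in bunzip_many is simply the size of hello.bz2, as the ls output shows. Since the question says the header size and the chunk sizes are known, the seek offset for any chunk could be computed instead of hard-coded. Below is a small sketch of that arithmetic; the function name chunk_offset and the assumed layout (header first, then the chunks back to back) are my own assumptions. As far as I know, the BZFILE interface may read ahead past the end of a stream, so seeking explicitly before each BZ2_bzReadOpen, as the code above does, is the safer approach.

```c
#include <stddef.h>

/* Hypothetical helper: byte offset of chunk `index` in a file laid out as
 * [header][chunk 0][chunk 1]..., where sizes[i] is the length of chunk i. */
long chunk_offset(long header_size, const long *sizes, size_t index) {
  long offset = header_size;
  for (size_t i = 0; i < index; i++)
    offset += sizes[i];
  return offset;
}
```

For helloworld.bz2 there is no header and the sizes are {42, 44}, so chunk 1 starts at offset 42, which matches the fseek in bunzip_many. The result would be passed to fseek(f, offset, SEEK_SET) before decompressing that chunk.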

Answer 1: score 6, accepted, 2010-10-12 07:42:05
