How do I extract all the data from a bzip2 archive with C?

I have a concatenated file made up of some number of bzip2 archives. I also know the sizes of the individual bzip2 chunks in that file.

I would like to decompress a bzip2 stream from an individual bzip2 data chunk, and write the output to standard output.

First I use fseek to move the file cursor to the first byte of the desired archive, and then read a chunk of the known size from the file with a BZ2_bzRead call:

int headerSize = 1234;
int firstChunkSize = 123456;
FILE *fp = fopen("pathToConcatenatedFile", "r+b");
char *bzBuf = malloc(sizeof(char) * firstChunkSize);
int bzError, bzNBuf;
BZFILE *bzFp = BZ2_bzReadOpen(&bzError, fp, 0, 0, NULL, 0);

/* move cursor past header of known size, to the first bzip2 "chunk" */
fseek(fp, headerSize, SEEK_SET);

while (bzError != BZ_STREAM_END) {
    /* read the first chunk of known size, decompress it */
    bzNBuf = BZ2_bzRead(&bzError, bzFp, bzBuf, firstChunkSize);
    fprintf(stdout, bzBuf);
}

BZ2_bzReadClose(&bzError, bzFp);
free(bzBuf);
fclose(fp);

The problem is that when I compare the output of the fprintf statement with the output from running bzip2 on the command line, I get two different answers.

Specifically, I get less output from this code than from running bzip2 on the command line.

More specifically, the output from this code is a subset of the output from the command-line process, and I am missing the tail end of the bzip2 chunk of interest.

I have verified through another technique that the command-line bzip2 is providing the correct answer, so some problem in my C code is causing output at the end of the chunk to go missing. I just don't know what that problem is.

If you are familiar with bzip2 or libbzip2, can you provide any advice on what I am doing wrong in the code sample above? Thank you for your advice.
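
One detail in the loop above is independent of libbzip2 and may be worth ruling out: `fprintf(stdout, bzBuf)` treats the decompressed bytes as a format string, so output stops at the first zero byte, any `%` sequence is interpreted, and the byte count returned by `BZ2_bzRead` is ignored. This is not necessarily the cause of the missing tail. The sketch below only contrasts the two ways of writing a buffer; `write_exact` and `write_as_string` are hypothetical helper names, not part of libbzip2.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: write exactly n bytes of buf to out.
 * Returns 0 on success, -1 on a short write. */
int write_exact(FILE *out, const char *buf, size_t n) {
    return fwrite(buf, 1, n, out) == n ? 0 : -1;
}

/* For comparison only: treats buf as a C string, so output stops at the
 * first '\0' no matter how many bytes were actually decompressed. */
int write_as_string(FILE *out, const char *buf) {
    return fputs(buf, out) == EOF ? -1 : 0;
}
```

In a read loop this would be called as `write_exact(stdout, bzBuf, (size_t) bzNBuf)`, and only after checking that `bzError` is `BZ_OK` or `BZ_STREAM_END`.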

This is my source code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <bzlib.h>

int
bunzip_one(FILE *f) {
  int bzError;
  BZFILE *bzf;
  char buf[4096];

  bzf = BZ2_bzReadOpen(&bzError, f, 0, 0, NULL, 0);
  if (bzError != BZ_OK) {
    fprintf(stderr, "E: BZ2_bzReadOpen: %d\n", bzError);
    return -1;
  }

  while (bzError == BZ_OK) {
    int nread = BZ2_bzRead(&bzError, bzf, buf, sizeof buf);
    if (bzError == BZ_OK || bzError == BZ_STREAM_END) {
      size_t nwritten = fwrite(buf, 1, nread, stdout);
      if (nwritten != (size_t) nread) {
        fprintf(stderr, "E: short write\n");
        return -1;
      }
    }
  }

  if (bzError != BZ_STREAM_END) {
    fprintf(stderr, "E: bzip error after read: %d\n", bzError);
    return -1;
  }

  BZ2_bzReadClose(&bzError, bzf);
  return 0;
}

int
bunzip_many(const char *fname) {
  FILE *f;

  f = fopen(fname, "rb");
  if (f == NULL) {
    perror(fname);
    return -1;
  }

  fseek(f, 0, SEEK_SET);
  if (bunzip_one(f) == -1)
    return -1;

  fseek(f, 42, SEEK_SET); /* hello.bz2 is 42 bytes long in my case */
  if (bunzip_one(f) == -1)
    return -1;

  fclose(f);
  return 0;
}

int
main(int argc, char **argv) {
  if (argc < 2) {
    fprintf(stderr, "usage: bunz <fname>\n");
    return EXIT_FAILURE;
  }
  return bunzip_many(argv[1]) != 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}
  • I cared very much about proper error checking. For example, I made sure that bzError was BZ_OK or BZ_STREAM_END before trying to access the buffer. The documentation clearly says that for other values of bzError the returned number is undefined.
  • It shouldn't frighten you that about 50 percent of the code is concerned with error handling. That's how it should be. Expect errors everywhere.
  • The code still has some bugs. In case of errors it doesn't release the resources (f, bzf) properly.

And these are the commands I used for testing:

$ echo hello > hello
$ echo world > world
$ bzip2 hello
$ bzip2 world
$ cat hello.bz2 world.bz2 > helloworld.bz2
$ gcc -W -Wall -Os -o bunz bunz.c -lbz2
$ ls -l *.bz2
-rw-r--r-- 1 roland None 42 Oct 12 09:26 hello.bz2
-rw-r--r-- 1 roland None 86 Oct 12 09:36 helloworld.bz2
-rw-r--r-- 1 roland None 44 Oct 12 09:26 world.bz2
$ ./bunz.exe helloworld.bz2 
hello
world
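
The test hardcodes the offset 42 for the second stream. Since the question states that the header size and the individual chunk sizes are known, the offset of any chunk can be derived from them instead. A small sketch, where `chunk_offset` is a hypothetical helper name and the sizes are assumed to be stored in file order:

```c
#include <stddef.h>

/* Hypothetical helper: byte offset of chunk `index` in a file that starts
 * with `header_size` bytes followed by chunks of the given sizes. */
long chunk_offset(long header_size, const long *sizes, size_t index) {
    long off = header_size;
    for (size_t i = 0; i < index; i++)
        off += sizes[i];
    return off;
}
```

With the sizes from the `ls -l` listing above (42 and 44 bytes, no header), `chunk_offset(0, sizes, 1)` gives 42, the same value passed to the second `fseek` call in `bunzip_many`.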

