
How do I read a huge .gz file (more than 5 gig uncompressed) in C?

I have some .gz compressed files which are around 5-7 gig uncompressed. These are flat files.

I've written a program that takes an uncompressed file and reads it line by line, which works perfectly.

Now I want to be able to open the compressed files in memory and run my little program.

I've looked into zlib but I can't find a good solution.

Loading the entire file is impossible using gzread(gzFile, void *, unsigned), because of the 32-bit unsigned int limitation.

I've tried gzgets, but this almost doubles the execution time compared to reading with gzread. (I tested on a 2 gig sample.)

I've also looked into "buffering", such as splitting the gzread process into multiple 2 gig chunks, finding the last newline using strrchr, and then setting the gzseek. But gzseek will emulate a full decompression of the file, which is very slow.

I fail to see any sane solution to this problem. I could always do some checking of whether or not the current line actually ends with a newline (which should only happen in the last, partially read line), and then read more data from the point in the program where this occurs. But this could get very ugly.

Does anyone have any suggestions?

thanks

edit: I don't need to have the entire file at once, just one line at a time, but I've got a fairly huge machine, so if that were the easiest approach I would have no problem with it.

For all those suggesting piping through stdin: I've experienced extreme slowdowns compared to opening the file directly. Here is a small code snippet I made some months ago that illustrates it.

time ./a.out 59846/59846.txt
#       59846/59846.txt
18255221

real    0m4.321s
user    0m2.884s
sys     0m1.424s
time ./a.out <59846/59846.txt
18255221

real    1m56.544s
user    1m55.043s
sys     0m1.512s

And the source code:

#include <iostream>
#include <fstream>
#include <cstdio>
#define LENS 10000

int main(int argc, char **argv){
  std::istream *pFile;

  if(argc==2) //if a filename argument is supplied
    pFile = new std::ifstream(argv[1],std::ios::in);
  else //otherwise use stdin
    pFile = &std::cin;

  char line[LENS];
  if(argc==2) //if we are using a filename, print it.
    printf("#\t%s\n",argv[1]);

  if(!*pFile){ //check the stream state, not the pointer
    printf("Do you have permission to open the file?\n");
    return 0;
  }

  int numRow=0;
  while(pFile->getline(line,LENS)) //getline fails at EOF, which ends the loop
    numRow++;

  if(argc==2)
    delete pFile;
  printf("%d\n",numRow);
  return 0;
}

Thanks for your replies, I'm still waiting for the golden apple.

edit2: using C-style FILE pointers instead of C++ streams is much, much faster. So I think this is the way to go.
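
Roughly, the C-style version could look like the sketch below (this is not the exact code used for the timings above; the error handling is a placeholder):

#include <stdio.h>
#define LENS 10000

int main(int argc, char **argv){
  FILE *fp = stdin;             /* default to stdin, mirroring the stream version */
  char line[LENS];
  int numRow = 0;

  if(argc == 2){                /* a filename was supplied: open it and print it */
    fp = fopen(argv[1], "r");
    if(!fp){
      printf("Do you have permission to open the file?\n");
      return 0;
    }
    printf("#\t%s\n", argv[1]);
  }

  while(fgets(line, LENS, fp))  /* fgets returns NULL at EOF or on error */
    numRow++;

  if(argc == 2)
    fclose(fp);
  printf("%d\n", numRow);
  return 0;
}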

Thanks for all your input.

gzip -cd compressed.gz | yourprogram

Just go ahead and read it line by line from stdin, since it arrives already uncompressed.

EDIT: Response to your remarks about performance. You're saying that reading stdin line by line is slow compared to reading an uncompressed file directly. The difference lies in buffering. Normally a pipe will yield to stdin as soon as output becomes available (no, or very little, buffering there). You can do "buffered block reads" from stdin and parse the blocks yourself to gain performance.
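
For illustration, a minimal sketch of that idea might look like this (the block size is arbitrary, and the newline counting stands in for whatever real per-line processing you need):

#include <stdio.h>
#include <string.h>

#define BLOCK (1 << 20)   /* 1 MiB read block; purely a guess, tune as needed */

int main(void){
  static char buf[BLOCK];
  size_t n;
  long lines = 0;

  /* read stdin in large blocks instead of line by line */
  while((n = fread(buf, 1, BLOCK, stdin)) > 0){
    char *p = buf, *end = buf + n;
    /* scan the block for newlines; a real program would also slice out the lines */
    while((p = memchr(p, '\n', (size_t)(end - p))) != NULL){
      lines++;
      p++;
    }
  }
  printf("%ld\n", lines);
  return 0;
}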

You can achieve the same result, with possibly better performance, by using gzread() as well. (Read a big chunk, parse the chunk, read the next chunk, repeat.)

gzread only reads chunks of the file; you loop on it as you would with a normal read() call.

Do you need to read the entire file into memory?

If what you need is to read lines, you'd gzread() a sizable chunk (say 8192 bytes) into a buffer, loop through that buffer and find all '\n' characters, and process those as individual lines. You'd have to save the last piece in case there is just part of a line, and prepend it to the data you read next time.
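
A minimal sketch of that chunk-and-carry idea could look like the following; CHUNK, handle_line and the assumption that no single line exceeds CHUNK bytes are mine, not part of the original answer:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define CHUNK 65536        /* read size; anything from 8192 up is reasonable */

/* placeholder for whatever per-line work the real program does */
static void handle_line(const char *line, size_t len){
  (void)line; (void)len;
}

int main(int argc, char **argv){
  if(argc != 2){
    fprintf(stderr, "usage: %s file.gz\n", argv[0]);
    return 1;
  }

  gzFile gz = gzopen(argv[1], "rb");
  if(!gz){
    fprintf(stderr, "could not open %s\n", argv[1]);
    return 1;
  }

  char buf[CHUNK];
  size_t carry = 0;        /* bytes of an unfinished line kept from the last read */
  int n;

  /* assumes no single line is longer than CHUNK bytes */
  while((n = gzread(gz, buf + carry, (unsigned)(CHUNK - carry))) > 0){
    size_t avail = carry + (size_t)n;
    char *start = buf, *nl;

    /* hand every complete line in the buffer to the callback */
    while((nl = memchr(start, '\n', avail - (size_t)(start - buf))) != NULL){
      handle_line(start, (size_t)(nl - start));
      start = nl + 1;
    }
    /* move the trailing partial line to the front for the next iteration */
    carry = avail - (size_t)(start - buf);
    memmove(buf, start, carry);
  }

  if(carry > 0)            /* last line without a trailing newline */
    handle_line(buf, carry);

  gzclose(gz);
  return 0;
}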

You could also read from stdin and invoke your app like

zcat bigfile.gz | ./yourprogram

in which case you can use fgets and similar on stdin. This is also beneficial in that you'd run decompression on one processor and process the data on another :-)

I don't know if this will be an answer to your question, but I believe it's more than a comment:

Some months ago I discovered that the contents of Wikipedia can be downloaded in much the same way as the StackOverflow data dump. Both decompress to XML.

I came across a description of how the multi-gigabyte compressed dump file could be parsed. It was done by Perl scripts, actually, but the relevant part for you was that Bzip2 compression was used.

Bzip2 is a block compression scheme, so the compressed file can be split into manageable pieces, and each piece uncompressed individually.

Unfortunately, I don't have a link to share with you, and I can't suggest how you would search for it, except to say that it was described on a Wikipedia 'data dump' or 'blog' page.

EDIT: Actually, I do have a link
