
Read & Seek in gzip files in Perl

I am trying to read a given set of gzip or plain XML files and print portions of them into output XML files, based on given offset and length values.

The offset values are the keys of the hash %offhash, and the corresponding values are the lengths.

Here is the function I use to generate the output files:

sub fileproc {
    my $infile  = shift;
    my $outfile = shift;
    my $FILEH;
    $| = 1;

    # Output name is the input name without a trailing ".gz"
    $outfile =~ s/\.gz$//;

    if ($infile =~ m/\.gz$/i) {
        # Compressed input: read the decompressed stream from a gunzip pipe
        open( $FILEH,"gunzip -c $infile | ") or die "Could not open input $infile";
    }
    else {
        open( $FILEH, "<", $infile) or die "Could not open input $infile";
    }

    open(my $OUTH, ">", $outfile) or die "Couldn't open file, $!";

    # Copy each (offset => length) range from the input to the output
    foreach my $offset (sort { $a <=> $b } keys %offhash) {
        my $record = "";
        seek($FILEH, $offset, 0);
        read($FILEH, $record, $offhash{$offset}, 0);
        print $OUTH "$record";
    }
    close $FILEH;
    close $OUTH;
}

This function works properly for plain XML input files, but it seems to run into a buffering issue when some (or all) of the files in the input set are .xml.gz files. In that case the output file contains data from previously read .gz input files.

It seems the problem is in this line:

open( $FILEH,"gunzip -c $infile | ") or die "Could not open input $infile";

Can anyone help me resolve this issue?

Thanks in advance.

You can only seek in regular files, not in pipes such as the output of another program or a piped STDIN. If you want to do this, you need to add a buffering layer yourself, but note that you might need to buffer the whole uncompressed file just to be able to seek in it.
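One way to add such a buffering layer, as a minimal sketch: decompress the whole file into a scalar with the core module IO::Uncompress::Gunzip and open an in-memory filehandle on that scalar, which supports seek and read like a regular file. The helper name open_seekable and the sample data are made up for illustration, and this assumes the uncompressed file fits in memory.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical helper: return a seekable read handle for a plain or .gz file.
sub open_seekable {
    my ($infile) = @_;

    if ($infile =~ /\.gz$/i) {
        # Buffer the entire uncompressed content in memory ...
        my $data;
        gunzip($infile => \$data)
            or die "gunzip failed for $infile: $GunzipError";

        # ... and open a filehandle on the scalar; seek/read work on it.
        open(my $fh, '<', \$data)
            or die "Could not open in-memory handle: $!";
        return $fh;
    }

    open(my $fh, '<', $infile) or die "Could not open input $infile: $!";
    return $fh;
}

# Demo with a small temporary .gz file (sample data is made up).
my ($tmpfh, $gzfile) = tempfile(SUFFIX => '.gz', UNLINK => 1);
close $tmpfh;
my $plain = "0123456789abcdefghij";
gzip(\$plain => $gzfile) or die "gzip failed: $GzipError";

my $in = open_seekable($gzfile);
seek($in, 10, 0) or die "seek failed: $!";   # check seek's return value
read($in, my $record, 4);
print "$record\n";                           # prints "abcd"
close $in;
```

In fileproc, a handle obtained this way could replace the gunzip pipe; the seek/read loop itself would stay the same.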

Even if you do not gunzip with an external program but use something like IO::Gzip, you will not be able to seek directly to an arbitrary position. Because of the way gzip (and other compression formats) work, you have to read all the preceding data before you can decompress the data at the current file position. There are ways to limit the amount of preceding data that is needed, but then you have to prepare your gzip file specifically for that, and it will grow bigger. I'm not aware of any module that currently implements this, but I did a proof of concept once, so I know it works.
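Since the offsets in the question are processed in ascending order, reading all the preceding data can be done in a single forward pass without buffering the whole file: read and discard bytes up to each offset, then read the record. The following is only a sketch under that assumption, using the core module IO::Uncompress::Gunzip; the names extract_ranges and read_exact are made up for illustration, and the ranges must not overlap.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Read exactly $len uncompressed bytes from $z, or die on EOF/error.
sub read_exact {
    my ($z, $len) = @_;
    my $data = '';
    while ($len > 0) {
        my $n = $z->read(my $chunk, $len);
        die "Read error: $GunzipError\n" if $n < 0;
        die "Unexpected end of data\n"   if $n == 0;
        $data .= $chunk;
        $len  -= $n;
    }
    return $data;
}

# Hypothetical helper: return the concatenated records for
# { offset => length } ranges, reading the gzip stream front to back.
sub extract_ranges {
    my ($gzfile, $ranges) = @_;
    my $z = IO::Uncompress::Gunzip->new($gzfile)
        or die "Could not open $gzfile: $GunzipError";

    my $pos = 0;     # current position in the uncompressed stream
    my $out = '';
    for my $offset (sort { $a <=> $b } keys %$ranges) {
        die "Range at offset $offset overlaps the previous one\n"
            if $offset < $pos;

        # Discard everything before the wanted offset, in bounded chunks.
        while ($pos < $offset) {
            my $skip = $offset - $pos;
            $skip = 65536 if $skip > 65536;
            read_exact($z, $skip);
            $pos += $skip;
        }

        $out .= read_exact($z, $ranges->{$offset});
        $pos += $ranges->{$offset};
    }
    $z->close;
    return $out;
}

# Demo with a small temporary .gz file (sample data is made up).
my ($tmpfh, $gzfile) = tempfile(SUFFIX => '.gz', UNLINK => 1);
close $tmpfh;
my $plain = "0123456789abcdefghij";
gzip(\$plain => $gzfile) or die "gzip failed: $GzipError";

print extract_ranges($gzfile, { 2 => 3, 10 => 4 }), "\n";   # prints "234abcd"
```

This trades random access for a forward-only scan: memory use stays small, but every range still costs decompressing everything before it.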
