Splitting a large file into small files based on column value in Perl

I am trying to split a large file (around 17.6 million rows) into 6-7 smaller files based on a column value. Currently, I use the SQL bcp utility to load all the data into one table and then create the separate files with bcp out.

Someone suggested that I use Perl instead, as it would be faster and would not require creating a table. I am not a Perl person, so I am not sure how to do this in Perl. Any help would be appreciated.

Input file:

inputfile.txt

0010|name|address|city|.........
0020|name|number|address|......
0030|phone no|state|street|...

Output files:

0010.txt

0010|name|address|city|.........

0020.txt

0020|name|number|address|......

0030.txt

0030|phone no|state|street|...

It is simplest to keep a hash of output file handles, keyed by file name. This program shows the idea. The number at the start of each record is used to build the name of the file where the record belongs, and a file of that name is opened unless we already have a file handle for it.

All of the handles are closed once all of the data has been processed. Failures of open and close are caught by use autodie, so explicit checking of those calls is unnecessary. autodie does not cover print, but a failed buffered write is normally reported when the handle is closed.

use strict;
use warnings;
use autodie;

open my $in_fh, '<', 'inputfile.txt';

my %out_fh;

while (<$in_fh>) {
  next unless /^(\d+)/;
  my $filename = "$1.txt";
  open $out_fh{$filename}, '>', $filename unless $out_fh{$filename};
  print { $out_fh{$filename} } $_;
}

close $_ for values %out_fh;

Note: close caught me out here because, unlike most operators that work on $_ when given no arguments, a bare close closes the currently selected file handle. That is a bad choice in my opinion, but it is far too late to change it now.
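Since the question asks about splitting on a column value, the same hash-of-handles idea can be sketched for an arbitrary pipe-delimited column. This is only an illustration under assumptions: split_by_column and its parameters are hypothetical names, the key values are assumed to be safe to use as file names, and the number of distinct keys is assumed to be small (6-7 in the question) so that all output files can stay open at once.

```perl
use strict;
use warnings;
use autodie;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: write each line of $input to "<key>.txt" inside
# $outdir, where <key> is the pipe-delimited field at index $col (0-based).
# Returns a hash ref mapping each key to the number of lines written.
sub split_by_column {
    my ($input, $col, $outdir) = @_;
    my (%out_fh, %count);

    open my $in_fh, '<', $input;
    while (my $line = <$in_fh>) {
        chomp(my $copy = $line);
        # Escape the pipe: an unescaped | is regex alternation.
        # The limit stops splitting once the wanted field is isolated.
        my $key = (split /\|/, $copy, $col + 2)[$col];
        next unless defined $key && length $key;

        # Open each output file once and reuse the handle afterwards.
        $out_fh{$key} //= do {
            open my $fh, '>', File::Spec->catfile($outdir, "$key.txt");
            $fh;
        };
        print { $out_fh{$key} } $line;
        $count{$key}++;
    }
    close $in_fh;
    close $_ for values %out_fh;
    return \%count;
}

# Small self-contained demonstration using a temporary directory.
my $dir   = tempdir(CLEANUP => 1);
my $input = File::Spec->catfile($dir, 'inputfile.txt');
open my $fh, '>', $input;
print {$fh} "0010|a|b\n", "0020|c|d\n", "0010|e|f\n";
close $fh;

my $count = split_by_column($input, 0, $dir);
print "$_: $count->{$_} line(s)\n" for sort keys %$count;
```

For the real data, the call would presumably look like split_by_column('inputfile.txt', 0, '.'), writing the output files into the current directory.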

17.6 million rows is going to be a pretty large file, I'd imagine. It will still be slow to process with Perl.

That said, you're going to want something like the below:

use strict;
use warnings;

my $input = 'FILENAMEHERE.txt';
my %results;

open(my $fh, '<', $input) or die "Cannot open input file $input: $!";
while (<$fh>) {
  # Escape the pipe: an unescaped | is regex alternation, not a literal.
  my ($key) = split /\|/, $_;
  # Collect each line (newline included) under its key.
  push @{ $results{$key} }, $_;
}
close($fh);

for my $key (keys %results) {
  my $filename = "$key.txt";
  open(my $out, '>', $filename) or die "Cannot open output file $filename: $!";
  # Lines still carry their newlines, so no join is needed.
  print {$out} @{ $results{$key} };
  close($out) or die "Cannot close $filename: $!";
}

I haven't explicitly tested this, but it should get you going in the right direction.
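One detail matters when splitting a pipe-delimited record: the first argument to split is a pattern, even when written as a quoted string, so a bare | means regex alternation between two empty patterns rather than a literal pipe. A minimal sketch of the difference, using a made-up sample line:

```perl
use strict;
use warnings;

# Sample record in the same shape as the question's input (made up).
my $line = "0010|name|address|city";

# Unescaped: the pattern matches the empty string, so the line is split
# between characters and the first "field" is a single character.
my ($wrong) = split /|/, $line;

# Escaped: the pattern matches a literal pipe, giving the real first field.
my ($right) = split /\|/, $line;

print "unescaped: $wrong\n";
print "escaped:   $right\n";
```

With the unescaped pattern every record would be keyed on its first character only, so all of the sample rows would land in one file.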

$ perl -F'\|' -lane '
    $key = $F[0];
    $fh{$key} or open $fh{$key}, ">", "$key.txt" or die $!;
    print { $fh{$key} } $_
  ' inputfile.txt

$ perl -Mautodie -ne'
  sub out { $h{$_[0]} ||= open(my $f, ">", "$_[0].txt") && $f }
  print { out($1) } $_ if /^(\d+)/;
' file
