
Split file into several files based on condition and also number of lines approximately

I have a large file with a sample as below:

A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555
A222, 00000, 555
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555

It's a sample file which has order headers (00000) and related order details (00100, 00200, etc.). I want to split the file into pieces of around 40000 lines each, such that each order header stays in the same file as its order details.

I used GNU parallel to achieve the split into 40000-line pieces, but I am not able to make the split satisfy the condition that an order header and its related order details always end up together in the same file, while each file still has around 40000 lines.

For the above sample file, if I wanted to split it into files of around 5 lines each, I would use the following:

parallel --pipe -N5 'cat > sample_{#}.txt' <sample.txt

But that would give me:

sample_1.txt
A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555
A222, 00000, 555

sample_2.txt
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555

That would put the second order header in the first file and its related order details in the second one.

The desired output would be:

sample_1.txt
A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555

sample_2.txt
A222, 00000, 555
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555

You may try this code:

( export hdr=$(head -1 sample.txt); parallel  --pipe -N4 '{ echo "$hdr"; cat; } > sample_{#}.txt' < <(tail -n +2 sample.txt) )

We basically keep the header row separate and run the split on the remaining lines, prepending the header to each split file.
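For the eight-line sample above with -N4, this should produce two files, each starting with a copy of the first line (illustrative expected output, not captured from an actual run):

sample_1.txt
A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555
A222, 00000, 555

sample_2.txt
A222, 00000, 555
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555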

Single record:

cat file | parallel --pipe --recstart 'A222, 00000, 555' -n1 'echo Single record;cat'

Multiple records (up to --block-size):

cat file | parallel --pipe --recstart 'A222, 00000, 555' --block-size 100 'echo Multiple records;cat'

If 'A222' does not stay the same:

cat file | parallel -k --pipe --regexp --recstart '[A-Z]\d+, 00000' -N1 'echo Single record;cat'
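To get close to the ~40000-line target while keeping each order intact, you could raise the record count per job. The sketch below assumes an order averages roughly 8 lines, so 5000 records per job would give files of about 40000 lines; the value is only an assumption and should be tuned to your data:

cat file | parallel -k --pipe --recstart 'A222, 00000, 555' -N5000 'cat > sample_{#}.txt'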

When each order header has a lot of records, you might consider the simple:

csplit -z sample.txt '/00000,/' '{*}'

This will create one file per order header. It ignores the ~40K-line requirement and might produce a very large number of files, so it is only a viable solution when you have a limited number (perhaps 40?) of different order headers.
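By default csplit names the pieces xx00, xx01, and so on. If you prefer friendlier names, GNU csplit's -f (prefix) and -b (suffix format) options can be added; the file-name pattern below is only an illustration:

csplit -z -f order_ -b '%05d.txt' sample.txt '/00000,/' '{*}'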

When you do want different order headers combined in one file, consider:

awk -v max=40000 '
   # flush(): append the buffered order to the current output file,
   # starting a new file when adding it would exceed max lines
   # (or when no file has been opened yet)
   function flush() {
      if (last+nr>max || sample==0) {
         outfile="sample_" sample++ ".txt";
         last=0;
      }
      for (i=0;i<nr;i++) print a[i] >> outfile;
      last+=nr;   # lines written to the current file so far
      nr=0;       # clear the buffer
   }
   BEGIN { sample=0 }
   /00000,/ { flush(); }   # new order header: flush the previous order first
   { a[nr++]=$0 }          # buffer every line of the current order
   END { flush() }         # write the final buffered order
   ' sample.txt
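As a quick sanity check (a suggested verification, not part of the answer itself), you can confirm the per-file line counts and that every output file starts with an order header:

wc -l sample_*.txt
awk 'FNR==1 && $2!="00000," {print FILENAME ": does not start with an order header"}' sample_*.txt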
