Print sum of Nth column in header of file with existing rows (bash)

Question

I have an input file with billions of records and a header. The header consists of meta information, the total number of rows, and the sum of the sixth column. I am splitting the file into smaller files, so the header of each piece must be updated, because both the row count and the sum of the sixth column change.

This is a sample file:

filename: testFile.text

00|STMT|08-09-2022 13:24:56||5|13.10|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP

Expected:

filename: testFile_1.text

00|STMT|08-09-2022 13:24:56||3|6.10|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP

filename: testFile_2.text

00|STMT|08-09-2022 13:24:56||2|7.00|SHA2
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP

I am able to split the file and calculate the sum, but I am unable to replace the values in the header. This is the script I have so far:

#!/bin/bash

splitRowCount=$1
transactionColumn=$2

filename=$(basename -- "$3")
extension="${filename##*.}"
nameWithoutExt="${filename%.*}"

echo "splitRowCount: $splitRowCount"
echo "transactionColumn: $transactionColumn"


awk 'NR == 1 { head = $0 } NR % '$splitRowCount' == 2 { filename = "'$nameWithoutExt'_" int((NR-1)/'$splitRowCount')+1 ".'$extension'"; print head > filename } NR != 1 { print >> filename }' $filename


 ls *.txt | while read line
 do
  firstLine=$(head -n 1 $line);
  awk -F '|'  'NR !=1 {sum += '$transactionColumn'}END {print sum} '   $line
 done

Answer 1

With your shown samples, please try the following awk code (written and tested in GNU awk). I have defined an awk variable named fileInitials, which holds the initial part of the output file name (e.g. testFile), and a variable named extension, which holds the output file's extension (e.g. .txt). The variable lines sets how many lines you want in each output file, and count is the number given to the first output file.

You do not need to combine shell and awk code; this can be done in a single awk call, as shown below.

awk -v count="1" -v fileInitials="testFile" -v extension=".txt" -v lines="3" '
BEGIN { FS=OFS="|" }
FNR==1{
  # Keep fields 1-4 and everything after field 6; the row count (field 5)
  # and the sum (field 6) are recomputed for every output file.
  match($0,/^([^|]*\|[^|]*\|[^|]*\|[^|]*)\|[^|]*\|[^|]*(.*)/,arr)
  header1=arr[1]
  header2=arr[2]
  outputFile=(fileInitials count extension)
  next
}
{
  # A new chunk has started: write out the previous one.
  if(prev && prev!=count){
    print (header1,rows,sum header2 ORS val) > (outputFile)
    close(outputFile)
    outputFile=(fileInitials count extension)
    rows=sum=0
    val=""
  }
  rows++
  sum+=$6
  val=(val?val ORS:"") $0
  prev=count
  count=(++countline%lines==0?++count:count)
}
END{
  if(count && val){
     print (header1,rows,sum header2 ORS val) > (outputFile)
     close(outputFile)
  }
}
'   Input_file
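
The three-argument form of match() used above is a GNU awk extension. As a minimal sketch, assuming only a POSIX awk is available and that the row count and sum have already been computed, the header can instead be rebuilt by assigning to fields 5 and 6. The helper name rewrite_header is hypothetical and is not part of the answer above.

```shell
# rewrite_header ROWS SUM (hypothetical helper): reads a header line on stdin
# and prints it with field 5 set to ROWS and field 6 set to SUM.
rewrite_header() {
  awk -v rows="$1" -v sum="$2" '
    BEGIN { FS = OFS = "|" }
    { $5 = rows; $6 = sum; print }   # assigning a field rebuilds $0 using OFS
  '
}

# Example: update the sample header for a chunk of 3 rows totalling 6.10.
printf '%s\n' '00|STMT|08-09-2022 13:24:56||5|13.10|SHA2' | rewrite_header 3 6.10
```

Because OFS is set to the same character as FS, the rebuilt line keeps the original delimiters, including the empty fourth field.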

Answer 2

Here's an awk solution for splitting the original file into files of n records. The idea is to accumulate the records until the given count is reached, then generate a file with the updated header and the accumulated records:

n=3
file=./testFile.text

awk -v numRecords="$n"  '
    BEGIN {
        FS = OFS = "|"

        if ( match(ARGV[1],/[^\/]\.[^\/]*$/) ) {
            filePrefix = substr(ARGV[1],1,RSTART)
            fileSuffix = substr(ARGV[1],RSTART+1)
        } else {
            filePrefix = ARGV[1]
            fileSuffix = ""
        }

        if (getline headerStr <= 0)
            exit 1
        split(headerStr, headerArr)
    }
    (NR-2) % numRecords == 0 && recordsCount {
        outfile = filePrefix "_" ++filesCount fileSuffix
        print headerArr[1],headerArr[2],headerArr[3],headerArr[4],recordsCount,recordsSum,headerArr[7] > outfile
        printf("%s", records) > outfile
        close(outfile)
        
        records = ""
        recordsCount = recordsSum = 0
    }
    {
        records = records $0 ORS
        recordsCount++
        recordsSum += $6
    }
    END {
        if (recordsCount) {
            outfile = filePrefix "_" ++filesCount fileSuffix
            print headerArr[1],headerArr[2],headerArr[3],headerArr[4],recordsCount,recordsSum,headerArr[7] > outfile
            printf("%s", records) > outfile
            close(outfile)
        }
    }
' "$file"

With the given sample you'll get:

testFile_1.text

00|STMT|08-09-2022 13:24:56||3|6.1|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP

testFile_2.text

00|STMT|08-09-2022 13:24:56||2|7|SHA2
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP
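
The headers above show 6.1 and 7, whereas the expected output in the question has 6.10 and 7.00. awk prints numbers with its default output format, which drops trailing zeros. A minimal sketch of forcing two decimals, assuming that is the precision wanted; the helper name sum_col6 is hypothetical:

```shell
# sum_col6 FILE (hypothetical helper): prints the column-6 total of the data
# rows (line 2 onward) with exactly two decimals.
sum_col6() {
  awk -F'|' 'NR > 1 { sum += $6 } END { printf "%.2f\n", sum }' "$1"
}

# Default numeric output versus explicit formatting of the same value.
awk 'BEGIN { sum = 3.10 + 2.00 + 1.00; print sum; printf "%.2f\n", sum }'
```

In the script above, the same effect could be obtained by printing sprintf("%.2f", recordsSum) in place of recordsSum in the two header print statements.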

Answer 1
2 2022-10-04 10:39:51

Answer 2
2 Accepted 2022-10-04 11:22:30
