Print the sum of the Nth column in the header of a file with existing rows (bash)
I have an input file with billions of records and a header. The header consists of meta information, the total number of rows, and the sum of the sixth column. I am splitting the file into smaller files, so the header record of each piece must be updated, because the row count and the sum of the sixth column change.
This is the sample record:
filename: testFile.text
00|STMT|08-09-2022 13:24:56||5|13.10|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP
Expected:
filename: testFile_1.text
00|STMT|08-09-2022 13:24:56||3|6.10|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP
filename: testFile_2.text
00|STMT|08-09-2022 13:24:56||2|7.00|SHA2
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP
I am able to split the file and calculate the sum, but I am unable to replace the values in the header. This is the script I have written:
#!/bin/bash
splitRowCount=$1
transactionColumn=$2
filename=$(basename -- "$3")
extension="${filename##*.}"
nameWithoutExt="${filename%.*}"
echo "splitRowCount: $splitRowCount"
echo "transactionColumn: $transactionColumn"
awk 'NR == 1 { head = $0 } NR % '$splitRowCount' == 2 { filename = "'$nameWithoutExt'_" int((NR-1)/'$splitRowCount')+1 ".'$extension'"; print head > filename } NR != 1 { print >> filename }' $filename
ls *.txt | while read line
do
firstLine=$(head -n 1 $line);
awk -F '|' 'NR !=1 {sum += '$transactionColumn'}END {print sum} ' $line
done
With your shown samples, please try the following awk code (written and tested in GNU awk). Here I have defined an awk variable named fileInitials, which contains the initial name of your output files (e.g. testFile), and then extension, which contains the output files' extension (e.g. .txt here). Then comes lines, which is the number of lines you want in each output file.
You do not need to run shell + awk code; this can be done in a single awk, as shown below.
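awk -v count="1" -v fileInitials="testFile" -v extension=".txt" -v lines="3" '
BEGIN { FS=OFS="|" }
# Header line: keep fields 1-4 and everything after field 6.
# Fields 5 (row count) and 6 (sum) are recomputed for each output file.
# The 3-argument form of match() is specific to GNU awk.
FNR==1{
  match($0,/^([^|]*\|[^|]*\|[^|]*\|[^|]*)\|[^|]*\|[^|]*(.*)/,arr)
  header1=arr[1]
  header2=arr[2]
  outputFile=(fileInitials count extension)
  next
}
{
  # count has moved on: write out the chunk collected so far
  if(val!="" && prev!=count){
    print (header1,rows,sum header2 ORS val) > (outputFile)
    close(outputFile)
    outputFile=(fileInitials count extension)
    sum=rows=0
    val=""
  }
  sum+=$6
  rows++
  val=(val?val ORS:"") $0
  prev=count
  # move to the next output file after every "lines" records
  count=(++countline%lines==0?++count:count)
}
END{
  # write out the last (possibly shorter) chunk
  if(count && val){
    print (header1,rows,sum header2 ORS val) > (outputFile)
    close(outputFile)
  }
}
' Input_file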
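If the files have already been split, as the script in the question does, the header can also be patched afterwards instead of during the split. The following is only a minimal sketch of that idea, not part of the answer above: it reads one file twice, first to count the data rows and sum column 6, then to rewrite fields 5 and 6 of the header. The temporary demo file, its shortened hash values, and the variable names (`f`, `col`, `rows`, `sum`, `new_header`) are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical demo file; hash values are shortened for readability.
f=$(mktemp)
cat > "$f" <<'EOF'
00|STMT|08-09-2022 13:24:56||5|13.10|SHA2
10|000070355|hashA|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|hashB|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP
EOF

# Pass 1 (NR == FNR): count data rows and sum the chosen column, skipping the header.
# Pass 2: rewrite fields 5 and 6 of the header, print every other line unchanged.
awk -F'|' -v OFS='|' -v col=6 '
  NR == FNR {
    if (FNR > 1) { rows++; sum += $col }
    next
  }
  FNR == 1 { $5 = rows; $6 = sprintf("%.2f", sum) }
  { print }
' "$f" "$f" > "$f.tmp" && mv "$f.tmp" "$f"

new_header=$(head -n 1 "$f")
echo "$new_header"
```

This reads every split file twice, so for very large inputs it is likely slower than updating the header while splitting.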
Here's an awk solution for splitting the original file into files of n records. The idea is to accumulate the records until the given count is reached, then generate a file with the updated header and the accumulated records:
n=3
file=./testFile.text
awk -v numRecords="$n" '
    BEGIN {
        FS = OFS = "|"
        # Split the input path into a prefix and a suffix (the extension)
        if ( match(ARGV[1],/[^\/]\.[^\/]*$/) ) {
            filePrefix = substr(ARGV[1],1,RSTART)
            fileSuffix = substr(ARGV[1],RSTART+1)
        } else {
            filePrefix = ARGV[1]
            fileSuffix = ""
        }
        # Read the header line from the input file (this sets NR to 1)
        if ( (getline headerStr) <= 0 )
            exit 1
        split(headerStr, headerArr)
    }
    # A full chunk has been accumulated: write it out with an updated header
    (NR-2) % numRecords == 0 && recordsCount {
        outfile = filePrefix "_" ++filesCount fileSuffix
        print headerArr[1],headerArr[2],headerArr[3],headerArr[4],recordsCount,recordsSum,headerArr[7] > outfile
        printf("%s", records) > outfile
        close(outfile)
        records = ""
        recordsCount = recordsSum = 0
    }
    # Accumulate the current record
    {
        records = records $0 ORS
        recordsCount++
        recordsSum += $6
    }
    # Write out the last (possibly shorter) chunk
    END {
        if (recordsCount) {
            outfile = filePrefix "_" ++filesCount fileSuffix
            print headerArr[1],headerArr[2],headerArr[3],headerArr[4],recordsCount,recordsSum,headerArr[7] > outfile
            printf("%s", records) > outfile
            close(outfile)
        }
    }
' "$file"
With the given sample you'll get:
testFile_1.text
00|STMT|08-09-2022 13:24:56||3|6.1|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP
testFile_2.text
00|STMT|08-09-2022 13:24:56||2|7|SHA2
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP
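Note that the sums in these headers come out as 6.1 and 7, while the expected output in the question shows 6.10 and 7.00. This is because awk converts numbers to text with its default format (`%.6g`), which drops trailing zeros and may switch to exponent notation for larger sums. A small sketch of the difference, using `sprintf("%.2f", ...)` to force two decimals; the variable names (`plain`, `fixed`, `big`, `fixed_big`) and the value 1234567.89 are illustrative assumptions only:

```shell
#!/bin/bash
# Default number-to-string conversion drops trailing zeros.
plain=$(awk 'BEGIN { sum = 3.10 + 2.00 + 1.00; print sum }')

# sprintf with "%.2f" always keeps two decimals.
fixed=$(awk 'BEGIN { sum = 3.10 + 2.00 + 1.00; print sprintf("%.2f", sum) }')

# With more than six significant digits the default format loses precision,
# while "%.2f" keeps the full value.
big=$(awk 'BEGIN { print 1234567.89 }')
fixed_big=$(awk 'BEGIN { print sprintf("%.2f", 1234567.89) }')

echo "$plain $fixed"
echo "$big $fixed_big"
```

In the script above this would presumably mean printing `sprintf("%.2f", recordsSum)` in place of `recordsSum` in both `print` statements, assuming the sixth column always carries two decimals.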