简体   繁体   中英

Awk: Sum up column values across multiple files with identical column layout

I have a number of files with the same header:

COL1, COL2, COL3, COL4

You can ignore COL1-COL3. COL4 contains a number. Each file contains about 200 rows. I am trying to sum up across the rows. For example:

File 1

COL1 COL2 COL3 COL4
 x    y   z    3
 a    b   c    4

File 2

COL1 COL2 COL3 COL4
 x     y    z   5 
 a     b    c   10 

Then a new file is returned:

COL1 COL2 COL3 COL4
 x     y    z   8 
 a     b    c   14

Is there a simple way to do this without AWK? I will use AWK if need be, I just thought there might be a simple one-liner that I could just run right away. The AWK script I have in mind feels a bit long.

Thanks

If all files have the same header - awk solution:

awk '!f && FNR==1{ f=1; print $0 }FNR>1{ s[FNR]+=$NF; $NF=""; r[FNR]=$0 }
      END{ for(i=2;i<=FNR;i++) print r[i],s[i] }' File[12]

The output (for 2 files):

COL1 COL2 COL3 COL4
x y z 8
a b c 14

This approach can be applied to multiple files (in that case you may specify globbing File* for filename expansion )

Combining paste with awk , as in Kristo Mägi's answer , is your best bet:

  • paste merges the corresponding lines from the input files,
  • which sends a single stream of input lines to awk , with each input line containing all fields to sum up.

Assuming a fixed number of input files and columns , Kristo's answer can be simplified to (making processing much more efficient):

paste file1 file2 | awk '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'

Note: The above produces space -separated output columns, because awk 's default value for OFS , the output-field separator, is a single space.


Assuming that all files have the same column structure and line count, below is a generalization of the solution , which:

  • generalizes to more than 2 input files (and more than 2 data rows)
  • generalizes to any number of fields , as long as the field to sum up is the last one.
#!/bin/bash

files=( file1 file2 ) # array of input files
paste "${files[@]}" | awk -v numFiles=${#files[@]} -v OFS='\t' '
  {
    row = sep = ""
    for(i=1; i < NF/numFiles; ++i) { row = row sep $i; sep = OFS }
    sum = $(NF/numFiles) # last header col. / (1st) data col. to sum
    if (NR > 1) { for(i=2; i<=numFiles; ++i) sum += $(NF/numFiles * i) } # add other cols.
    printf "%s%s%s\n", row, OFS, sum
  }
'

Note that \\t (the tab char.) is used to separate output fields and that, due to relying on awk 's default line-splitting into fields, preserving the exact input whitespace between fields is not guaranteed.

You state you have "a number of files". ie, more than 2.

Given these 3 files (and should work with any number):

$ cat f1 f2 f3
COL1 COL2 COL3 COL4
 x    y   z    3
 a    b   c    4
COL1 COL2 COL3 COL4
 x     y    z   5 
 a     b    c   10 
COL1 COL2 COL3 COL4
 x     y    z   10 
 a     b    c   15 

You can do:

$ awk 'FNR==1{next}
     {sum[$1]+=$4}
     END{print "COL1 COL4"; 
         for (e in sum) print e, sum[e]} ' f1 f2 f3
COL1 COL4
x 18
a 29

It is unclear what you intend to do with COL2 or COL3, so I did not add that.

One more option.

The command:

paste f{1,2}.txt | sed '1d' | awk '{print $1,$2,$3,$4+$8}' | awk 'BEGIN{print "COL1","COL2","COL3","COL4"}1'

The result:

COL1 COL2 COL3 COL4
x y z 8
a b c 14

What it does:

Test files:

$ cat f1.txt
COL1 COL2 COL3 COL4
 x    y   z    3
 a    b   c    4

$ cat f2.txt
COL1 COL2 COL3 COL4
 x     y    z   5
 a     b    c   10

Command: paste f{1,2}.txt
Joins 2 files and gives output:

COL1 COL2 COL3 COL4 COL1 COL2 COL3 COL4
 x    y   z    3     x     y    z   5
 a    b   c    4     a     b    c   10

Command: sed '1d'
Is meant to remove header temporarily

Command: awk '{print $1,$2,$3,$4+$8}'
Returns COL1-3 and sums $4 and $8 from paste result.

Command: awk 'BEGIN{print "COL1","COL2","COL3","COL4"}1'
Adds header back

EDIT:
Following @mklement0 comment, he is right about header handling as I forgot the NR==1 part.

So, I'll proxy his updated version here also:

paste f{1,2}.txt | awk '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'
$ awk '
     NR==1 { print }
     { sum[FNR]+=$NF; sub(/[^[:space:]]+[[:space:]]*$/,""); pfx[FNR]=$0 }
     END { for(i=2;i<=FNR;i++) print pfx[i] sum[i] }
' file1 file2
COL1 COL2 COL3 COL4
 x     y    z   8
 a     b    c   14

The above will work robustly and efficiently with any awk on any UNIX system, with any number of input files and with any contents of those files. The only potential problem would be that it has to retain the equivalent of 1 of those files in memory so if each file was absolutely massive then you may exhaust available memory.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM