Process multiple files using awk

Question

I have to process many text files (16 million rows each) using awk. For example, I have to read ten files like these:

File #1:

en sample_1 200
en.n sample_2 10
en sample_3 10

File #2:

en sample_1 10
en sample_3 67

File #3:

en sample_1 1
en.n sample_2 10
en sample_4 20

...

I would like to have an output like this:

source title f1 f2 f3 sum(f1,f2,f3)

en sample_1 200 10 1 211
en.n sample_2 10 0 10 20
en sample_3 10 67 0 77
en sample_4 0 0 20 20

Here is my first version of the code:

#! /bin/bash
clear
#var declaration
BASEPATH=<path_to_file>
YEAR="2014"
RES_FOLDER="processed"
FINAL_RES="2014_06_01"
#results folder creation
mkdir $RES_FOLDER
#processing
awk 'NF>0{a[$1" "$2]=a[$1" "$2]" "$3}END{for(i in a){print i a[i]}}' $BASEPATH/$YEAR/* > $RES_FOLDER/$FINAL_RES

And here is my output:

en sample_1 200 10 1
en.n sample_2 10 10
en sample_3 10 67
en sample_4 20

I'm a little confused about how to put a zero in a column where no occurrence is found, and how to get the sum of all the values. I know I have to use something like this:

{tot[$1" "$2]+=$3} END{for (key in tot) print key, tot[key]}

I hope someone can help. Thank you.

******** EDITED ********

I'm now trying to get my result in a different way. I wrote the bash script below. It produces a sorted file with all of my keys; that file is huge, about 62 million records. I slice this file into pieces and pass each piece to my awk script.

BASH:

#! /bin/bash
clear
FILENAME=<result>
BASEPATH=<base_path>
mkdir processed/slice
cat $BASEPATH/dataset/* | cut -d' ' -f1,2 > $BASEPATH/processed/aggr
sort -u -k2 $BASEPATH/processed/aggr > $BASEPATH/processed/sorted
split -d -l 1000000 processed/sorted processed/slice/slice-
echo $(date "+START PROCESSING DATE: %d/%m/%y - TIME: %H:%M:%S")
for filename in processed/slice/*; do
  awk -v filename="$filename" -f algorithm.awk dataset/* >> processed/$FILENAME
done
echo $(date "+END PROCESSING DATE: %d/%m/%y - TIME: %H:%M:%S")
rm $BASEPATH/processed/aggr
rm $BASEPATH/processed/sorted
rm -rf $BASEPATH/processed/slice

AWK:

BEGIN{
while(getline < filename){
 key=$1" "$2;
 sources[key];
 for(i=1;i<11;i++){
   keys[key"-"i] = "0";
 }
}
close(filename);
}
{
if(FNR==1){
 ARGIND++;
}
key=$1" "$2;
keys[key"-"ARGIND] = $3
}END{
for (s in sources) {
 sum = 0
 printf "%s", s
 for (j=1;j<11;j++) {
   printf "%s%s", OFS, keys[s"-"j]
   sum += keys[s"-"j]
 }
print " "sum
}
}

With awk I preallocate my final array and then populate it while reading the dataset/ folder. I've figured out that my bottleneck comes from iterating over the dataset folder as awk input (10 files with 16,000,000 lines each). Everything works on a small set of data, but with the real data the RAM (30 GB) fills up. Does anyone have any suggestions or advice? Thank you.

Answer 1

$ cat tst.awk
{
    key = $1" "$2
    keys[key]
    val[key,ARGIND] = $3
}
END {
    for (key in keys) {
        sum = 0
        printf "%s", key
        for (fileNr=1;fileNr<=ARGIND;fileNr++) {
            printf "%s%s", OFS, val[key,fileNr]+0
            sum += val[key,fileNr]
        }
        print OFS sum
    }
}

$ awk -f tst.awk file1 file2 file3
en sample_4 0 0 20 20
en.n sample_2 10 0 10 20
en sample_1 200 10 1 211
en sample_3 10 67 0 77

The above uses GNU awk for ARGIND; with other awks, just add the line FNR==1{ARGIND++} at the start. The order of a for-in loop is unspecified, so pipe the output to sort if you need a particular order.
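
If ARGIND is not available, a portable variant can keep its own file counter instead. The sketch below is only an illustration, not the script above: the counter name nfiles, the temporary directory, and the sample files (copied from the question) are assumptions made for the example. Note that an empty input file would not bump the counter, because FNR==1 never fires for it.

```shell
# Sample inputs copied from the question; the file names are placeholders.
tmp=$(mktemp -d)
printf '%s\n' 'en sample_1 200' 'en.n sample_2 10' 'en sample_3 10' > "$tmp/file1"
printf '%s\n' 'en sample_1 10' 'en sample_3 67' > "$tmp/file2"
printf '%s\n' 'en sample_1 1' 'en.n sample_2 10' 'en sample_4 20' > "$tmp/file3"

result=$(awk '
FNR == 1 { nfiles++ }                 # first line of a new file: bump our own counter
NF >= 3 {
    key = $1 " " $2
    keys[key]                         # remember every key seen
    val[key, nfiles] = $3             # count for this key in this file
}
END {
    for (key in keys) {
        sum = 0
        printf "%s", key
        for (f = 1; f <= nfiles; f++) {
            printf "%s%s", OFS, val[key, f] + 0   # missing entries print as 0
            sum += val[key, f]
        }
        print OFS sum
    }
}' "$tmp/file1" "$tmp/file2" "$tmp/file3" | LC_ALL=C sort -k2)

printf '%s\n' "$result"
rm -rf "$tmp"
```

With the three sample files this prints one row per key, sorted by title, with the per-file counts followed by the sum (for example, en sample_1 200 10 1 211).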

Answer 2

awk -vn="<source> <title>" 'function w(m,p){while(split(a[m],t)!=b+2)sub(p," 0&",a[m])}FNR<2{f=FILENAME;o=o?o" <"f">":"<"f">";q=q?q","f:f;++b}{a[$1" "$2]=a[$1" "$2]?a[$1" "$2]" "$NF:$0;w($1" "$2," [^ ]*$");c[$1" "$2]+=$NF}END{print n,o,"sum<("q")>";for(i in a){w(i,"$");print a[i],c[i]|"sort -k2"}}' *
<source> <title> <f1> <f2> <f3> sum<(f1,f2,f3)>
en sample_1 200 10 1 211
en.n sample_2 10 0 10 20
en sample_3 10 67 0 77
en sample_4 0 0 20 20
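
The one-liner above is dense, so here is a more readable sketch of the same idea: collect one column per file, build the header from the file names, and print the sum last. It is not a line-by-line expansion of that command. The array names (names, seen, cnt, tot), the plain header without angle brackets, and the sample files f1, f2, f3 are assumptions made for the example, and the rows are left unsorted.

```shell
# Sample inputs copied from the question, named f1..f3 to match the header.
tmp=$(mktemp -d)
printf '%s\n' 'en sample_1 200' 'en.n sample_2 10' 'en sample_3 10' > "$tmp/f1"
printf '%s\n' 'en sample_1 10' 'en sample_3 67' > "$tmp/f2"
printf '%s\n' 'en sample_1 1' 'en.n sample_2 10' 'en sample_4 20' > "$tmp/f3"

table=$(cd "$tmp" && awk '
FNR == 1 { names[++n] = FILENAME }    # one output column per input file
{
    key = $1 " " $2
    seen[key]
    cnt[key, n] = $NF                 # count of this key in file number n
    tot[key] += $NF                   # running sum across all files
}
END {
    header = "source title"
    list = ""
    comma = ""
    for (i = 1; i <= n; i++) {
        header = header OFS names[i]
        list = list comma names[i]
        comma = ","
    }
    print header, "sum(" list ")"
    for (key in seen) {
        line = key
        for (i = 1; i <= n; i++)
            line = line OFS (cnt[key, i] + 0)   # absent counts become 0
        print line, tot[key]
    }
}' f1 f2 f3)

printf '%s\n' "$table"
rm -rf "$tmp"
```

The first output line is the header source title f1 f2 f3 sum(f1,f2,f3); the data rows follow in whatever order the for-in loop yields, so pipe them through sort if the order matters.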

Answer 3

Since your files are quite large, you might want to use join; it might be faster and/or use less memory. However, it requires the files to be sorted and to have a single join field.

join -a1 -a2 -e0 -o0,1.2,2.2     <(sed $'s/ /\034/' file1 | sort) \
                                 <(sed $'s/ /\034/' file2 | sort) | 
join -a1 -a2 -e0 -o0,1.2,1.3,2.2 - \
                                 <(sed $'s/ /\034/' file3 | sort) | 
awk '{sub(/\034/," "); print $0, $3+$4+$5}'

Explanation provided upon request
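
As a rough explanation of the pipeline above: sed replaces the first space with the control character \034 so that source and title form a single join field, join -a1 -a2 keeps lines that appear in only one file, -e0 fills their missing counts with 0, and the final awk restores the space and appends the sum. The two-file sketch below illustrates those steps. The temporary files and the variable names sep and joined are assumptions of the example, and it uses temp files with LC_ALL=C instead of process substitution so that it does not depend on bash.

```shell
# Two sample inputs copied from the question.
tmp=$(mktemp -d)
printf '%s\n' 'en sample_1 200' 'en.n sample_2 10' 'en sample_3 10' > "$tmp/file1"
printf '%s\n' 'en sample_1 10' 'en sample_3 67' > "$tmp/file2"

# Glue "source title" into one field with the \034 separator, then sort on it.
sep=$(printf '\034')
for f in file1 file2; do
    sed "s/ /$sep/" "$tmp/$f" | LC_ALL=C sort -k1,1 > "$tmp/$f.keyed"
done

# Full outer join: unmatched keys are kept and their missing count becomes 0.
joined=$(LC_ALL=C join -a 1 -a 2 -e 0 -o 0,1.2,2.2 \
             "$tmp/file1.keyed" "$tmp/file2.keyed" |
         awk '{ sub(/\034/, " "); print $0, $3 + $4 }')

printf '%s\n' "$joined"
rm -rf "$tmp"
```

Here en.n sample_2 is missing from the second file, so its row comes out as en.n sample_2 10 0 10. Each extra input file adds one more join stage and one more field to the -o list, as in the three-file command above.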

3 answers

solution1
4 2015-11-29 04:35:43

solution2
1 2015-11-29 01:42:52

solution3
0 2015-11-29 12:55:49
