Optimize grep, awk and sed shell stuff

Question

I'm trying to sum the traffic on different ports in the log files from "IPCop", so I wrote a shell command for it, but I think the command can be optimized.

First, a line from my log file:

01/00:03:16 kernel INPUT IN=eth1 OUT= MAC=xxx SRC=xxx DST=xxx LEN=40 TOS=0x00 PREC=0x00 TTL=98 ID=256 PROTO=TCP SPT=47438 DPT=1433 WINDOW=16384 RES=0x00 SYN URGP=0

Now I use the following command to sum the lengths of all lines that contain port 1433:

grep 1433 log.dat|awk '{for(i=1;i<=10;i++)if($i ~ /LEN/)print $i};'|sed 's/LEN=//g;'|awk '{sum+=$1}END{print sum}'

I need the for loop because the LEN column is not always in the same position.

Any suggestions for optimizing this command?

Regards Rene

Answer 1

Since I don't have the rep to add a comment to Noufal Ibrahim's answer, here is a more natural solution using Perl.

perl -ne '$sum += $1 if /LEN=(\d+)/; END { print $sum; }' log.dat

@Noufal you can make Perl do all the hard work ;).

Answer 2

If it really needs optimization, as in it runs unbearably slowly, you should probably rewrite it in a more general-purpose language. Even AWK could do, but I'd suggest something closer to Perl or Java for a long-running extractor.

One change you could make is, rather than using an unnecessary sed and a second awk call, to move the END block into the first awk call and use split() to extract the number from LEN=num and add it to the accumulator. Something like split($i, x, "="); sum += x[2].

The main problem is that you can't write something like awk '/LEN=(...)/ { sum += whatever the (...) matched }'.
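
The single-awk rewrite suggested in this answer might look like the sketch below. It is only an illustration: the function name sum_len_split is made up, and it deliberately keeps the question's loose filter (the port number appearing anywhere in the line, as with grep 1433).

```shell
# Hypothetical helper (name is illustrative, not from the original post).
# Usage: sum_len_split PORT FILE
sum_len_split() {
    awk -v port="$1" '
        index($0, port) {                  # same loose filter as "grep 1433"
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^LEN=/) {        # the LEN column can move around
                    split($i, x, "=")      # x[1] is "LEN", x[2] is the number
                    sum += x[2]
                    break                  # one LEN per line is enough
                }
            }
        }
        END { print sum + 0 }              # "+ 0" prints 0 when nothing matched
    ' "$2"
}
```

It would be called as sum_len_split 1433 log.dat, replacing the grep, sed and second awk stages with one process.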

Answer 3

Any time you have grep/sed/awk combinations in a pipeline, you can simplify into a single awk or perl command. Here's an awk solution:

gawk -v dpt=1433 '
    $0 ~ dpt {
        for (i=1; i<=NF; i++) {
            if ($i ~ /^LEN=[[:digit:]]+/) {
                split($i, ary, /=/)
                sum += ary[2]
                next
            }
        } 
    } 
    END {print sum}
' log.dat
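
One thing to be aware of: a filter like $0 ~ dpt (or grep 1433) also matches lines where 1433 shows up in another field such as SPT= or ID=. If only the destination port should count, a stricter variant along these lines may be closer to the intent. This is a sketch in plain awk, not part of the answer above; the function name sum_len_dpt is made up, and it assumes space-separated KEY=VALUE fields as in the sample log line.

```shell
# Hypothetical helper (name is illustrative, not from the original post).
# Usage: sum_len_dpt PORT FILE
sum_len_dpt() {
    awk -v want="DPT=$1" '
        {
            len = ""; hit = 0
            for (i = 1; i <= NF; i++) {
                if ($i == want)                 # exact field match, e.g. DPT=1433
                    hit = 1
                else if ($i ~ /^LEN=[0-9]+$/)   # remember this line LEN value
                    len = substr($i, 5)         # text after "LEN="
            }
            if (hit) sum += len                 # count LEN only for the wanted port
        }
        END { print sum + 0 }                   # "+ 0" prints 0 when nothing matched
    ' "$2"
}
```

For example, sum_len_dpt 1433 log.dat would skip a line with SPT=1433 DPT=80, which the looser filter would include.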

Answer 4

If you are using gawk, you can avoid the for loop: use the match() function with the word-boundary operators \\< and \\> to find the substring "\\<LEN=.*\\>" (i.e., to pick out the field you want), and then substr() to extract the argument of LEN. A single awk invocation can then do everything.

Postscript

The regexp I gave above doesn't work, because the = character is not part of a word. The following awk script does work:

/1433/ { f=match($0,/ LEN=[[:digit:]]+ /); v=substr($0,RSTART+5,RLENGTH-6); s+=v; }
END    { print "sum=" s; }

Answer 5

If each entry is on a single line, you can use Perl to extract the LEN numbers and sum them.

perl -e '$f = 0; while (<>) { $f += $1 if /LEN=([0-9]+)/; } print "$f\n";' input.log

I apologise for the bad Perl. I'm not a Perl guy at all.

5 answers

solution1
5 2010-06-01 12:12:38

solution2
3 ACCEPTED 2010-06-01 12:09:14

solution3
2 2010-06-01 17:02:55

solution4
1 2010-06-01 11:57:34

solution5
0 2010-06-01 11:57:52
