awk Count number of occurrences

I wrote these awk commands in a shell script to count how often particular combinations of $4 and $5 occur.

awk -F" " '{if($4=="A" && $5=="G") {print NR"\t"$0}}' file.txt > ag.txt && cat ag.txt | wc -l
awk -F" " '{if($4=="C" && $5=="T") {print NR"\t"$0}}' file.txt > ct.txt && cat ct.txt | wc -l

awk -F" " '{if($4=="T" && $5=="C") {print NR"\t"$0}}' file.txt > tc.txt && cat tc.txt | wc -l
awk -F" " '{if($4=="T" && $5=="A") {print NR"\t"$0}}' file.txt > ta.txt && cat ta.txt | wc -l

The output in the shell is just a number (####). But I want to get rid of > ag.txt && cat ag.txt | wc -l and instead get output in the shell like AG = ####.

This is input format:

>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 185 185 T - 24 100 10 14 10 14
>seq1 194 194 T C 24 100 12 12 12 12
>seq1 185 185 T AAA 24 100 10 14 10 14
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14

I want output like this, in the shell or in a file, counting only these single-letter pairs and ignoring other patterns:

AG 2
CT 1
TC 1
TA 1

Yes, everything you're trying to do can likely be done within the awk script. Here's how I'd count lines based on a condition:

awk -F" " '$4=="A" && $5=="G" {n++} END {printf("AG = %d\n", n)}' file.txt
  • Awk scripts consist of condition { statement } pairs, so you can do away with the if entirely -- it's implicit.
  • n++ increments a counter whenever the condition is matched.
  • The magic condition END is true after the last line of input has been processed.
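If you end up running this for several pairs, one option is to pass the two letters in with -v instead of hard-coding them. This is only a sketch: x and y are variable names I made up, and the temp file just holds the sample rows from the question.

```shell
# Sample rows copied from the question, written to a throwaway temp file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 185 185 T - 24 100 10 14 10 14
>seq1 194 194 T C 24 100 12 12 12 12
>seq1 185 185 T AAA 24 100 10 14 10 14
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
EOF

# x and y are illustrative names; -v sets them before any input is read.
out=$(awk -v x=A -v y=G '$4==x && $5==y {n++} END {printf("%s%s = %d\n", x, y, n)}' "$tmp")
echo "$out"    # prints: AG = 2
rm -f "$tmp"
```

Swapping in -v x=T -v y=C reports TC = 1 for the same input.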

Is this what you're after? Why were you adding NR to your output if all you wanted was the line count?

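Oh, and you might want to confirm whether you really need -F" " . As far as I know it is redundant: by default awk splits on runs of whitespace (spaces and tabs), and setting the separator to a single space selects that same default behaviour. The option only matters when you need a different separator, such as -F"\t" to split on tabs alone.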
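A quick way to check the effect of -F" " on your own system is to feed awk one line that mixes tabs and repeated spaces, once with the option and once without. The line below is made up for the test; if both commands report the same field count and the same pair, the option is not changing anything.

```shell
# One made-up line with tabs and runs of spaces between the fields.
line=$(printf '>seq1\t284  284\tA   G 27')

# Default field splitting versus an explicit single-space separator.
default=$(printf '%s\n' "$line" | awk '{print NF, $4 $5}')
explicit=$(printf '%s\n' "$line" | awk -F" " '{print NF, $4 $5}')

echo "default:  $default"     # prints: default:  6 AG
echo "explicit: $explicit"    # prints: explicit: 6 AG
```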


UPDATE #1 based on the edited question...

If what you're really after is a pair counter, an awk array may be the way to go. Something like this:

awk '{a[$4 $5]++} END {for (pair in a) printf("%s %d\n", pair, a[pair])}' file.txt

Here's the breakdown.

  • The first statement runs on every line and increments a counter stored in an array ( a[] ) whose key is built by concatenating $4 and $5 .
  • In the END block, we step through the array in a for loop, and for each index, print the index name and the value.

The output will not be in any particular order, as awk does not guarantee array order. If that's fine with you, then this should be sufficient. It should also be pretty efficient, because its max memory usage is based on the total number of combinations available, which is a limited set.

Example:

$ cat file
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 227 227 T C 25 100 13 12 13 12
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
$ awk '/^>seq/ {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' file
CT 1
TA 1
TC 1
AG 2
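If a stable order matters, one simple option is to pipe the result through sort rather than relying on awk's array traversal. A minimal sketch, using the same five rows as the example above in a temp file; LC_ALL=C is my addition to keep the ordering locale-independent.

```shell
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 227 227 T C 25 100 13 12 13 12
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
EOF

# awk counts the pairs; sort fixes the order of the output lines.
out=$(awk '{a[$4 $5]++} END {for (pair in a) printf("%s %d\n", pair, a[pair])}' "$tmp" | LC_ALL=C sort)
echo "$out"
# prints:
# AG 2
# CT 1
# TA 1
# TC 1
rm -f "$tmp"
```

To order by count instead, sort -k2,2nr on the second column should do it.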

UPDATE #2 based on the revised input data and previously undocumented requirements.

With the extra data, you can still do this with a single run of awk, but of course the awk script is getting more complex with each new requirement. Let's try this as a longer one-liner:

$ awk 'BEGIN{v["G"]; v["A"]; v["C"]; v["T"]} $4 in v && $5 in v {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' i
CT 1
TA 1
TC 1
AG 2

This works by first (in the magic BEGIN block) defining an array, v[] , whose indices are the "valid" values. The condition on the counter simply verifies that both $4 and $5 are indices of that array. All else works the same.

At this point, with the script running onto multiple lines anyway, I'd probably separate this into a small file. It could even be a stand-alone script.

#!/usr/bin/awk -f

BEGIN {
  v["G"]; v["A"]; v["C"]; v["T"]
}

$4 in v && $5 in v {
  a[$4 $5]++
}

END {
  for (p in a)
    printf("%s %d\n", p, a[p])
}

Much easier to read that way.
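For completeness, here is one way to run such a file. The names count_pairs.awk and input.txt are hypothetical, and I'm calling it through awk -f so it works wherever awk is installed; making the file executable and running it directly also works when the shebang path matches your system.

```shell
dir=$(mktemp -d)

# The script from above, saved under a hypothetical name.
cat > "$dir/count_pairs.awk" <<'EOF'
BEGIN { v["G"]; v["A"]; v["C"]; v["T"] }
$4 in v && $5 in v { a[$4 $5]++ }
END { for (p in a) printf("%s %d\n", p, a[p]) }
EOF

# Sample rows from the question, including the "-" and "AAA" rows.
cat > "$dir/input.txt" <<'EOF'
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 185 185 T - 24 100 10 14 10 14
>seq1 194 194 T C 24 100 12 12 12 12
>seq1 185 185 T AAA 24 100 10 14 10 14
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
EOF

# sort is only there to make the output order predictable.
out=$(awk -f "$dir/count_pairs.awk" "$dir/input.txt" | LC_ALL=C sort)
echo "$out"
# prints:
# AG 2
# CT 1
# TA 1
# TC 1
rm -rf "$dir"
```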

And if your goal is to count ONLY the combinations you mentioned in your question, you can handle the array slightly differently.

#!/usr/bin/awk -f

BEGIN {
  a["AG"]; a["TA"]; a["CT"]; a["TC"]
}

($4 $5) in a {
  a[$4 $5]++
}

END {
  for (p in a)
    printf("%s %d\n", p, a[p])
}

This counts only pairs that already exist as array indices; BEGIN creates them with empty values, which print as 0 if a pair never occurs.

The parentheses in the increment condition are not required, and are included only for clarity.
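The trick relies on two awk behaviours: merely referencing a[key] creates the key with an empty value, while testing key in a does not create anything. A small sketch to see both, with made-up keys and no input file needed:

```shell
out=$(awk 'BEGIN {
  a["AG"]                   # a bare reference creates the key with an empty value
  if ("XY" in a) x = 1      # a membership test does not create "XY"
  n = 0
  for (k in a) n++          # count the keys that actually exist
  printf("%d key(s), AG count prints as %d\n", n, a["AG"])
}')
echo "$out"    # prints: 1 key(s), AG count prints as 0
```

That is why the script can pre-register the four pairs in BEGIN and then safely test every other pair without growing the array.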

Just count them all then print the ones you care about:

$ awk '{cnt[$4$5]++} END{split("AG CT TC TA",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1

Note that this will produce a count of zero for any of your target pairs that don't appear in your input, e.g. if you want a count of "XY"s too:

$ awk '{cnt[$4$5]++} END{split("AG CT TC TA XY",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
XY 0

If that's desirable, check if other solutions do the same.
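The zero comes from the +0, which forces a numeric value; an array element that was never incremented is otherwise an empty string. A minimal sketch with a single made-up input row, using brackets to make the empty value visible:

```shell
# "XY" never occurs in the input, so cnt["XY"] was never incremented.
out=$(printf '>seq1 284 284 A G\n' | awk '{cnt[$4$5]++} END {printf("[%s] [%s]\n", cnt["XY"], cnt["XY"]+0)}')
echo "$out"    # prints: [] [0]
```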

Actually, this might be what you REALLY want, just to make sure $4 and $5 are single upper case letters:

$ awk '$4$5 ~ /^[[:upper:]]{2}$/{cnt[$4$5]++} END{for (i in cnt) print i, cnt[i]}' file
TA 1
AG 2
TC 1
CT 1
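Two caveats, as far as I can tell: some older awk implementations do not support interval expressions such as {2}, and the combined test would also accept a row where $4 holds two letters and $5 is missing. A more conservative sketch checks each field on its own. It assumes the only valid letters are A, C, G and T, and the last sample row (four fields, "AG" in $4) is made up to show the difference.

```shell
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 185 185 T - 24 100 10 14 10 14
>seq1 194 194 T C 24 100 12 12 12 12
>seq1 185 185 T AAA 24 100 10 14 10 14
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
>seq1 300 300 AG
EOF

# Each field must be exactly one letter from the assumed alphabet.
out=$(awk '$4 ~ /^[ACGT]$/ && $5 ~ /^[ACGT]$/ {cnt[$4$5]++} END {for (i in cnt) print i, cnt[i]}' "$tmp" | LC_ALL=C sort)
echo "$out"
# prints:
# AG 2
# CT 1
# TA 1
# TC 1
rm -f "$tmp"
```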
