Count occurrences of each distinct substring in one column of a tab-separated file with UNIX tools

Question

I would like to count how many times each distinct substring appears in the set of strings in the 2nd column of a tab-separated file. My approach is to split the column into its substrings and then count them, but it does not work correctly.

The input is like

rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA
rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA AA

The desired output

rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA   AA=9;AC=2
rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA CC   AA=10;CC=1

and so on....

awk 'BEGIN {FS=OFS="\t"} {gf=split($2,gfp," ")} {for (i=1;i<=gf;i++){
                                      if (gfp[i]=="AA"){i++; printf $1FS$2FS"%s\n" i, gfp[i]}
                                      else if (gfp[i]=="AC" || gfp[i] == "CA"){i++; printf $1FS$2FS"%s"gfp[i]"="i";\n"}
                                                            }}' input > output

I also tried another script, but I think it repeats each count as many times as the substring occurs in the row. Here I run a second split inside the first one to tell the substrings apart:

awk 'BEGIN {FS=OFS="\t"} {gf=split($2,gfp," ");} {for (i=1;i<=gf;i++){

                     par=gfp[i];
                     gfeach=split($2,gfpeach,par);
                     print par "=" gfeach[i]";"
                                              }
                      }' input > output

I'm sure there are easier ways to do this, but I cannot solve it completely. Is it possible in a UNIX environment? Thanks in advance.

Answer 1

Your input doesn't match your output so we're all just guessing but this might be what you want:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    delete cnt
    split($2,tmp,/ /)
    for (i in tmp) {
        str = tmp[i]
        cnt[str]++
    }

    printf "%s", $0
    sep = OFS
    for (str in cnt) {
        printf "%s%s=%d", sep, str, cnt[str]
        sep = ";"
    }
    print ""
}

Depending on what your input really is the above will output the following:

$ cat file
rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA
rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA AA

$ awk -f tst.awk file
rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA        AA=9;AC=2
rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA AA        AA=11

$ cat file
rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA
rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA CC

$ awk -f tst.awk file
rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA        AA=9;AC=2
rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA CC        AA=10;CC=1
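
One caveat: `for (str in cnt)` visits keys in an unspecified order, so the AA=9;AC=2 ordering shown above is not guaranteed across awk implementations. Below is a minimal sketch of a variant that prints the keys in sorted order using only POSIX awk features. The function name `count_sorted` is hypothetical, and it assumes the same layout as above (tab-separated, genotypes space-separated in the 2nd column).

```shell
# Hypothetical helper: append per-row counts with keys in sorted order.
# Assumes tab-separated input with space-separated genotypes in column 2.
count_sorted() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        split("", cnt); n = 0            # portable way to empty an array
        m = split($2, tmp, / /)
        for (i = 1; i <= m; i++) {
            if (tmp[i] == "") continue   # skip empty pieces from repeated spaces
            if (!(tmp[i] in cnt)) keys[++n] = tmp[i]
            cnt[tmp[i]]++
        }
        # Insertion sort of the distinct keys seen in this row.
        for (i = 2; i <= n; i++) {
            k = keys[i]
            for (j = i - 1; j >= 1 && keys[j] > k; j--) keys[j + 1] = keys[j]
            keys[j + 1] = k
        }
        out = ""
        for (i = 1; i <= n; i++) out = out (i > 1 ? ";" : "") keys[i] "=" cnt[keys[i]]
        print $0, out
    }' "$@"
}
```

It can be called as `count_sorted file`, or fed from a pipe.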

Answer 2

something like this?

$ awk '{for(i=4;i<=NF;i++) c[$i]++; 
        for(k in c) {s=s sep k"="c[k]; sep=";"; c[k]=0} 
        $NF=$NF OFS s; s=sep=""}1' file | column -t

rs12255619  A/C  chr10  AA  AA  AC  AA  AA  AA  AA  AA  AA  AC  AA  AA=9;AC=2
rs7909677   A/G  chr10  AA  AA  AA  AA  AA  AA  AA  AA  AA  AA  AA  AA=11;AC=0

Note that the set of reported keys grows as the file is read: each row lists every key observed up to and including that row, with a count of 0 for keys absent from the current row. For example, if CC first appears in the second row, it won't be listed on the first line.
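
If each row should list only its own keys, one option is to empty the array at the start of every row instead of zeroing its entries. The following is a sketch of that variant, not the one-liner as posted: the wrapper name `count_per_row` is hypothetical, and it keeps the assumption that genotypes start at field 4.

```shell
# Hypothetical wrapper around a variant of the one-liner above.
count_per_row() {
    awk '{
        split("", c)                       # forget keys seen on earlier rows
        for (i = 4; i <= NF; i++) c[$i]++  # genotypes assumed to start at field 4
        s = sep = ""
        for (k in c) { s = s sep k "=" c[k]; sep = ";" }
        print $0, s                        # $0 is untouched, so spacing is preserved
    }' "$@"
}
```

The order of keys within a row is still whatever `for (k in c)` yields, so pipe through a sort step if a fixed order matters.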

Answer 3

You could do it in Perl:

perl -lpe '$a{$_}++ for /\b[A-Z]{2}\b/g;
           $_.=" ".join(";",map{"$_=$a{$_}"}keys%a);
           %a = map{$_=>0}keys%a' file

produces

rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA AA=9;AC=2
rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA CC AA=10;CC=1;AC=0

For the new requirement, clear the hash after each line instead of zeroing its entries:

perl -lpe '$a{$_}++ for /\b[A-Z]{2}\b/g;
           $_.=" ".join(";",map{"$_=$a{$_}"}keys%a);
           undef %a' file

produces

rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA AC=2;AA=9
rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA CC CC=1;AA=10

Answer 4

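#!/bin/bash

# Substrings to look for; extend this list as needed.
strings="AA AC CC"

# Read each row verbatim (no backslash processing, no whitespace trimming).
while IFS= read -r line; do
        printf '%s: ' "$line"
        for name in $strings; do
                # Put each word on its own line, then count whole-word matches.
                num=$(printf '%s\n' "$line" | tr -s '[:space:]' '\n' | grep -cw -- "$name")
                if [[ $num -ne 0 ]]; then
                        printf '%s=%s;' "$name" "$num"
                fi
        done
        echo
done < inputFile.txt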
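
The script above needs the list of substrings up front. A different technique, sketched here under the assumption that the file is tab-separated with the genotypes in the 2nd column, lets `sort | uniq -c` discover the distinct values. The function names `count_genotypes` and `annotate` are hypothetical.

```shell
# Hypothetical helper: count the space-separated words in "$1"
# and print them as KEY=COUNT pairs joined by ";".
count_genotypes() {
    printf '%s\n' "$1" | tr -s ' ' '\n' | sort | uniq -c |
        awk '{ printf "%s%s=%d", (NR > 1 ? ";" : ""), $2, $1 } END { print "" }'
}

# Hypothetical driver: read "id<TAB>genotypes" rows from stdin and
# append the counts as a third tab-separated column.
annotate() {
    tab=$(printf '\t')
    while IFS="$tab" read -r id genos; do
        printf '%s\t%s\t%s\n' "$id" "$genos" "$(count_genotypes "$genos")"
    done
}
```

Run it as `annotate < inputFile.txt`. It starts several processes per row, so on large files a single awk pass will be much faster.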

4 answers

solution1
3 ACCEPTED 2018-04-17 14:28:56

solution2
2 2018-04-17 14:27:35

solution3
2 2018-04-17 14:32:13

solution4
-1 2018-04-17 14:27:47
