
Count occurrences of each distinct substring in one column of a tab-separated file with UNIX tools

I would like to count how many times each distinct substring appears in the set of strings in the 2nd column of a tab-separated file. To do this, I split the column into substrings and then try to count them. However, it does not work correctly.

The input is like

rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA
rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA AA

The desired output

rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA   AA=9;AC=2
rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA CC   AA=10;CC=1

and so on....

awk 'BEGIN {FS=OFS="\t"} {gf=split($2,gfp," ")} {for (i=1;i<=gf;i++){
                                      if (gfp[i]=="AA"){i++; printf $1FS$2FS"%s\n" i, gfp[i]}
                                      else if (gfp[i]=="AC" || gfp[i] == "CA"){i++; printf $1FS$2FS"%s"gfp[i]"="i";\n"}
                                                            }}' input > output

I also tried another script, but I think it repeats each count as many times as the substring occurs in the row. Here I perform a second split inside the first one to distinguish between substrings:

awk 'BEGIN {FS=OFS="\t"} {gf=split($2,gfp," ");} {for (i=1;i<=gf;i++){

                     par=gfp[i];
                     gfeach=split($2,gfpeach,par);
                     print par "=" gfeach[i]";"
                                              }
                      }' input > output

I'm sure there are easier ways to do this, but I cannot get it fully working. Is it possible in a UNIX environment? Thanks in advance.

Your input doesn't match your output so we're all just guessing but this might be what you want:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    delete cnt
    split($2,tmp,/ /)
    for (i in tmp) {
        str = tmp[i]
        cnt[str]++
    }

    printf "%s", $0
    sep = OFS
    for (str in cnt) {
        printf "%s%s=%d", sep, str, cnt[str]
        sep = ";"
    }
    print ""
}

Depending on what your input really is the above will output the following:

$ cat file
rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA
rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA AA

$ awk -f tst.awk file
rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA        AA=9;AC=2
rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA AA        AA=11

$ cat file
rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA
rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA CC

$ awk -f tst.awk file
rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA        AA=9;AC=2
rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA CC        AA=10;CC=1
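
One caveat with the script above: awk does not specify the order in which `for (str in cnt)` visits the keys, so the `AA=...;AC=...` pairs may come out in a different order with another awk. Here is a minimal sketch that prints the keys in order of first appearance. It assumes the same layout as above (tab-separated, genotypes in `$2`); the `order` array and the `count_col2` wrapper are names made up for illustration.

```shell
# count_col2 FILE: append per-row counts of the space-separated tokens in column 2.
# Hypothetical wrapper name; assumes FILE is tab-separated.
count_col2() {
    awk '
    BEGIN { FS = OFS = "\t" }
    {
        split("", cnt)                 # portable way to empty an array
        split("", order)
        n = split($2, tmp, / +/)
        k = 0
        for (i = 1; i <= n; i++) {     # numeric loop keeps the input order
            str = tmp[i]
            if (str == "") continue    # skip empty pieces from stray spaces
            if (!(str in cnt)) order[++k] = str
            cnt[str]++
        }
        out = ""
        for (j = 1; j <= k; j++)
            out = out (j > 1 ? ";" : "") order[j] "=" cnt[order[j]]
        print $0, out
    }' "$1"
}
```

Call it as `count_col2 file`. With gawk you could instead set `PROCINFO["sorted_in"]` to get sorted keys, but that is gawk-specific.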

Something like this?

$ awk '{for(i=4;i<=NF;i++) c[$i]++; 
        for(k in c) {s=s sep k"="c[k]; sep=";"; c[k]=0} 
        $NF=$NF OFS s; s=sep=""}1' file | column -t

rs12255619  A/C  chr10  AA  AA  AC  AA  AA  AA  AA  AA  AA  AC  AA  AA=9;AC=2
rs7909677   A/G  chr10  AA  AA  AA  AA  AA  AA  AA  AA  AA  AA  AA  AA=11;AC=0

Note that the set of reported keys grows as the file is read: each row prints every key observed up to that row, with a count of 0 for keys absent from it. For example, if CC first appeared in the second row, it would not be listed on the first line.
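
If you would rather have each row list only the genotypes it actually contains, one option is to empty the array for every row instead of zeroing its entries. This is a sketch under the same assumptions as the one-liner above (whitespace-separated fields, genotypes starting at field 4). `split("", c)` is used as a portable way to clear an array, and the function name `per_row_counts` is only illustrative.

```shell
# per_row_counts FILE: like the one-liner above, but forget keys between rows.
per_row_counts() {
    awk '{
        split("", c)                          # empty the counts for this row
        for (i = 4; i <= NF; i++) c[$i]++
        s = sep = ""
        for (k in c) { s = s sep k "=" c[k]; sep = ";" }   # key order is unspecified
        print $0, s
    }' "$1"
}
```

Because `$0` is printed unmodified, the original spacing of each row is kept and the counts are appended after a single space.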

You could do it in Perl:

perl -lpe '$a{$_}++ for /\b[A-Z]{2}\b/g;
           $_.=" ".join(";",map{"$_=$a{$_}"}keys%a);
           %a = map{$_=>0}keys%a' file

produces

rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA AA=9;AC=2
rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA CC AA=10;CC=1;AC=0

For the new requirement (no zero counts), clear the hash after each line:

perl -lpe '$a{$_}++ for /\b[A-Z]{2}\b/g;
           $_.=" ".join(";",map{"$_=$a{$_}"}keys%a);
           undef %a' file

produces

rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA AC=2;AA=9
rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA CC CC=1;AA=10

#!/bin/bash

strings="AA AC CC"

while IFS= read -r line; do
        echo -n "$line: "
        for name in $strings; do
                # Put each token on its own line, then count whole-word matches.
                num=$(echo "$line" | xargs -n1 | grep -cw "$name")
                if [[ $num -ne 0 ]]; then
                        echo -n "$name=$num;"
                fi
        done
        echo
done < inputFile.txt
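
The loop above needs the list of substrings up front. If they are not known in advance, a variation is to let `sort | uniq -c` discover them. This is a rough sketch, assuming the genotypes are every whitespace-separated field from the 4th onwards and that lines have no leading whitespace; the function names `count_line` and `count_file` are hypothetical.

```shell
# count_line LINE: print "token=count;" pairs for fields 4.. of LINE, sorted by token.
count_line() {
    printf '%s\n' "$1" |
        tr -s ' \t' '\n' |        # one token per line
        tail -n +4 |              # drop the id, alleles and chromosome fields
        sort | uniq -c |          # produces "count token" lines
        awk '{ printf "%s=%s;", $2, $1 } END { print "" }'
}

# count_file FILE: print every line of FILE followed by its counts.
count_file() {
    while IFS= read -r line; do
        printf '%s: %s\n' "$line" "$(count_line "$line")"
    done < "$1"
}
```

Run it as `count_file inputFile.txt`. It starts several processes per line, so for large files the awk answers will be much faster.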
