I would like to count how many times each distinct substring appears in the set of strings held in the 2nd column of a tab-delimited file. To do this, I split the column into its substrings and then try to count them. However, it does not work correctly.
The input looks like this:
rs12255619 A/C chr10 AA AA AC AA AA AA AA AA AA AC AA
rs7909677 A/G chr10 AA AA AA AA AA AA AA AA AA AA AA
The desired output is:
rs12255619 A/C chr10 AA AA AC AA AA AA AA AA AA AC AA AA=9;AC=2
rs7909677 A/G chr10 AA AA AA AA AA AA AA AA AA AA CC AA=10;CC=1
and so on....
awk 'BEGIN {FS=OFS="\t"} {gf=split($2,gfp," ")} {for (i=1;i<=gf;i++){
if (gfp[i]=="AA"){i++; printf $1FS$2FS"%s\n" i, gfp[i]}
else if (gfp[i]=="AC" || gfp[i] == "CA"){i++; printf $1FS$2FS"%s"gfp[i]"="i";\n"}
}}' input > output
I also tried another script, but I think it repeats each count as many times as the substring occurs in the row. Here I run a second split inside the first one to distinguish between the substrings:
awk 'BEGIN {FS=OFS="\t"} {gf=split($2,gfp," ");} {for (i=1;i<=gf;i++){
par=gfp[i];
gfeach=split($2,gfpeach,par);
print par "=" gfeach[i]";"
}
}' input > output
I'm sure there are easier ways to do this, but I cannot get it fully working. Is it possible to do this in a UNIX environment? Thanks in advance.
Your input doesn't match your output so we're all just guessing but this might be what you want:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    delete cnt
    split($2,tmp,/ /)
    for (i in tmp) {
        str = tmp[i]
        cnt[str]++
    }
    printf "%s", $0
    sep = OFS
    for (str in cnt) {
        printf "%s%s=%d", sep, str, cnt[str]
        sep = ";"
    }
    print ""
}
Depending on what your input really is, the above will output one of the following:
$ cat file
rs12255619 A/C chr10 AA AA AC AA AA AA AA AA AA AC AA
rs7909677 A/G chr10 AA AA AA AA AA AA AA AA AA AA AA
$ awk -f tst.awk file
rs12255619 A/C chr10 AA AA AC AA AA AA AA AA AA AC AA AA=9;AC=2
rs7909677 A/G chr10 AA AA AA AA AA AA AA AA AA AA AA AA=11
$ cat file
rs12255619 A/C chr10 AA AA AC AA AA AA AA AA AA AC AA
rs7909677 A/G chr10 AA AA AA AA AA AA AA AA AA AA CC
$ awk -f tst.awk file
rs12255619 A/C chr10 AA AA AC AA AA AA AA AA AA AC AA AA=9;AC=2
rs7909677 A/G chr10 AA AA AA AA AA AA AA AA AA AA CC AA=10;CC=1
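One detail worth keeping in mind: awk does not define the order in which a "for (str in cnt)" loop visits keys, so the order of the AA=...;AC=... pairs may differ between awk implementations. Below is a minimal sketch that prints the keys in first-seen order instead. It assumes two tab-separated columns with the space-separated genotypes in the second one, and the function name count_genotypes is hypothetical (it is not part of the script above).

```shell
#!/bin/sh
# count_genotypes: hypothetical helper name used only for this sketch.
# Assumption: stdin has tab-separated rows, genotypes are space-separated in field 2.
count_genotypes() {
    awk '
    BEGIN { FS = OFS = "\t" }
    {
        split("", cnt)                 # empty the counts; works in any POSIX awk
        n = split($2, tmp, " ")        # " " splits on runs of whitespace
        nkeys = 0
        for (i = 1; i <= n; i++) {     # numeric loop keeps left-to-right order
            if (!(tmp[i] in cnt)) order[++nkeys] = tmp[i]
            cnt[tmp[i]]++
        }
        printf "%s", $0
        sep = OFS
        for (k = 1; k <= nkeys; k++) { # print keys in the order first seen
            printf "%s%s=%d", sep, order[k], cnt[order[k]]
            sep = ";"
        }
        print ""
    }'
}

# Example call on one made-up row (columns separated by a tab)
printf 'rs1\tAA AC AA\n' | count_genotypes
```

If sorted output is preferred over first-seen order, the key list could instead be piped through sort, at the cost of a second process.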
something like this?
$ awk '{for(i=4;i<=NF;i++) c[$i]++;
for(k in c) {s=s sep k"="c[k]; sep=";"; c[k]=0}
$NF=$NF OFS s; s=sep=""}1' file | column -t
rs12255619 A/C chr10 AA AA AC AA AA AA AA AA AA AC AA AA=9;AC=2
rs7909677 A/G chr10 AA AA AA AA AA AA AA AA AA AA AA AA=11;AC=0
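Note that the set of reported keys grows as the file is read: counts are reset to zero rather than deleted, so each row reports every key observed up to and including that row, with a count of 0 for keys absent from it (AC=0 above).
Keys that first appear later are not reported on earlier rows. For example, if CC first appeared in the second row, it would not be listed on the first line.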
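If the zero counts carried over from earlier rows are not wanted, the counts can be cleared for every row instead of being set to 0. This is a rough sketch only, assuming whitespace-separated fields with the genotypes starting at field 4, as in the command above. The name count_from_field4 is made up for illustration.

```shell
#!/bin/sh
# count_from_field4: hypothetical helper name used only for this sketch.
# Assumption: whitespace-separated fields, genotypes occupy fields 4..NF.
count_from_field4() {
    awk '
    {
        split("", c)                        # forget every key from the previous row
        s = sep = ""
        for (i = 4; i <= NF; i++) c[$i]++   # tally fields 4..NF
        for (k in c) {                      # iteration order is unspecified in awk
            s = s sep k "=" c[k]
            sep = ";"
        }
        print $0, s                         # original row, then the summary
    }'
}

# Example call on one made-up row
printf 'rs1 A/C chr10 AA AA AC\n' | count_from_field4
```

With this variant a row containing only AA reports AA=... alone, regardless of what earlier rows contained. The order of the pairs within a row is still up to the awk implementation.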
You could do it in Perl:
perl -lpe '$a{$_}++ for /\b[A-Z]{2}\b/g;
$_.=" ".join(";",map{"$_=$a{$_}"}keys%a);
%a = map{$_=>0}keys%a' file
produces
rs12255619 A/C chr10 AA AA AC AA AA AA AA AA AA AC AA AA=9;AC=2
rs7909677 A/G chr10 AA AA AA AA AA AA AA AA AA AA CC AA=10;CC=1;AC=0
For the new requirement:
perl -lpe '$a{$_}++ for /\b[A-Z]{2}\b/g;
$_.=" ".join(";",map{"$_=$a{$_}"}keys%a);
undef %a' file
produces
rs12255619 A/C chr10 AA AA AC AA AA AA AA AA AA AC AA AC=2;AA=9
rs7909677 A/G chr10 AA AA AA AA AA AA AA AA AA AA CC CC=1;AA=10
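#!/bin/bash
# Substrings to look for (adjust as needed)
strings="AA AC CC"
while IFS= read -r line; do
    printf '%s: ' "$line"
    for name in $strings; do
        # Put one token per line, then count whole-word matches
        num=$(printf '%s\n' "$line" | xargs -n1 | grep -cw -- "$name")
        if [[ $num -ne 0 ]]; then
            printf '%s=%s;' "$name" "$num"
        fi
    done
    echo
done < inputFile.txt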