I have a directory lot of txt tab-delimited files with several rows and columns, eg
File1
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C221C:p.D461W
3 s5 t1 c.G31T:p.G61R
File2
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C21C:p.D61W
3 s5 t1 c.G1T:p.G1R
and what I am looking for is to create a new file with:
ie:
NewFile
Variant Nº of repeated Location
c.B481A:p.G861S 2 File1,File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
c.G1T:p.G1R 1 File2
I think using a basic script in bash with awk sort and uniq it will work, but I do not know where to start. Or if using Rstudio or python(3) is easier, I could try.
Thanks!!
Pure bash. Requires version 4.0+
# two associative arrays
declare -A files
declare -A count
# use a glob pattern that matches your files
for f in File{1,2}; do
{
read header
while read -ra fields; do
variant=${fields[3]} # use index "15" for 16th column
(( count[$variant] += 1 ))
files[$variant]+=",$f"
done
} < "$f"
done
for variant in "${!count[@]}"; do
printf "%s\t%d\t%s\n" "$variant" "${count[$variant]}" "${files[$variant]#,}"
done
outputs
c.B481A:p.G861S 2 File1,File2
c.G1T:p.G1R 1 File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
The order of the output lines is indeterminate: associative arrays have no particular ordering.
I have a directory lot of txt tab-delimited files with several rows and columns, eg
File1
Id Sample Time ... Variant[Colummn16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C221C:p.D461W
3 s5 t1 c.G31T:p.G61R
File2
Id Sample Time ... Variant[Colummn16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C21C:p.D61W
3 s5 t1 c.G1T:p.G1R
and what I am looking for is to create a new file with:
ie:
NewFile
Variant Nº of repeated Location
c.B481A:p.G861S 2 File1,File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
c.G1T:p.G1R 1 File2
I think using a basic script in bash with awk sort and uniq it will work, but I do not know where to start. Or if using Rstudio or python(3) is easier, I could try.
Thanks!!
Pure bash would be hard I think but everyone has some awk lying around:D
awk 'FNR==1{next}
{
++n[$16];
if ($16 in a) {
a[$16]=a[$16]","ARGV[ARGIND]
}else{
a[$16]=ARGV[ARGIND]
}
}
END{
printf("%-24s %6s %s\n","Variant","Nº","Location");
for (v in n) printf("%-24s %6d %s\n",v,n[v],a[v])}' *
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.