
Selecting specific rows of a tab-delimited file using bash (Linux)

I have a directory with a lot of tab-delimited txt files with several rows and columns, e.g.:

File1
Id    Sample   Time ...  Variant[Column16] ...
1     s1       t0        c.B481A:p.G861S
2     s2       t2        c.C221C:p.D461W
3     s5       t1        c.G31T:p.G61R
File2
Id    Sample   Time ...  Variant[Column16] ...
1     s1       t0        c.B481A:p.G861S
2     s2       t2        c.C21C:p.D61W
3     s5       t1        c.G1T:p.G1R

and what I am looking for is to create a new file with:

  • all the distinct variants (unique)
  • the number of times each variant is repeated
  • and the file(s) it is found in

i.e.:

NewFile
Variant             Nº of repeats        Location
c.B481A:p.G861S     2                    File1,File2
c.C221C:p.D461W     1                    File1
c.G31T:p.G61R       1                    File1
c.C21C:p.D61W       1                    File2
c.G1T:p.G1R         1                    File2

I think a basic bash script using awk, sort and uniq will work, but I do not know where to start. Or, if RStudio or Python (3) is easier, I could try that.

Thanks!!

Pure bash; requires version 4.0+ (for associative arrays).

# two associative arrays, both keyed by variant:
# `count` holds occurrence counts, `files` holds comma-joined file lists
declare -A files
declare -A count

# use a glob pattern that matches your files
for f in File{1,2}; do
    {
        read -r header                  # skip the header line
        while read -ra fields; do
            variant=${fields[3]}        # use index "15" for the 16th column
            (( count[$variant] += 1 ))  # tally this variant
            files[$variant]+=",$f"      # append the current file name
        done
    } < "$f"
done
done

# print each variant, its count, and its file list (stripping the leading comma)
for variant in "${!count[@]}"; do
    printf "%s\t%d\t%s\n" "$variant" "${count[$variant]}" "${files[$variant]#,}"
done

outputs

c.B481A:p.G861S 2   File1,File2
c.G1T:p.G1R 1   File2
c.C221C:p.D461W 1   File1
c.G31T:p.G61R   1   File1
c.C21C:p.D61W   1   File2

The order of the output lines is indeterminate: associative arrays have no particular ordering.
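
If you want a stable order and the header from the question, you can pipe the loops' output through sort and redirect it; a minimal sketch, assuming the code above is saved as variants.sh (a hypothetical name):

# prepend the header, then sort by the count column (2nd, numeric, descending)
{
    printf 'Variant\tNº of repeats\tLocation\n'
    bash variants.sh | sort -t $'\t' -k2,2nr
} > NewFile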

Pure bash would be hard, I think, but everyone has some awk lying around :D

awk 'FNR == 1 { next }                 # skip each file's header line
{
  ++n[$16];                            # count occurrences of the variant (16th column)
  if ($16 in a) {
    a[$16] = a[$16] "," FILENAME       # append the current file to the location list
  } else {
    a[$16] = FILENAME                  # FILENAME is the file currently being read
  }
}
END {
  printf("%-24s %6s    %s\n", "Variant", "Nº", "Location");
  for (v in n) printf("%-24s %6d    %s\n", v, n[v], a[v])
}' *
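
One caveat with the above: if a variant occurs more than once in the same file, that file name is appended each time. A minimal sketch that lists each file at most once per variant (same logic plus a seen array; column 16 is assumed, as in the question):

awk 'FNR == 1 { next }
{
  ++n[$16]
  if (!seen[$16, FILENAME]++)          # record each file only once per variant
    a[$16] = ($16 in a) ? a[$16] "," FILENAME : FILENAME
}
END {
  printf("%-24s %6s    %s\n", "Variant", "Nº", "Location");
  for (v in n) printf("%-24s %6d    %s\n", v, n[v], a[v])
}' * > NewFile

Redirecting to NewFile produces the file asked for in the question.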
