
Using sed/awk to fix a malformed file

I have a malformed variant call format (VCF) file that I am trying to fix, but I am having some trouble. Here's a snippet of the file:

##fileformat=VCFv4.1
##fileDate=20151024
##INFO=<ID=ALT,Number=1,Type=String,Description="Allele B">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  D00270
1   167945  .   C       .   .   ALT=T   GT  0/0
1   290868  .   G   T   .   .   ALT=T   GT  0/1
1   700273  .   A   C   .   .   ALT=C   GT  0/1
1   744314  .   G   A   .   .   ALT=A   GT  1/0
1   765121  .   A   G   .   .   ALT=G   GT  0/1
1   1047386 .   G   A   .   .   ALT=A   GT  1/0
1   1113115 .   T   C   .   .   ALT=C   GT  1/0
1   1623724 .   G       .   .   ALT=A   GT  0/0
1   1627611 .   G       .   .   ALT=C   GT  0/0
1   1664597 .   T   C   .   .   ALT=C   GT  1/1
1   1670775 .   T   C   .   .   ALT=C   GT  1/1

In some rows the ALT column is empty, but it needs a value for the file to be well-formed and usable in downstream analyses. The value that belongs in the ALT column is the letter to the right of ALT= in the INFO column. How could I fill each empty ALT field with that letter? The ideal output would look like this:

##fileformat=VCFv4.1
##fileDate=20151024
##INFO=<ID=ALT,Number=1,Type=String,Description="Allele B">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  D00270
1   167945  .   C   T   .   .   ALT=T   GT  0/0
1   290868  .   G   T   .   .   ALT=T   GT  0/1
1   700273  .   A   C   .   .   ALT=C   GT  0/1
1   744314  .   G   A   .   .   ALT=A   GT  1/0
1   765121  .   A   G   .   .   ALT=G   GT  0/1
1   1047386 .   G   A   .   .   ALT=A   GT  1/0
1   1113115 .   T   C   .   .   ALT=C   GT  1/0
1   1623724 .   G   A   .   .   ALT=A   GT  0/0
1   1627611 .   G   C   .   .   ALT=C   GT  0/0
1   1664597 .   T   C   .   .   ALT=C   GT  1/1
1   1670775 .   T   C   .   .   ALT=C   GT  1/1

Thank you for any suggestions you might have. The file is tab-delimited, if that is helpful.

You can try:

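awk -v OFS="\t" '
    # Default whitespace splitting skips the empty ALT, so broken rows have 9 fields, not 10
    NF==9 && $0 !~ /^#/ {
        split($7,a,"=");      # INFO is field 7 on these rows; a[2] is the base after "ALT="
        $4=$4"\t"a[2];        # append the base after REF, restoring the ALT column
    }
    $0 !~ /^##/ {$1=$1;}      # rebuild $0 with the output field separator
    1' file                   # print every line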

You get:

##fileformat=VCFv4.1
##fileDate=20151024
##INFO=<ID=ALT,Number=1,Type=String,Description="Allele B">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  D00270
1   167945  .   C   T   .   .   ALT=T   GT  0/0
1   290868  .   G   T   .   .   ALT=T   GT  0/1
1   700273  .   A   C   .   .   ALT=C   GT  0/1
1   744314  .   G   A   .   .   ALT=A   GT  1/0
1   765121  .   A   G   .   .   ALT=G   GT  0/1
1   1047386 .   G   A   .   .   ALT=A   GT  1/0
1   1113115 .   T   C   .   .   ALT=C   GT  1/0
1   1623724 .   G   A   .   .   ALT=A   GT  0/0
1   1627611 .   G   C   .   .   ALT=C   GT  0/0
1   1664597 .   T   C   .   .   ALT=C   GT  1/1
1   1670775 .   T   C   .   .   ALT=C   GT  1/1
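
One caveat: the NF==9 test relies on awk's default whitespace splitting, which collapses the two adjacent tabs around an empty ALT. A rough alternative, assuming the file really is tab-delimited throughout and that the base appears as an ALT=... entry in the INFO column (column 8), is to split on tabs and test column 5 directly. The function name fill_alt is made up for illustration, and the loop over ";" is only there in case INFO carries more than one entry, which the snippet does not show:

```shell
# fill_alt is a hypothetical helper name; it reads files given as arguments, or stdin.
fill_alt() {
    awk 'BEGIN { FS = OFS = "\t" }
         # Data rows only (no leading #) whose ALT column, field 5, is empty.
         !/^#/ && NF >= 8 && $5 == "" {
             n = split($8, kv, ";")              # INFO entries, assumed ";"-separated
             for (i = 1; i <= n; i++)
                 if (kv[i] ~ /^ALT=/)
                     $5 = substr(kv[i], 5)       # text after "ALT="
         }
         { print }' "$@"
}

# One broken row and one intact row, built with printf so the tabs are explicit.
printf '1\t167945\t.\tC\t\t.\t.\tALT=T\tGT\t0/0\n1\t290868\t.\tG\tT\t.\t.\tALT=T\tGT\t0/1\n' | fill_alt
```

On the full file this would be run as fill_alt file > fixed.vcf, where fixed.vcf is just a placeholder name. Header lines and rows that already have an ALT value pass through unchanged.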
