I have a malformed variant call format (VCF) file that I am trying to fix, but I am having some trouble. Here's a snippet of the file:
##fileformat=VCFv4.1
##fileDate=20151024
##INFO=<ID=ALT,Number=1,Type=String,Description="Allele B">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT D00270
1 167945 . C . . ALT=T GT 0/0
1 290868 . G T . . ALT=T GT 0/1
1 700273 . A C . . ALT=C GT 0/1
1 744314 . G A . . ALT=A GT 1/0
1 765121 . A G . . ALT=G GT 0/1
1 1047386 . G A . . ALT=A GT 1/0
1 1113115 . T C . . ALT=C GT 1/0
1 1623724 . G . . ALT=A GT 0/0
1 1627611 . G . . ALT=C GT 0/0
1 1664597 . T C . . ALT=C GT 1/1
1 1670775 . T C . . ALT=C GT 1/1
In some rows the ALT column is empty, but it must be filled for the file to be well formed and usable in downstream analyses. The value that belongs in the ALT column appears to the right of ALT= in the INFO column. How can I replace the empty ALT value with the letter to the right of the equals sign in the INFO column? The ideal output would look like this:
##fileformat=VCFv4.1
##fileDate=20151024
##INFO=<ID=ALT,Number=1,Type=String,Description="Allele B">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT D00270
1 167945 . C T . . ALT=T GT 0/0
1 290868 . G T . . ALT=T GT 0/1
1 700273 . A C . . ALT=C GT 0/1
1 744314 . G A . . ALT=A GT 1/0
1 765121 . A G . . ALT=G GT 0/1
1 1047386 . G A . . ALT=A GT 1/0
1 1113115 . T C . . ALT=C GT 1/0
1 1623724 . G A . . ALT=A GT 0/0
1 1627611 . G C . . ALT=C GT 0/0
1 1664597 . T C . . ALT=C GT 1/1
1 1670775 . T C . . ALT=C GT 1/1
Thank you for any suggestions you might have. The file is tab delimited if that is helpful.
You can try:
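awk -v OFS="\t" '
# With default whitespace splitting, data rows that lack ALT have 9 fields, not 10
NF==9 && $0 !~ /^#/ {
    split($7,a,"=");     # INFO is field 7 in these rows; a[2] is the base after "ALT="
    $4=$4"\t"a[2];       # append the base after REF, which creates the ALT column
}
$0 !~ /^##/ {$1=$1;}     # rebuild $0 so every non-## line is joined with tabs
1' file                  # print every line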
You get:
##fileformat=VCFv4.1
##fileDate=20151024
##INFO=<ID=ALT,Number=1,Type=String,Description="Allele B">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT D00270
1 167945 . C T . . ALT=T GT 0/0
1 290868 . G T . . ALT=T GT 0/1
1 700273 . A C . . ALT=C GT 0/1
1 744314 . G A . . ALT=A GT 1/0
1 765121 . A G . . ALT=G GT 0/1
1 1047386 . G A . . ALT=A GT 1/0
1 1113115 . T C . . ALT=C GT 1/0
1 1623724 . G A . . ALT=A GT 0/0
1 1627611 . G C . . ALT=C GT 0/0
1 1664597 . T C . . ALT=C GT 1/1
1 1670775 . T C . . ALT=C GT 1/1
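The command above relies on awk's default whitespace splitting and assumes that ALT= is the only entry in INFO. As a rough sketch of a stricter variant, the function below splits on real tabs only and handles both ways the column can be missing: the field is absent (9 columns) or present but empty (10 columns with an empty fifth field). The names fix_alt and info_alt are hypothetical helpers introduced here, and the sketch assumes a single-sample, truly tab-delimited file like the snippet in the question.

```shell
# Hypothetical helper: fill a missing or empty ALT column from the ALT= tag in INFO.
# Assumes a single-sample VCF whose columns are separated by real tab characters.
fix_alt() {
  awk '
    BEGIN { FS = OFS = "\t" }

    # Return the value of the ALT= entry in a semicolon-separated INFO string,
    # or an empty string if there is no such entry.
    function info_alt(info,    n, parts, i) {
      n = split(info, parts, ";")
      for (i = 1; i <= n; i++)
        if (parts[i] ~ /^ALT=/) return substr(parts[i], 5)
      return ""
    }

    /^#/ { print; next }                      # header lines pass through unchanged

    NF == 9 {                                 # ALT field absent: INFO is field 7
      alt = info_alt($7)
      if (alt != "") $4 = $4 OFS alt          # insert ALT right after REF
      print; next
    }

    NF == 10 && $5 == "" {                    # ALT field present but empty: INFO is field 8
      alt = info_alt($8)
      if (alt != "") $5 = alt
      print; next
    }

    { print }                                 # rows that are already complete
  ' "$@"
}
```

It could be run as fix_alt input.vcf > fixed.vcf, where both file names are placeholders. Rows without an ALT= entry in INFO are left as they are, so they can be inspected separately rather than silently filled.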