Awk add one more column if 3rd column equals string
I have a VCF file (tab delimited) where some "RPB" values went missing in the 2nd column, which shifted the rest of the line to the left.
I have the following:
1 AF1=23 AC1=23
2 RPB=123 AF1=23 AC1=23
3 AF1=23 AC1=23
I need the following:
1 NULL AF1=23 AC1=23
2 RPB=123 AF1=23 AC1=23
3 NULL AF1=23 AC1=23
I tried this, and it failed miserably:
awk 'if($2="AF1%" {print $1,"\t"NULL"\t", print$2, print$3}' input.vcf > output.vcf
I have to import this VCF into MySQL, so the tab delimitation has to be preserved. Any ideas?
$ awk 'NF<4{sub(/\t/,"&NULL&")}1' file
1 NULL AF1=23 AC1=23
2 RPB=123 AF1=23 AC1=23
3 NULL AF1=23 AC1=23
By the way, you weren't TOO far off a functional solution with your attempt:
awk 'if($2="AF1%" {print $1,"\t"NULL"\t", print$2, print$3}' input.vcf
This minimally altered version would have produced the output you want:
awk '{if($2~/^AF1/) print $1 "\tNULL\t" $2 "\t" $3; else print}' input.vcf
but as you can see, that's not a very idiomatic approach.
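For readers new to awk, the NF<4 one-liner above can be spelled out with comments. This is only a sketch of the same idea: pad_rpb is a hypothetical helper name, and the sample lines are taken from the question.

```shell
# Expanded form of the NF<4 one-liner, written for reading rather than typing.
# pad_rpb is a hypothetical name; it reads the given file(s), or stdin if none.
pad_rpb() {
  awk '
    NF < 4 {                # the line has fewer than 4 whitespace-separated fields
      sub(/\t/, "&NULL&")   # "&" stands for the matched tab: first tab becomes tab NULL tab
    }
    1                       # always-true pattern: print the (possibly modified) line
  ' "$@"
}

# Two lines from the question: one missing the RPB column, one complete.
printf '1\tAF1=23\tAC1=23\n2\tRPB=123\tAF1=23\tAC1=23\n' | pad_rpb
```

Because sub() only touches the first tab, complete lines pass through unchanged.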
This awk one-liner would help you:
kent$ awk -F'\t' -v OFS='\t' '!($2~/^RPB=/){$2="NULL\t"$2}7' file
1 NULL AF1=23 AC1=23
2 RPB=123 AF1=23 AC1=23
3 NULL AF1=23 AC1=23
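Since the file is headed for MySQL, it may be worth confirming that every line ends up with the same number of tab-separated fields before importing. A minimal sketch, assuming four columns as in the sample; check_fields is a hypothetical helper, not part of any answer here.

```shell
# check_fields is a hypothetical helper: it reads stdin, reports every line whose
# tab-separated field count differs from the expected count (default 4, assumed
# from the sample data), and returns a non-zero status if any such line exists.
check_fields() {
  awk -F'\t' -v want="${1:-4}" '
    NF != want { printf "line %d has %d fields\n", NR, NF; bad = 1 }
    END { exit bad + 0 }
  '
}

# The second line is still missing its RPB column, so the check should complain.
printf '1\tNULL\tAF1=23\tAC1=23\n3\tAF1=23\tAC1=23\n' | check_fields 4 || echo "needs fixing"
```

On a real file this would be run as check_fields 4 < output.vcf, where output.vcf is whatever name the fixed file was given.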
IMHO you shouldn't use a regex. Try this:
#!/bin/bash
cat input.vcf |\
perl -ane '
BEGIN{$c=0;$max_fields=0}
$c2=0;
foreach(@F){
$a[$c][$c2]=$_;
if( $c2 > $max_fields ) {
$max_fields=$c2;
}
$c2++
}
$c++;
END{
foreach $i (@a){
while (@$i < $max_fields + 1 ){
unshift (@$i,"NULL");
}
}
foreach $i (@a){
foreach $x (@$i){
print $x,"\t";
}
print "\n";
}
}'
Output:
bash test.sh
NULL AF1=23 AC1=23
RPB=123 AF1=23 AC1=23
NULL AF1=23 AC1=23
Explanation:
Based on a tab delimited input file:
awk -v OFS="\t" 'NF==3{$1=$1 OFS "NULL"} 1' input.vcf
where it could be altered to the following if the input file isn't tab delimited:
awk -v OFS="\t" '{$1=$1 (NF==3 ? OFS "NULL" : "")} 1' input.vcf
In either, when NF==3 the first field is re-assigned to contain the missing data. In the first example, only the output delimiters of the altered lines need adjusting, but when the data isn't tab delimited, each line needs to be "re-computed" by re-assigning a field before the trailing 1, which prints the whole line.
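The difference between the two commands only shows up on input that isn't tab delimited. The following sketch runs both on a space-delimited sample (an assumption for illustration, since the question's file uses tabs); the function names altered_only and every_line are hypothetical, and sed -n l is used just to make tabs visible as \t.

```shell
# Space-delimited sample, assumed for illustration only.
sample='1 AF1=23 AC1=23
2 RPB=123 AF1=23 AC1=23'

# Variant 1: only the 3-field line is rebuilt with OFS; the complete line keeps its spaces.
altered_only() { awk -v OFS='\t' 'NF==3{$1=$1 OFS "NULL"} 1'; }

# Variant 2: $1 is assigned on every line, so every line is rebuilt with OFS.
every_line() { awk -v OFS='\t' '{$1=$1 (NF==3 ? OFS "NULL" : "")} 1'; }

echo "altered lines only:"
printf '%s\n' "$sample" | altered_only | sed -n l

echo "every line:"
printf '%s\n' "$sample" | every_line | sed -n l
```

With the first variant the complete line still contains spaces, which is why the second form is the one to use when the input delimiter is not already a tab.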
The beauty of Ed's answer, when the input file is tab delimited, is that the line isn't "re-computed" at all when the substitution takes place: sub() edits the record in place and replaces only the first delimiter, so the remaining delimiters are left untouched.