
Awk add one more column if 3rd column equals string

I have a VCF file (tab delimited) where some "RPB" values went missing from the 2nd column, which shifted the rest of the line to the left.

I have the following:

1   AF1=23  AC1=23
2   RPB=123 AF1=23  AC1=23
3   AF1=23  AC1=23

I need the following:

1   NULL    AF1=23  AC1=23
2   RPB=123 AF1=23  AC1=23
3   NULL    AF1=23  AC1=23

I tried this, and it failed miserably:

awk 'if($2="AF1%" {print $1,"\t"NULL"\t", print$2, print$3}' input.vcf > output.vcf

I have to import this VCF into MySQL, so the tab delimiters have to be preserved. Any ideas?

$ awk 'NF<4{sub(/\t/,"&NULL&")}1' file
1       NULL    AF1=23  AC1=23
2       RPB=123 AF1=23  AC1=23
3       NULL    AF1=23  AC1=23

By the way, you weren't TOO far off a functional solution with your attempt:

awk 'if($2="AF1%" {print $1,"\t"NULL"\t", print$2, print$3}' input.vcf

This minimally altered version would have produced the output you want:

awk '{if($2~/^AF1/) print $1 "\tNULL\t" $2 "\t" $3; else print}' input.vcf

But as you can see, that's not a very idiomatic approach.
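
For comparison, here is a rough sketch of what a more idiomatic shape could look like: awk's own "pattern { action }" form instead of an explicit if/else inside one big action. It assumes tab-delimited input, and the function name add_null_column is made up for illustration.

```shell
# Hypothetical helper: insert a NULL second column on rows whose
# second field starts with AF1 (the rows where RPB went missing).
add_null_column() {
  awk 'BEGIN { FS = OFS = "\t" }
       $2 ~ /^AF1/ { $1 = $1 OFS "NULL" }   # pattern selects the shifted rows
       1                                    # always-true pattern: print every record
  ' "$@"
}

# Self-contained demo on two sample rows from the question
printf '1\tAF1=23\tAC1=23\n2\tRPB=123\tAF1=23\tAC1=23\n' | add_null_column
```

Rows that already carry an RPB value fall straight through to the trailing 1 and are printed unchanged.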

This awk one-liner would help you:

kent$  awk -F'\t' -v OFS='\t' '!($2~/^RPB=/){$2="NULL\t"$2}7' file
1       NULL    AF1=23  AC1=23
2       RPB=123 AF1=23  AC1=23
3       NULL    AF1=23  AC1=23
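
In case the trailing 7 looks odd, the same one-liner can be spread out with comments. This is only an annotated sketch of the command above; the function name insert_null_if_no_rpb is a made-up wrapper, not part of the original answer.

```shell
# Hypothetical wrapper around the one-liner, spread out for readability.
insert_null_if_no_rpb() {
  awk -F'\t' -v OFS='\t' '
    # Rows whose 2nd field is not an RPB= entry get NULL pushed in front of it
    !($2 ~ /^RPB=/) { $2 = "NULL" OFS $2 }
    # Any non-zero constant is a true pattern with the default action (print)
    7
  ' "$@"
}

# Self-contained demo on two sample rows from the question
printf '1\tAF1=23\tAC1=23\n2\tRPB=123\tAF1=23\tAC1=23\n' | insert_null_if_no_rpb
```

Because the test is "not RPB" rather than "starts with AF1", this variant does not depend on which field happens to follow the missing one.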

IMHO you shouldn't use a regex. Try this:

#!/bin/bash
cat input.vcf |\
perl -ane '
    BEGIN{$c=0;$max_fields=0}
    $c2=0;
    foreach(@F){
        $a[$c][$c2]=$_;
        if( $c2  > $max_fields ) {
            $max_fields=$c2; 
        }
        $c2++
    }
    $c++;
    END{
        foreach $i (@a){
            while (@$i < $max_fields + 1 ){
                unshift (@$i,"NULL");   
            }  
        }
        foreach $i (@a){
            foreach $x (@$i){
                print $x,"\t";
            }
            print "\n";
        }
    }'

Output:

bash test.sh 
NULL    AF1=23  AC1=23  
RPB=123 AF1=23  AC1=23  
NULL    AF1=23  AC1=23  

Explanation:

  1. The code above creates a 2D array (rows/fields)
  2. It also stores max_fields, the highest field index seen in any row
  3. For each row, if the number of fields is less than the maximum, it inserts "NULL" at the beginning of the row until the row is full
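
The three steps above can also be sketched in awk, the tool the question asked about. This is a loose port under the same assumptions as the Perl version, not a replacement for it: input is tab delimited, and the padding goes at the start of the row, so it only fits data where the missing field is the first one, as in the sample output above. The function name pad_rows is made up.

```shell
# Hypothetical helper: left-pad short rows with NULL up to the widest row.
pad_rows() {
  awk 'BEGIN { FS = OFS = "\t" }
       # Steps 1 and 2: remember every row and track the largest field count
       { line[NR] = $0; if (NF > max) max = NF }
       END {
         # Step 3: prepend NULL to each short row until it has max fields
         for (i = 1; i <= NR; i++) {
           n = split(line[i], f, FS)
           while (n < max) { line[i] = "NULL" OFS line[i]; n++ }
           print line[i]
         }
       }' "$@"
}

# Self-contained demo
printf 'AF1=23\tAC1=23\nRPB=123\tAF1=23\tAC1=23\n' | pad_rows
```

Like the Perl script, this holds the whole file in memory, which the streaming one-liners in the other answers avoid.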

For a tab-delimited input file:

awk -v OFS="\t" 'NF==3{$1=$1 OFS "NULL"} 1' input.vcf

It could be altered to the following if the input file isn't tab delimited:

awk -v OFS="\t" '{$1=$1 (NF==3 ? OFS "NULL" : "")} 1' input.vcf

In either case, when NF==3 the first field is re-assigned so that it also carries the missing NULL column. In the first example only the altered lines have their output delimiters rebuilt, which is fine because the untouched lines are already tab delimited. When the data isn't tab delimited, every line needs to be "re-computed" by a field re-assignment before the 1, which prints the whole line.
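
The "re-computing" mentioned here is standard awk behaviour: assigning to any field makes awk rebuild the record, joining the fields with OFS. A small sketch on a made-up space-delimited row shows the difference:

```shell
# Space-delimited sample row; awk splits it on whitespace by default
row='2 RPB=123 AF1=23 AC1=23'

# Without any field assignment, $0 is printed untouched: the spaces remain
printf '%s\n' "$row" | awk -v OFS="\t" '1'

# Assigning to a field makes awk rebuild $0, joining the fields with OFS
printf '%s\n' "$row" | awk -v OFS="\t" '{ $1 = $1 } 1'
```

The first command prints the row with its original spaces; the second prints the same four fields separated by tabs.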

The beauty of Ed's answer (the sub() one-liner) when the input file is tab delimited is that the line never has to be "re-computed" with a new output delimiter: the substitution edits the record in place and touches only the first delimiter.


 