
Extract a pattern using awk in a specific column

I would like to modify a file (GFF3 format) by keeping only one specific part of the last column.

My file looks like this, with nine tab-separated columns:

NW_015494524.1 Gnomon CDS 1220137 1220159 . - 0 ID=cds20267;Parent=rna22739;Dbxref=GeneID:107513619,Genbank:XP_016006018.1;Name=XP_016006018.1;gbkey=CDS;gene=A3GALT2;product=alpha_1%2C3-galactosyltransferase_2 protein_id=XP_016006018.1

I would like to extract only the gene name (;gene=XXX;) from the last column ($9). Output:

NW_015494524.1 Gnomon CDS 1220137 1220159 . - 0 A3GALT2

Once this is done, I would like to combine columns 4, 5, 7 and 8 with the value extracted from the 9th column into a single column. Expected output:

A3GALT2 1220137 1220159 - 0

I have tried to use awk to take only the pattern gene=xxxx in the last column. My gene names are upper-case letters, with or without digits, and are delimited by semicolons (';') in the ninth column.

awk  FS "[ \t]" '$9 ~/gene=[A-Z0-9]$/ {print $0, $4, $5, $7, $8}' <file>

It is not working. Is there another way to do it with awk, or would sed or grep be better?

Thank you in advance for the help.

The following awk should help you with this.

awk '{sub(/.*gene=/,"",$(NF-1));sub(/\;.*/,"",$(NF-1));$NF=""} 1'  Input_file

The output will be as follows.

NW_015494524.1 Gnomon CDS 1220137 1220159 . - 0 A3GALT2

EDIT: As I mentioned in the comments, I am not sure which output you need. If you need the second output you showed, the following may help.

awk '$9 ~ /.*gene=/{sub(/.*gene=/,"",$(NF-1));sub(/\;.*/,"",$(NF-1));print $9,$4,$5,$7,$8} '  Input_file

The output will be as follows.

A3GALT2 1220137 1220159 - 0
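Since the question describes the columns as tab-separated, a variant that splits on tabs only may be easier to reason about, because $9 is then always the whole attribute column regardless of any spaces inside it. This is only a sketch: the sample record below is abbreviated and hypothetical, and the variable names `sample`, `gene` and `result` are chosen here for illustration.

```shell
# Hypothetical, abbreviated sample record; the columns are joined with real tabs.
sample=$(printf 'NW_015494524.1\tGnomon\tCDS\t1220137\t1220159\t.\t-\t0\tID=cds20267;gbkey=CDS;gene=A3GALT2;product=alpha_1%%2C3-galactosyltransferase_2 protein_id=XP_016006018.1')

# -F'\t' splits on tabs only, so the space inside column 9 does not create a 10th field.
result=$(printf '%s\n' "$sample" | awk -F'\t' '
$9 ~ /gene=/ {
    gene = $9                  # work on a copy so $9 and $0 stay untouched
    sub(/.*gene=/, "", gene)   # drop everything up to and including "gene="
    sub(/;.*/, "", gene)       # drop everything from the next ";" onwards
    print gene, $4, $5, $7, $8
}')

echo "$result"
```

Working on a copy of $9 avoids rebuilding the record, which would otherwise replace the tabs with the default output separator.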

awk solution:

awk '{ split($9,a,";"); print substr(a[6],6),$4,$5,$7,$8 }' file
  • split($9,a,";") - split the 9th field into an array of chunks a, using ; as the separator

  • substr(a[6],6) - extract the needed gene name from the substring gene=XXXXXXXX

The output:

A3GALT2 1220137 1220159 - 0
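The split approach above assumes that gene= is always the 6th attribute. If the attribute order can vary between records, one possible adjustment is to loop over the chunks and pick the one whose key is gene. This is a sketch only: the function name `gene_fields` and the test record are hypothetical, not part of the original answer.

```shell
# gene_fields: hypothetical helper; reads GFF3-like lines from files or stdin.
gene_fields() {
    awk '{
        gene = ""
        n = split($9, attrs, ";")            # attrs[1..n] hold the key=value chunks
        for (i = 1; i <= n; i++)
            if (attrs[i] ~ /^gene=/)         # the chunk whose key is exactly "gene"
                gene = substr(attrs[i], 6)   # skip the 5 characters of "gene="
        if (gene != "")                      # lines without a gene attribute are skipped
            print gene, $4, $5, $7, $8
    }' "$@"
}

# Hypothetical record where gene= is the 3rd attribute rather than the 6th.
result=$(printf 'NW_015494524.1 Gnomon CDS 1220137 1220159 . - 0 ID=cds20267;Parent=rna22739;gene=A3GALT2;gbkey=CDS\n' | gene_fields)
echo "$result"
```

On a real file it would be called as `gene_fields file`.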

A simple awk solution:

$ awk '{match($9,/gene=(\w+);/,a); print a[1],$4,$5,$7,$8}' file
A3GALT2 1220137 1220159 - 0

match($9,/gene=(\w+);/,a): this matches the regex gene=(\w+); in $9 and stores the capture group (\w+) in array a, so a[1] holds the gene name.
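The three-argument form of match() and the \w shorthand are GNU awk (gawk) extensions. On other awk implementations, a roughly equivalent sketch can use the two-argument match() together with the RSTART and RLENGTH variables it sets. The sample record is abbreviated and hypothetical, and the pattern `gene=[^;]+` is an assumption made here: it takes everything up to the next semicolon (or the end of the field) rather than requiring a trailing `;`.

```shell
# Hypothetical, abbreviated sample record.
result=$(printf 'NW_015494524.1 Gnomon CDS 1220137 1220159 . - 0 ID=cds20267;gbkey=CDS;gene=A3GALT2;product=x\n' | awk '
match($9, /gene=[^;]+/) {                         # sets RSTART and RLENGTH on success
    gene = substr($9, RSTART + 5, RLENGTH - 5)    # skip the 5 characters of "gene="
    print gene, $4, $5, $7, $8
}')

echo "$result"
```

Because match() returns 0 when nothing is found, using it as the pattern also skips lines that have no gene attribute.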

Thanks for the replies and help. Yes, I would like the output as you produced it: keep only the gene name, position, strand and phase info. They will be used as headers for new FASTA sequences. I will try those commands.
