Extract substrings from a .gff file using sed / awk / grep

Question

I have a file containing multiple lines like this:

NODE_1_length   Prodigal:2.6    CDS     11      274     .       +       0       ID=PROKKA_00001;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_00001;product=hypothetical protein

I want to extract ID=PROKKA_[whatever number] and everything that comes after 'product=' to obtain output like this:

ID=PROKKA_00001 product=hypothetical protein

I am not very skilled with sed, so I tried to adapt some solutions I found here and elsewhere, but couldn't get them to work. A two-step solution is also fine (one step for the ID, one for the product); I can then merge the two results into a single file.

I would be grateful if you could include an explanation of the regex used.

So far I have tried to split the problem in two (starting with the ID) and tried:

grep -o 'ID=PROKKA_[0-9]{1,5}*'
sed 's/^ID=PROKKA[0-9]*;//g/
grep -Po 'ID="K[^"]*'

but of course none of them worked. Thanks for helping!

Answer 1

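You may use grep -oE (-o prints only the matched text, one match per line; -E enables extended regular expressions). The pattern has two alternatives separated by |: ID=PROKKA_[0-9]+ matches the ID followed by one or more digits, and product=[^;:]+ matches product= followed by everything up to the next ; or : (or the end of the line):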

grep -oE 'ID=PROKKA_[0-9]+|product=[^;:]+' file

ID=PROKKA_00001
product=hypothetical protein

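If you want the result on the same line, use grep + paste. Note that paste -s would join every match in the file into one tab-separated line, so for a file with multiple lines pair the matches two at a time instead (this assumes every line yields both an ID and a product):

grep -oE 'ID=PROKKA_[0-9]+|product=[^;:]+' file | paste -d ' ' - -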
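Since the question also mentions sed, here is a minimal sketch of a one-step sed alternative that prints the ID and the product on the same line. It is not part of the original answer; the function name extract_id_product is hypothetical, and the pattern assumes each line has a single ID=PROKKA_ attribute with product= as the last attribute, as in the sample line.

```shell
# Hypothetical helper (not from the original answer).
# -n : print nothing by default; -E : extended regular expressions.
# .*                    skip everything before the ID
# (ID=PROKKA_[0-9]+)    group 1: the ID and its digits
# .*                    skip the attributes in between
# (product=.*)$         group 2: product= and the rest of the line
# \1 \2                 rebuild the line as "ID product", p prints it
extract_id_product() {
  sed -nE 's/.*(ID=PROKKA_[0-9]+).*(product=.*)$/\1 \2/p' "$@"
}

# Demo on the sample line from the question (fields are tab-separated).
printf 'NODE_1_length\tProdigal:2.6\tCDS\t11\t274\t.\t+\t0\tID=PROKKA_00001;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_00001;product=hypothetical protein\n' \
  | extract_id_product
```

With a real file, call it as extract_id_product file. Lines that lack either attribute are skipped silently, because only lines where the substitution succeeds are printed.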

Answer 1: accepted, score 2, 2018-07-16 14:28:53

从.gff文件中使用sed / awk / grep提取子字符串

问题描述

1 个解决方案

解决方案1 2 已采纳 2018-07-16 14:28:53

解决方案1
2 已采纳 2018-07-16 14:28:53