Extract substring with sed/awk/grep from .gff file

Question

I have a file containing multiple lines like this:

NODE_1_length   Prodigal:2.6    CDS     11      274     .       +       0       ID=PROKKA_00001;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_00001;product=hypothetical protein

And I want to extract the ID=PROKKA_[whatever number] and everything that comes after 'product=' to obtain an output like this:

ID=PROKKA_00001 product=hypothetical protein

I am not very skilled in using sed, so I tried to adapt some solutions I found here and elsewhere, but I didn't manage to get them working. It is also fine if the solution comes in two steps (one for the ID, one for the product); I can then merge the two results into a single file.

I would be grateful if you could include an explanation of the regex used.

So far I have tried to split the problem in two (starting from the ID) with:

grep -o 'ID=PROKKA_[0-9]{1,5}*'
sed 's/^ID=PROKKA[0-9]*;//g/
grep -Po 'ID="K[^"]*'

but of course none of them worked. Thanks for helping!

Answer 1

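You may use grep -oE (-o prints only the matched parts, each on its own line; -E enables extended regular expressions, so + and | work without backslashes):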

grep -oE 'ID=PROKKA_[0-9]+|product=[^;:]+' file

ID=PROKKA_00001
product=hypothetical protein

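If you want the result on the same line, use grep + paste. Since the file has multiple lines, paste - - joins the output two lines at a time, so each record stays on its own line (paste -s would merge the whole file into a single line). This assumes every input line contains both an ID and a product:

grep -oE 'ID=PROKKA_[0-9]+|product=[^;:]+' file | paste -d' ' - -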
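
Since the question asks for sed and an explanation of the regex, here is a minimal sketch of a single-command sed alternative. It is not part of the original answer; it assumes a sed that supports -E (GNU and BSD sed both do) and that product= is the last attribute on the line, as in the sample. The temporary file and the variable names are only there to make the example self-contained.

```shell
# Build a one-line sample input (tab-separated, like the record in the question).
sample=$(mktemp)
printf 'NODE_1_length\tProdigal:2.6\tCDS\t11\t274\t.\t+\t0\tID=PROKKA_00001;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_00001;product=hypothetical protein\n' > "$sample"

# -n                     do not print lines automatically
# -E                     use extended regular expressions
# .*                     skip everything before the ID
# (ID=PROKKA_[0-9]+);    group 1: "ID=PROKKA_" plus one or more digits, up to the ';'
# .*                     skip the attributes in between
# (product=.*)$          group 2: "product=" and the rest of the line
# \1 \2                  replace the whole line with the two groups, space-separated
# p                      print only lines where the substitution succeeded
result=$(sed -nE 's/.*(ID=PROKKA_[0-9]+);.*(product=.*)$/\1 \2/p' "$sample")
echo "$result"

rm -f "$sample"
```

Unlike the grep pattern above, this keeps any colon inside the product name, and lines without both fields are silently skipped.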

Answer 1: 2, accepted, 2018-07-16 14:28:53