
Extract substring with sed/awk/grep from .gff file

I have a file containing multiple lines like this:

NODE_1_length   Prodigal:2.6    CDS     11      274     .       +       0       ID=PROKKA_00001;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_00001;product=hypothetical protein

And I want to extract the ID=PROKKA_[whatever number] and everything that comes after 'product=' to obtain an output like this:

ID=PROKKA_00001 product=hypothetical protein

I am not very skilled with sed, so I tried to adapt some solutions I found here and elsewhere, but I could not get them to work. A two-step solution is also fine (one step for the ID, one for the product); I can then merge the two results into a single file.

I would be grateful if you could include an explanation of the regex used.

So far I have split the problem in two and, starting with the ID, tried:

grep -o 'ID=PROKKA_[0-9]{1,5}*'
sed 's/^ID=PROKKA[0-9]*;//g/
grep -Po 'ID="K[^"]*'

but of course none of them worked. Thanks for helping!

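You can use grep -oE. The -o option prints only the matched parts of each line, one match per output line, and -E enables extended regular expressions. The pattern has two alternatives separated by |: ID=PROKKA_[0-9]+ matches the ID followed by one or more digits, and product=[^;:]+ matches product= followed by one or more characters that are neither ; nor :.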

grep -oE 'ID=PROKKA_[0-9]+|product=[^;:]+' file

ID=PROKKA_00001
product=hypothetical protein

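If you want both values on the same line, pipe the output through paste. Note that paste -s would join every match in the file into one long tab-separated line, so for a file with multiple records use paste -d ' ' - - instead, which joins the output two lines at a time with a space. This assumes every input line yields exactly one ID match and one product match:

grep -oE 'ID=PROKKA_[0-9]+|product=[^;:]+' file | paste -d ' ' - -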
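Since the question asked about sed, here is a minimal single-pass sketch that produces the space-separated output directly, with the regex explained in the comments. The function name extract_id_product is hypothetical, and the sample line is rebuilt from the question on the assumption that the real file is tab-separated, as GFF files normally are. Unlike the grep pattern above, this version lets the product run up to the next ; or the end of the line, so a colon inside the product name is kept.

```shell
# Sample record from the question, rebuilt with tab separators (assumption).
line=$(printf 'NODE_1_length\tProdigal:2.6\tCDS\t11\t274\t.\t+\t0\tID=PROKKA_00001;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_00001;product=hypothetical protein')

# -n                  print nothing unless a command asks for it
# -E                  use extended regular expressions
# .*                  skip everything before the ID
# (ID=PROKKA_[0-9]+)  group 1: the ID and its digits
# .*                  skip the attributes in between
# (product=[^;]*)     group 2: "product=" up to the next ';' or end of line
# .*                  skip whatever follows the product
# \1 \2               replace the whole line with the two groups, space-separated
# p                   print only lines where the substitution succeeded
extract_id_product() {
    sed -nE 's/.*(ID=PROKKA_[0-9]+).*(product=[^;]*).*/\1 \2/p' "$@"
}

result=$(printf '%s\n' "$line" | extract_id_product)
printf '%s\n' "$result"
```

On a real file the function can be called as extract_id_product file. Lines that lack either an ID or a product attribute are skipped silently, which may or may not be what you want.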


 