I would like to modify a file (GFF3 format) by keeping only one specific part of the last column.
My file looks like this, with the nine columns separated by tabs:
NW_015494524.1 Gnomon CDS 1220137 1220159 . - 0 ID=cds20267;Parent=rna22739;Dbxref=GeneID:107513619,Genbank:XP_016006018.1;Name=XP_016006018.1;gbkey=CDS;gene=A3GALT2;product=alpha_1%2C3-galactosyltransferase_2 protein_id=XP_016006018.1
I would like to extract only my gene name (;gene=XXX;) present in the last column ($9). Output:
NW_015494524.1 Gnomon CDS 1220137 1220159 . - 0 A3GALT2
Once this is done, I would like to combine columns 4, 5, 7 and 8 with the value extracted from the 9th column into a single column. Expected output:
A3GALT2 1220137 1220159 - 0
I have tried to use awk to take only the pattern gene=xxxx in the last column. My gene names are upper-case letters, with or without numbers, and are delimited by semicolons (;) in the ninth column.
awk FS "[ \t]" '$9 ~/gene=[A-Z0-9]$/ {print $0, $4, $5, $7, $8}' <file>
It is not working. Is there another way to do it with awk, or would sed or grep be better?
Thank you for the help in advance.
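The following awk command should help. It relies on awk's default whitespace splitting and on the space before protein_id shown in the sample line, so that the field holding gene= is $(NF-1).
awk '{sub(/.*gene=/,"",$(NF-1));sub(/;.*/,"",$(NF-1));$NF=""} 1' Input_file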
Output will be as follows.
NW_015494524.1 Gnomon CDS 1220137 1220159 . - 0 A3GALT2
EDIT: As I mentioned in the comments, I was unsure which output you need. If you need the second output you showed, the following may help.
awk '$9 ~ /gene=/{sub(/.*gene=/,"",$9);sub(/;.*/,"",$9);print $9,$4,$5,$7,$8}' Input_file
Output will be as follows.
A3GALT2 1220137 1220159 - 0
awk solution:
awk '{ split($9,a,";"); print substr(a[6],6),$4,$5,$7,$8 }' file
split($9,a,";") - splits the 9th field into an array a of chunks, using ; as the separator
substr(a[6],6) - extracts the gene name from the substring gene=XXXXXXXX (this relies on gene= being the 6th attribute, as in the sample line)
The output:
A3GALT2 1220137 1220159 - 0
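If gene= is not always the sixth attribute, a position-independent variant may be safer. The following is only a sketch under a few assumptions: the file is truly tab-separated, the sample record is rebuilt here with printf purely for illustration (with the attributes shortened and reordered), and the variable names line, out and attrs are made up for this example.

```shell
# Illustrative record with real tabs between the nine columns (assumed layout);
# gene= is deliberately NOT the sixth attribute here.
line=$(printf 'NW_015494524.1\tGnomon\tCDS\t1220137\t1220159\t.\t-\t0\t%s\n' \
  'ID=cds20267;Parent=rna22739;gbkey=CDS;gene=A3GALT2;product=alpha_1%2C3-galactosyltransferase_2')

out=$(printf '%s\n' "$line" | awk -F'\t' '{
    n = split($9, attrs, ";")            # break column 9 into key=value chunks
    for (i = 1; i <= n; i++)
        if (attrs[i] ~ /^gene=/)         # pick the gene attribute wherever it sits
            print substr(attrs[i], 6), $4, $5, $7, $8   # drop the 5-char "gene=" prefix
}')
echo "$out"
```

On a real file, the same awk program would be run as awk -F'\t' '...' file; lines without a gene= attribute print nothing.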
A simple awk solution (the three-argument form of match() is a GNU awk extension, so this needs gawk):
$ awk '{match($9,/gene=(\w+);/,a); print a[1],$4,$5,$7,$8}' file
A3GALT2 1220137 1220159 - 0
match($9,/gene=(\w+);/,a): this matches the regex gene=(\w+); against $9 and stores the capture group (\w+) in array a, so a[1] holds the gene name. And that's it.
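Where gawk is not available, roughly the same idea can be written with the two-argument match(), which sets RSTART and RLENGTH in any POSIX awk. This is a sketch rather than a drop-in replacement: it assumes tab-separated input, uses [^;]+ instead of \w+ for the gene name, and the sample record and the variable names line and out are invented for the example.

```shell
# Illustrative tab-separated record (assumed layout), built with printf.
line=$(printf 'NW_015494524.1\tGnomon\tCDS\t1220137\t1220159\t.\t-\t0\t%s\n' \
  'ID=cds20267;Parent=rna22739;gbkey=CDS;gene=A3GALT2;product=alpha_1%2C3-galactosyltransferase_2')

out=$(printf '%s\n' "$line" | awk -F'\t' '{
    # match() returns non-zero on success and sets RSTART/RLENGTH
    if (match($9, /gene=[^;]+/))
        # skip the 5 characters of "gene=" and keep the rest of the match
        print substr($9, RSTART + 5, RLENGTH - 5), $4, $5, $7, $8
}')
echo "$out"
```

Unlike the capture-group version, this does not require a trailing ; after the gene name, so it should also work when gene= is the last attribute.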
Thanks for the replies and help. Yes, I would like the output as you made it: keep only the gene name, position, strand and phase info. They will be used as headers for new FASTA sequences. I will try those commands.