
Extract a pattern using awk in a specific column

I would like to modify a file (GFF3 format) by keeping only one specific part of the last column.

My file looks like this, with nine tab-separated columns:

NW_015494524.1 Gnomon CDS 1220137 1220159 . - 0 ID=cds20267;Parent=rna22739;Dbxref=GeneID:107513619,Genbank:XP_016006018.1;Name=XP_016006018.1;gbkey=CDS;gene=A3GALT2;product=alpha_1%2C3-galactosyltransferase_2 protein_id=XP_016006018.1

I would like to extract only my gene name (;gene=XXX;) present in the last column ($9). Output:

NW_015494524.1 Gnomon CDS 1220137 1220159 . - 0 A3GALT2

Once this is done, I would like to combine columns 4, 5, 7 and 8 with the value extracted from the 9th column into a single column. Expected output:

A3GALT2 1220137 1220159 - 0

I have tried to use awk to pick out only the pattern gene=xxxx in the last column. My gene names consist of upper-case letters, with or without digits, and are delimited by a semicolon (';') in the ninth column.

awk  FS "[ \t]" '$9 ~/gene=[A-Z0-9]$/ {print $0, $4, $5, $7, $8}' <file>

It is not working. Is there another way to do it with awk, or would sed or grep be better?

Thank you for the help in advance.

The following awk command should help.

awk '{sub(/.*gene=/,"",$(NF-1));sub(/;.*/,"",$(NF-1));$NF=""} 1'  Input_file

Output will be as follows.

NW_015494524.1 Gnomon CDS 1220137 1220159 . - 0 A3GALT2

EDIT: As I mentioned in the comments, I am not sure which output you need. If you need the second output you showed, the following may help.

awk '$9 ~ /.*gene=/{sub(/.*gene=/,"",$(NF-1));sub(/;.*/,"",$(NF-1));print $9,$4,$5,$7,$8}'  Input_file

Output will be as follows.

A3GALT2 1220137 1220159 - 0
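Since the question says the columns are tab-separated, a variant that sets the field separator to a tab may be less fragile, because it does not depend on how many whitespace-separated pieces the attribute column breaks into. The following is only a sketch under that assumption: the sample line is the one from the question with tabs between the nine columns, and the variable names `sample`, `result` and `attr` are illustrative.

```shell
# One tab-delimited sample record, modelled on the line in the question.
sample=$(printf 'NW_015494524.1\tGnomon\tCDS\t1220137\t1220159\t.\t-\t0\tID=cds20267;Parent=rna22739;Dbxref=GeneID:107513619,Genbank:XP_016006018.1;Name=XP_016006018.1;gbkey=CDS;gene=A3GALT2;product=alpha_1%%2C3-galactosyltransferase_2 protein_id=XP_016006018.1\n')

# -F with a tab keeps the whole attribute string in $9.
result=$(printf '%s\n' "$sample" | awk -F'\t' '$9 ~ /(^|;)gene=/ {
    attr = $9
    sub(/^(.*;)?gene=/, "", attr)   # drop everything up to and including "gene="
    sub(/;.*/, "", attr)            # drop everything after the gene name
    print attr, $4, $5, $7, $8
}')
echo "$result"
```

Working on a copy (`attr`) leaves `$9` untouched, so the record is not rebuilt and the remaining fields keep their original content.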

awk solution:

awk '{ split($9,a,";"); print substr(a[6],6),$4,$5,$7,$8 }' file
  • split($9,a,";") - split the 9th field into array of chunks a using ; as separator

  • substr(a[6],6) - take the 6th chunk, gene=XXXXXXXX, and extract the gene name starting at character 6, just after gene=

The output:

A3GALT2 1220137 1220159 - 0
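The split approach above relies on gene= being the 6th attribute. If the attribute order can vary between records, one possible adjustment is to scan the chunks for the one that starts with gene=. This is a sketch, assuming tab-separated input; the names `sample`, `result` and `gene` are illustrative.

```shell
# One tab-delimited sample record, modelled on the line in the question.
sample=$(printf 'NW_015494524.1\tGnomon\tCDS\t1220137\t1220159\t.\t-\t0\tID=cds20267;Parent=rna22739;Dbxref=GeneID:107513619,Genbank:XP_016006018.1;Name=XP_016006018.1;gbkey=CDS;gene=A3GALT2;product=alpha_1%%2C3-galactosyltransferase_2 protein_id=XP_016006018.1\n')

result=$(printf '%s\n' "$sample" | awk -F'\t' '{
    n = split($9, a, ";")            # n is the number of attribute chunks
    gene = ""
    for (i = 1; i <= n; i++)
        if (index(a[i], "gene=") == 1)   # chunk starts with "gene="
            gene = substr(a[i], 6)       # text after the 5-character prefix
    if (gene != "") print gene, $4, $5, $7, $8
}')
echo "$result"
```

Records without a gene= attribute are skipped rather than printed with an empty name.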

A simple awk solution:

$ awk '{match($9,/gene=(\w+);/,a); print a[1],$4,$5,$7,$8}' file
A3GALT2 1220137 1220159 - 0

match($9,/gene=(\w+);/,a): this matches the regex gene=(\w+); against $9 and stores the capture group (\w+) in the array a, so a[1] holds the gene name. Note that the three-argument form of match() is a GNU awk (gawk) extension.
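Where gawk is not available, a similar result can probably be had with the two-argument match(), which sets RSTART and RLENGTH in any POSIX awk. The sketch below assumes tab-separated input and a gene name that runs up to the next semicolon; `sample` and `result` are illustrative names.

```shell
# One tab-delimited sample record, modelled on the line in the question.
sample=$(printf 'NW_015494524.1\tGnomon\tCDS\t1220137\t1220159\t.\t-\t0\tID=cds20267;Parent=rna22739;Dbxref=GeneID:107513619,Genbank:XP_016006018.1;Name=XP_016006018.1;gbkey=CDS;gene=A3GALT2;product=alpha_1%%2C3-galactosyltransferase_2 protein_id=XP_016006018.1\n')

result=$(printf '%s\n' "$sample" | awk -F'\t' 'match($9, /gene=[^;]+/) {
    # RSTART and RLENGTH delimit the matched "gene=..." text;
    # skip the 5-character "gene=" prefix to keep only the name.
    print substr($9, RSTART + 5, RLENGTH - 5), $4, $5, $7, $8
}')
echo "$result"
```

Using match() as the pattern means lines with no gene= attribute produce no output.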

Thanks for the replies and help. Yes, I would like the output as you produced it: only the gene name, position, strand and phase information. These will be used as headers for new FASTA sequences. I will try those commands.

