Extract text after last delimiter and attach at end of line [Linux/Ubuntu]

I have a FASTA file that looks like this:

>sequence_1_g1
ATTTCGGATAA
>sequence_2_g1
AGGCTCTAGGA
>sequence_2_g2
TGTTCTGAAAT
>sequence_2_g3
CACCTCGGAGT
>sequence_3_new_g1
GCGGATAAAGC

I'd like to extract only the number that comes after the last delimiter and attach it to the end of each header, so that the output looks like this:

>sequence_1_g1_1
ATTTCGGATAA
>sequence_2_g1_1
AGGCTCTAGGA
>sequence_2_g2_2
TGTTCTGAAAT
>sequence_2_g3_3
CACCTCGGAGT
>sequence_3_new_g1_1
GCGGATAAAGC

I've never used Linux before, and so far I've only been able to find this command, which isolates the text that comes after the last delimiter: sed -E 's/.*_//' filename.fasta. Can anyone suggest which other commands I should look into to get my desired output?

1st solution: With your shown samples, please try the following awk code. It was written and tested in GNU awk and should work in any version of awk.

awk '/^>/{$0=$0 "_" substr($0,length($0))} 1' Input_file
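One thing to keep in mind: substr($0,length($0)) returns only the final character of the line, so the 1st solution fits the shown samples, where every suffix is a single digit. A minimal sketch to check this, assuming a made-up header >sample_g12 and a hypothetical wrapper function last_char_suffix (neither is from the question):

```shell
#!/bin/sh
# Hypothetical wrapper around the 1st solution so it can read from a pipe.
last_char_suffix() {
    awk '/^>/{$0=$0 "_" substr($0,length($0))} 1'
}

# ">sample_g12" is a made-up header with a two-digit suffix.
printf '%s\n' '>sequence_2_g3' '>sample_g12' | last_char_suffix
```

The first header becomes >sequence_2_g3_3 as wanted, while the second becomes >sample_g12_2, so a pattern that matches all trailing digits is safer if suffixes can exceed 9.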

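2nd solution: Using GNU awk's match function with a regex and the value of its capturing group, please try the following. The [^0-9] before the group keeps the greedy .* from swallowing all but the last digit of a multi-digit number.

awk 'match($0,/^>.*[^0-9]([0-9]+)$/,arr){$0=$0"_"arr[1]} 1'  Input_file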
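The three-argument form of match is a GNU awk extension. As a rough sketch for other awk implementations, the same idea can be written with the two-argument match and the RSTART variable it sets. The function name append_suffix and the header >sequence_4_g12 are illustrative assumptions, not part of the question:

```shell
#!/bin/sh
# Hypothetical helper: append the trailing number of each FASTA header to that header.
append_suffix() {
    awk '
        # Header lines only: match() sets RSTART to where the trailing digits begin.
        /^>/ && match($0, /[0-9]+$/) { $0 = $0 "_" substr($0, RSTART) }
        { print }
    '
}

# Demo input, including a two-digit suffix that is not in the original sample.
printf '%s\n' '>sequence_2_g3' 'CACCTCGGAGT' '>sequence_4_g12' 'ACGTACGTACG' | append_suffix
```

To process a file instead of the demo input, something like append_suffix < Input_file > Output_file should work.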

3rd solution: Assuming the header lines (those starting with >) always have _g right before the trailing number, we can also simply try the following awk code. It uses the last field, $NF, so an earlier _g in the header does not get in the way.

awk -F'_g' '/^>/{$0=$0"_"$NF} 1'  Input_file

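4th solution: If a perl one-liner is acceptable, you could simply use perl's capturing groups (which are filled in when the regex matches). The non-greedy .*? lets the second group capture every trailing digit, and $1/$2 are the preferred way to refer to groups in the replacement.

perl -pe 's/^(>.*?)([0-9]+)$/$1$2_$2/'  Input_file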

You may try this sed command. It looks for > at the start of a line and, on those lines only, matches one or more digits at the end and replaces them with number_number (in the replacement, & stands for the matched text):

sed -E '/^>/s/[0-9]+$/&_&/' file

>sequence_1_g1_1
ATTTCGGATAA
>sequence_2_g1_1
AGGCTCTAGGA
>sequence_2_g2_2
TGTTCTGAAAT
>sequence_2_g3_3
CACCTCGGAGT
>sequence_3_new_g1_1
GCGGATAAAGC
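As a quick way to convince yourself that the /^>/ address and the &_& replacement behave as described, here is a small sketch. The function name duplicate_suffix and the header >sample_g27 are made up for illustration:

```shell
#!/bin/sh
# In a sed replacement, & stands for the whole text matched by the pattern.
duplicate_suffix() {
    # /^>/ restricts the substitution to header lines.
    sed -E '/^>/s/[0-9]+$/&_&/'
}

# Demo input with a one-digit and a two-digit suffix.
printf '%s\n' '>sequence_3_new_g1' 'GCGGATAAAGC' '>sample_g27' 'TTGACCA' | duplicate_suffix
```

Because the pattern is [0-9]+ anchored at the end of the line, multi-digit suffixes are copied whole, and sequence lines are left untouched.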

Using sed

$ sed -E 's/.*_.([0-9]+)/&_\1/' input_file
>sequence_1_g1_1
ATTTCGGATAA
>sequence_2_g1_1
AGGCTCTAGGA
>sequence_2_g2_2
TGTTCTGAAAT
>sequence_2_g3_3
CACCTCGGAGT
>sequence_3_new_g1_1
GCGGATAAAGC
