I'm working on the following script for a research project at my school:
for f in $(ls Illumina_Data/Aphyllon/PE150_2016_04_05* ); do
    if [[ "${f}" == *"_R1"* ]]; then
        echo "INITIALIZE THE SEQUENCE"
        # Print the sample name: the path with everything from _R1 onward removed
        echo `basename "${f%%_R1*}"`
        get_organelle_from_reads.py -1 "${f%%_R1*}_R1_001.fastq.gz" \
            -2 "${f%%_R1*}_R2_001.fastq.gz" \
            -o Sequenced_Aphyllon_Data/`basename "${f%%_R1*}"` \
            -R 15 -k 21,45,65,85,105 -F embplant_pt
    fi
done
Right now this script produces rather long output names, and we'd like them to be shorter for organization's sake. Take a look at the -o option and the part that reads Sequenced_Aphyllon_Data/`basename "${f%%_R1*}"`. What this produces is the entire name of the fastq file we originally used, which has the following format:
A_speciesname_IDtag_(some set of numbers and letters)_(some set of numbers and letters)_(some set of numbers and letters)_(some set of numbers and letters)
The issue I'm having is that we want the A_speciesname_IDtag section to remain, but sometimes our reads don't contain the IDtag section, so we would need to cut at either the second or the third _ from the left. However, there are always four _ from the right, without fail.
So is there a way to specifically target an _ counting from the right of a string? Counting from the right, the number of _ separating off what we need always stays the same, whereas from the left it varies.
How about grep with a lookahead assertion?
$ s1=dog_ID1_a000_b111_c222_d333
$ s2=cat_a000_b111_c222_d333
$ grep -oP ".+(?=_\w+_\w+_\w+_\w+)" <<<$s1
dog_ID1
$ grep -oP ".+(?=_\w+_\w+_\w+_\w+)" <<<$s2
cat
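A pure-bash alternative, offered as a sketch rather than a definitive fix: since the number of trailing fields is fixed at four, the `%` parameter expansion (which removes the shortest matching suffix) can strip them without calling grep. The function name `strip_last_four` is hypothetical and not part of the original script, and the sketch assumes the name really does end in exactly four `_`-separated fields that should be dropped.

```shell
#!/usr/bin/env bash
# strip_last_four is a hypothetical helper name, not from the original script.
# It removes the last four underscore-separated fields from its argument.
strip_last_four() {
    local name=$1
    # % removes the shortest suffix matching the pattern; the pattern
    # _*_*_*_* needs four underscores, so exactly four fields are dropped.
    printf '%s\n' "${name%_*_*_*_*}"
}

strip_last_four dog_ID1_a000_b111_c222_d333   # prints dog_ID1
strip_last_four cat_a000_b111_c222_d333       # prints cat
```

Inside the loop from the question, this could presumably be applied to the result of `basename "${f%%_R1*}"` before passing it to -o, so that the output directory is named after only the leading part of the file name.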