How to parse out specific filenames from basename in bash

Question

I'm working on the following script for a research project at my school:

for f in $(ls Illumina_Data/Aphyllon/PE150_2016_04_05*); do
    # Only act on the forward (R1) reads; the R2 name is derived from it
    if [[ "${f}" == *"_R1"* ]]; then
        echo "INITIALIZE THE SEQUENCE"
        echo "$(basename "${f%%_R1*}")"
        get_organelle_from_reads.py -1 "${f%%_R1*}_R1_001.fastq.gz" \
          -2 "${f%%_R1*}_R2_001.fastq.gz" \
          -o "Sequenced_Aphyllon_Data/$(basename "${f%%_R1*}")" \
          -R 15 -k 21,45,65,85,105 -F embplant_pt
    fi
done

The output name this script produces is rather long, and we'd like it to be shorter for the sake of organization. Look at the -o option, specifically the Sequenced_Aphyllon_Data/ path built from basename "${f%%_R1*}". It yields the full name of the original fastq file (up to _R1), which has the following format:

A_speciesname_IDtag_(some set of numbers and letters)_(some set of numbers and letters)_(some set of numbers and letters)_(some set of numbers and letters)

We want to keep only the A_speciesname_IDtag part. The problem is that some of our reads don't contain the IDtag section, so the cut has to happen at either the second or the third _ from the left. Counted from the right, however, there are always four _ without fail.

So is there a way to specifically target an _ counted from the right of a string? The number of _ after the part we need is always the same when counted from the right, but it varies when counted from the left.

Answer 1

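How about grep with a lookahead assertion? (The -P option needs GNU grep.) The lookahead requires four _-delimited fields to follow, and the greedy .+ keeps everything in front of them: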

$ s1=dog_ID1_a000_b111_c222_d333
$ s2=cat_a000_b111_c222_d333
$ grep -oP ".+(?=_\w+_\w+_\w+_\w+)" <<<$s1
dog_ID1
$ grep -oP ".+(?=_\w+_\w+_\w+_\w+)" <<<$s2
cat
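
A pure-bash alternative, as a minimal sketch rather than anything from the original answer: the % parameter expansion removes the shortest matching suffix, so the pattern _*_*_*_* strips exactly the last four _-separated fields. The function name strip_last_four is hypothetical and only used for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original thread): drop the last four
# underscore-separated fields from a name using parameter expansion.
strip_last_four() {
    local name=$1
    # % removes the shortest suffix matching _*_*_*_*, i.e. everything from
    # the fourth underscore (counted from the right) to the end of the string.
    printf '%s\n' "${name%_*_*_*_*}"
}

strip_last_four dog_ID1_a000_b111_c222_d333   # prints dog_ID1
strip_last_four cat_a000_b111_c222_d333       # prints cat
```

In the loop from the question, the same expansion could be applied to the result of basename "${f%%_R1*}" before passing it to -o. A name with fewer than four _ is returned unchanged, because the pattern does not match.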
