bash: looping over the files with extra conditions

The working directory contains files that fall into several groups according to their file names (everything before the trailing index). Here is an example with 4 groups:

# group 1 has 5 files
NpXynWT_apo_300K_1.pdb
NpXynWT_apo_300K_2.pdb
NpXynWT_apo_300K_3.pdb
NpXynWT_apo_300K_4.pdb
NpXynWT_apo_300K_5.pdb
# group 2 has two files
NpXynWT_apo_340K_1.pdb
NpXynWT_apo_340K_2.pdb
# group 3 has 4 files
NpXynWT_com_300K_1.pdb
NpXynWT_com_300K_2.pdb
NpXynWT_com_300K_3.pdb
NpXynWT_com_300K_4.pdb
# group 4 has 1 file
NpXynWT_com_340K_1.pdb

I have written a simple bash workflow to

  1. pre-process each file with sed (insert a line into each file)
  2. cat together the pre-processed files that belong to the same group

Here is my script implementing this workflow. I create an array with the names of the groups and, for each group, loop over the file index from 1 to 5:

# list of 4 groups
systems=(NpXynWT_apo_300K NpXynWT_apo_340K NpXynWT_com_300K NpXynWT_com_340K)

# loop over the groups
for model in "${systems[@]}"; do
    # loop over the files inside each group
    for i in {0001..0005}; do
        # edit the file via sed
        sed -i "1 i\This is $i file of the group" "${pdbs}"/"${model}"_"$i"_FA.pdb
    done
    # after editing, cat the pre-processed files
    cat "${pdbs}"/"${model}"_[1-5]_FA.pdb > "${output}/${model}.pdb"
done

My question about improving this script: how can I add a check inside the inner loop (e.g. with an if statement) so that only existing files are considered? In my example the script always loops over 5 files for every group, following the largest group (here the first group, with 5 files):

for i in {0001..0005}; do

I would rather loop over all existing files of a given group and break out of the loop as soon as a file does not exist (e.g. for the 4th group, which has only 1 file). Here is an attempt, which however does not work properly:

# loop over the groups, checking for the presence of each file
for model in "${systems[@]}"; do
    i="0"
    # loop over the files inside each group
    for i in {0001..9999}; do
        if [ ! -f "${pdbs}/${model}_00${i}_FA.pdb" ]; then
            echo 'File '${pdbs}/${model}_00${i}_FA.pdb' does not exist!'
            break
        else
            # edit the file via sed
            sed -i "1 i\This is $i file of the group" "${pdbs}"/"${model}"_00"$i"_FA.pdb
            i=$[$i+1]
        fi
    done
done

Would it be possible to loop over however many files actually exist in a group, rather than capping the loop at some very large number of files, as in the following line?

for i in {0001..9999}; do

  1. You can check whether a file exists with the -f test, and break out of the loop if it doesn't:

     if [ ! -f "${pdbs}/${model}_${i}_FA.pdb" ]; then break; fi
  2. Your existing cat command already picks up only the files that exist in each group, because in "${pdbs}"/"${model}"_[1-5]_FA.pdb bash performs filename expansion; it does not simply expand [1-5] to all possible values. You can see this in the following example:

     > touch f1 f2 f5    # files f3 and f4 do not exist
     > echo f[1-5]
     f1 f2 f5

    Notice that f[1-5] did not expand to f1 f2 f3 f4 f5.
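
The two points above can be combined into a single loop: let the glob list the candidates and keep the -f test as a guard. The following is only a sketch under assumptions: the scratch directory, the empty demo files and the counts array are made up for illustration, and the sed call from the question is left as a comment.

```bash
#!/usr/bin/env bash
# Sketch: visit only the files that actually exist in each group.
# The scratch directory and its empty files are assumptions made for this demo.
pdbs=$(mktemp -d)
touch "$pdbs"/NpXynWT_apo_300K_{1..5}_FA.pdb "$pdbs"/NpXynWT_com_340K_1_FA.pdb

systems=(NpXynWT_apo_300K NpXynWT_com_300K NpXynWT_com_340K)
counts=()   # number of files found per group, in the order of systems

for model in "${systems[@]}"; do
    n=0
    for f in "$pdbs/${model}"_[1-5]_FA.pdb; do
        # An unmatched glob stays a literal string, so skip anything that is not a file.
        [ -f "$f" ] || continue
        # the sed -i call from the question would go here, applied to "$f"
        n=$((n + 1))
    done
    counts+=("$n")
    echo "$model: $n file(s)"
done

rm -rf "$pdbs"
```

The middle group has no files at all in this demo, so its glob is left unexpanded and the -f guard skips it.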

Update:

If you want your glob expression to match files ending in numbers bigger than 9, the [1-n] syntax will not work. The reason is that the [...] syntax defines a pattern that matches a single character. For instance, the expression foo[1-9] will match files foo1 through foo9, but not foo10 or foo99.

Doing something like foo[1-99] does not work, because it doesn't mean what you might think it means. The inside of the [] can contain any number of individual characters or ranges of characters. For example, [1-9a-nxyz] would match any character from '1' through '9', from 'a' through 'n', or any of the characters 'x', 'y', or 'z', but it would not match '0', 'q', 'r', etc. Nor would it match any uppercase letters.

So [1-99] is not interpreted as the range of numbers from 1 to 99; it is interpreted as the set of characters made up of the range from '1' to '9' plus the individual character '9'. Therefore the patterns [1-9] and [1-99] are equivalent, and will only match the characters '1' through '9'. The second 9 in the latter expression is redundant.
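
A quick way to convince yourself of this is to run both patterns against the same files. This is a minimal sketch; the scratch directory and the foo* file names are assumptions made for the demo.

```bash
#!/usr/bin/env bash
# Scratch directory with sample files; the names are assumptions for this demo.
demo=$(mktemp -d)
touch "$demo"/foo1 "$demo"/foo9 "$demo"/foo10 "$demo"/foo99

single=( "$demo"/foo[1-9] )     # exactly one character in the range 1-9
double=( "$demo"/foo[1-99] )    # the same set: range 1-9 plus a redundant 9

# Strip the directory part before printing.
echo "foo[1-9]  -> ${single[*]##*/}"
echo "foo[1-99] -> ${double[*]##*/}"

rm -rf "$demo"
```

Both lines should list only foo1 and foo9, since neither pattern can match a two-digit suffix.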

However, you can still achieve what you want with extended globs, which you can enable with the command shopt -s extglob:

> touch f1 f2 f5 f99 f100000 f129828523
> echo f[1-99999999999]       # Doesn't work like you want it to
f1 f2 f5
> shopt -s extglob
> echo f+([0-9])
f1 f2 f5 f99 f100000 f129828523

The +([0-9]) expression is an extended glob expression composed of two parts: the [0-9], whose meaning should be obvious at this point, and the enclosing +(...).

The +(pattern) syntax is an extglob expression that means "match one or more instances of pattern". In this case, our pattern is [0-9], so the extglob expression +([0-9]) matches any string of digits 0-9.

However, you should note that this means it also matches things like 000000000. If you are only interested in numbers greater than or equal to 1, you would instead do (with extglob enabled):

> echo f[1-9]*([0-9])

Note the *(pattern) here instead of +(pattern). The * means "match zero or more instances of pattern", which is what we want because we've already matched the first digit with [1-9]. For instance, f[1-9]+([0-9]) does not match the filename f1, because it requires at least two digits.
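
The difference between the three patterns is easier to see side by side. The sketch below matches plain strings instead of files, so it needs no scratch directory; the matching helper and the sample names are hypothetical and exist only for this example.

```bash
#!/usr/bin/env bash
shopt -s extglob   # makes +(...) and *(...) available in patterns

# Hypothetical helper: print the names that match the pattern given in $1.
matching() {
    local pattern=$1 name
    local hits=()
    shift
    for name in "$@"; do
        # $pattern is left unquoted on purpose so that it is treated as a pattern
        if [[ $name == $pattern ]]; then hits+=("$name"); fi
    done
    echo "${hits[*]}"
}

names=(f0 f1 f007 f10 f99 fx)
plus=$(matching 'f+([0-9])' "${names[@]}")          # one or more digits
star=$(matching 'f[1-9]*([0-9])' "${names[@]}")     # 1-9, then zero or more digits
strict=$(matching 'f[1-9]+([0-9])' "${names[@]}")   # 1-9, then at least one more digit

printf '%-16s %s\n' 'f+([0-9])' "$plus" 'f[1-9]*([0-9])' "$star" 'f[1-9]+([0-9])' "$strict"
```

Only the second pattern keeps f1 while rejecting f0 and f007.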

You may not want to leave extglob enabled in your whole script, particularly if you have any regular glob expression elsewhere in your script that might accidentally be interpreted as an extglob expression. To disable extglob when you're done with it, do:

shopt -u extglob
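
If other code may already rely on extglob being enabled, switching it off unconditionally could surprise that code. One possible approach, sketched here with hypothetical variable names, is to record the previous state with shopt -q and restore it afterwards.

```bash
#!/usr/bin/env bash
# Remember whether extglob was already on, so it can be put back afterwards.
if shopt -q extglob; then extglob_was_on=yes; else extglob_was_on=no; fi

shopt -s extglob
inside="no match"
if [[ 2024 == +([0-9]) ]]; then inside="digits matched"; fi

# Restore the previous state instead of unconditionally switching it off.
[ "$extglob_was_on" = yes ] || shopt -u extglob

if shopt -q extglob; then after=on; else after=off; fi
echo "inside: $inside; extglob was on before: $extglob_was_on; now: $after"
```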

There's one other important thing to note here. If a glob pattern doesn't match any files, then it is interpreted as a raw string, and is left unmodified.

For example:

> echo This_file_totally_does_not_exist*
This_file_totally_does_not_exist*

Or, more to the point in your case, suppose there are zero files in your 4th group, i.e. there are no files whose names contain NpXynWT_com_340K. In this case, if you try to use a glob containing NpXynWT_com_340K, you get the entire glob back as a literal string:

> shopt -s extglob
> echo NpXynWT_com_340K_[1-9]*([0-9])
NpXynWT_com_340K_[1-9]*([0-9])

This is obviously not what you want, especially in the middle of your script where you are trying to cat the matching files. Luckily there is another option you can set to make non-matching globs expand to nothing:

> shopt -s nullglob
> echo This_file_totally_does_not_exist*   # prints nothing

As with extglob, there may be unintended behavior elsewhere in your script if you leave nullglob on.
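
Putting the pieces together, the concatenation step from the question could look roughly like the sketch below. It is not a drop-in replacement: the temporary directories and the one-line demo files are assumptions, and the sed pre-processing step is omitted.

```bash
#!/usr/bin/env bash
# Sketch: concatenate however many numbered files each group has.
# Directory layout and file contents are assumptions made for this demo.
shopt -s extglob nullglob   # on its own line, before any extglob pattern is parsed

pdbs=$(mktemp -d)
output=$(mktemp -d)
for n in 1 2 10; do echo "apo300 $n" > "$pdbs/NpXynWT_apo_300K_${n}_FA.pdb"; done
echo "com340 1" > "$pdbs/NpXynWT_com_340K_1_FA.pdb"

systems=(NpXynWT_apo_300K NpXynWT_com_300K NpXynWT_com_340K)
for model in "${systems[@]}"; do
    # With nullglob, a group without files yields an empty array.
    files=( "$pdbs/${model}"_[1-9]*([0-9])_FA.pdb )
    if (( ${#files[@]} == 0 )); then
        echo "no files for $model, skipping"
        continue
    fi
    cat "${files[@]}" > "$output/${model}.pdb"
    echo "$model: merged ${#files[@]} file(s)"
done

shopt -u extglob nullglob
# The temporary directories are left in place so the result can be inspected.
```

Checking the array length matters here: with nullglob on, an empty expansion would otherwise leave cat with no arguments, and it would wait for standard input. Note also that glob results are sorted lexically, so a file numbered 10 is concatenated before the one numbered 2; if the order matters, the file list would need to be sorted numerically first.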
