
Process group of files in parallel then compute in series using slurm

I need to convert every file in a particular directory then compile the results into a single computation on a system using slurm. The work on each individual file takes about as long as the rest of the collective calculations. Therefore, I would like the individual conversions to happen simultaneously. Sequentially, this is what I need to do:

main.sh

#!/bin/bash
#SBATCH --account=millironx
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=4

find . -maxdepth 1 -name "*.input.txt" \
  -exec ./convert-files.sh {} \;

./compile-results.sh *.output.txt

./compute.sh

echo "All Done!"

convert-files.sh

#!/bin/bash
# Simulate a time-intensive process
INPUT="$1"
OUTPUT="${INPUT/input.txt/output.txt}"
sleep 10
date > "$OUTPUT"

While this system works, I generally process batches of 30+ files, and the computational time exceeds the time limit set by the administrator while only using one node. How can I process the files in parallel then compile and compute on them after they all have been completely processed?

What I've tried/considered

Adding srun to find -exec

find . -maxdepth 1 -name "*.input.txt" \
  -exec srun -n1 -N1 --exclusive ./convert-files.sh {} \;

find -exec waits for each command to finish, and srun is blocking, so this does exactly the same thing as the base code time-wise.
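One way to sidestep the blocking (a sketch only, not something from the question, and untested): launch each srun job step in the background inside the allocation and wait for all of them before the serial steps run.

# Sketch: one backgrounded job step per input file, assuming the same
# *.input.txt naming as above.
for f in ./*.input.txt; do
  srun -n1 -N1 --exclusive ./convert-files.sh "$f" &
done
wait  # returns only after every backgrounded srun step has finished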

Using sbatch in the submission script

find . -maxdepth 1 -name "*.input.txt" \
  -exec sbatch ./convert-files.sh {} \;

This does not wait for the conversions to finish before starting the computations, and they consequently fail.

Using GNU parallel

find . -maxdepth 1 -name "*.input.txt" | \
  parallel ./convert-files.sh

OR

find . -maxdepth 1 -name "*.input.txt" | \
  parallel srun -n1 -N1 --exclusive ./convert-files.sh

parallel can only "see" the number of CPUs on the current node, so it only processes four files at a time. Better, but still not what I'm looking for.
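If GNU parallel is kept, one possible workaround (untested here) is to set the number of job slots explicitly with -j instead of letting it default to the local CPU count, and let srun place each step somewhere in the allocation:

# Sketch: run up to $SLURM_NTASKS conversions at once; srun -n1 -N1
# --exclusive assigns each one to a free slot in the allocation.
find . -maxdepth 1 -name "*.input.txt" | \
  parallel -j "$SLURM_NTASKS" srun -n1 -N1 --exclusive ./convert-files.sh {}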

Using job arrays

This method sounds promising, but I can't figure out a way to make it work since the files I'm processing don't have a sequential number in their names.
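For the record, the array index does not have to come from the file names. A sketch of a common pattern (untested; files.list and convert-array.sh are hypothetical names): write the find output to a list file, submit one array task per line, and have each task look up its own line.

find . -maxdepth 1 -name "*.input.txt" > files.list
sbatch --array=1-"$(wc -l < files.list)" ./convert-array.sh

where convert-array.sh would be something like:

#!/bin/bash
#SBATCH --account=millironx
#SBATCH --cpus-per-task=4
# Pick the line of files.list belonging to this array task and convert it.
FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" files.list)
./convert-files.sh "$FILE"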

Submitting jobs separately using sbatch

At the terminal:

$ find . -maxdepth 1 -name "*.input.txt" \
>  -exec sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
>  ./convert-files.sh {} \;

Five hours later:

$ srun --account=millironx --time=30:00 --cpus-per-task=4 \
>   ./compile-results.sh *.output.txt & \
>   sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
>   ./compute.sh

This is the best strategy I've come up with so far, but it means I have to remember to check on the progress of the conversion batches and initiate the computation once they are complete.

Using sbatch with a dependency

At the terminal:

$ find . -maxdepth 1 -name "*.input.txt" \
>  -exec sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
>  ./convert-files.sh {} \;
Submitted job xxxx01
Submitted job xxxx02
...
Submitted job xxxx45
$ sbatch --account=millironx --time=30:00 --cpus-per-task=4 \
>   --dependency=after:xxxx45 --job-name=compile_results \
>   ./compile-results.sh *.output.txt & \
>   sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
>   --dependency=after:compile_results \
>   ./compute.sh

I haven't dared to try this yet, since I know that the last job is not guaranteed to be the last to finish.
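A sketch (untested) of how the dependency could cover every conversion job rather than only the last one submitted: collect each job ID with sbatch --parsable and chain them into a single afterok list.

# Collect all conversion job IDs as :id1:id2:...
JOBIDS=""
for f in ./*.input.txt; do
  JOBIDS="$JOBIDS:$(sbatch --parsable --account=millironx --time=05:00:00 \
    --cpus-per-task=4 ./convert-files.sh "$f")"
done

# Compile only after every conversion has finished successfully,
# then compute only after the compile job.
COMPILE=$(sbatch --parsable --dependency=afterok"$JOBIDS" \
  --account=millironx --time=30:00 --cpus-per-task=4 \
  ./compile-results.sh *.output.txt)
sbatch --dependency=afterok:"$COMPILE" \
  --account=millironx --time=05:00:00 --cpus-per-task=4 ./compute.sh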


This seems like it should be such an easy thing to do, but I haven't figured it out yet.

If your $SLURM_NODELIST contains something like node1,node2,node34, then this might work:

find ... | parallel -S $SLURM_NODELIST convert_files
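Note that Slurm often reports the node list in a compressed form such as node[01-04], which GNU parallel's -S option cannot use directly. A possible workaround (a sketch, untested, assuming password-less ssh between the allocated nodes and a shared filesystem) is to expand the list first with scontrol:

# Expand the compressed node list into comma-separated host names.
NODES=$(scontrol show hostnames "$SLURM_NODELIST" | paste -sd, -)
# --wd . keeps the remote working directory the same as the local one.
find . -maxdepth 1 -name "*.input.txt" | \
  parallel -S "$NODES" --wd . ./convert-files.sh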

The find . -maxdepth 1 -name "*.input.txt" | parallel srun -n1 -N1 --exclusive ./convert-files.sh way is probably the one to follow. But it seems ./convert-files.sh expects the filename as an argument, and you are trying to push it to stdin through the pipe. You need to use xargs, and since xargs can work in parallel, you do not need the parallel command.

Try:

find . -maxdepth 1 -name "*.input.txt" | xargs -L1 -P$SLURM_NTASKS srun -n1 -N1 --exclusive ./convert-files.sh

-L1 will split the result of find per line and feed it to convert-files.sh, spawning a maximum of $SLURM_NTASKS processes at a time, and 'sending' each of them to a CPU on the nodes allocated by Slurm thanks to srun -n1 -N1 --exclusive.
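Dropped into the submission script from the question, that might look like the following (a sketch, untested):

#!/bin/bash
#SBATCH --account=millironx
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=4

# One srun job step per input file, at most $SLURM_NTASKS in flight;
# xargs only returns once every step has exited.
find . -maxdepth 1 -name "*.input.txt" | \
  xargs -L1 -P"$SLURM_NTASKS" srun -n1 -N1 --exclusive ./convert-files.sh

# The serial part can then follow directly.
./compile-results.sh *.output.txt
./compute.sh

echo "All Done!"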
