Process a group of files in parallel, then compute in series using Slurm

Question

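I need to convert every file in a specific directory and then compile the results into a single computation on a system that uses Slurm. The work on each individual file takes about as long as the rest of the collective computation, so I would like the individual conversions to happen simultaneously. Sequentially, this is what I need to do: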

main.sh

#!/bin/bash
#SBATCH --account=millironx
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=4

find . -maxdepth 1 -name "*.input.txt" \
  -exec ./convert-files.sh {} \;

./compile-results.sh *.output.txt

./compute.sh

echo "All Done!"

convert-files.sh

#!/bin/bash
# Simulate a time-intensive process
INPUT="$1"
# Swap the input.txt suffix for output.txt
OUTPUT="${INPUT/input.txt/output.txt}"
sleep 10
date > "$OUTPUT"

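While this setup works, I generally process batches of 30 or more files, and on a single node the computation time exceeds the time limit set by the administrator. How can I process the files in parallel, and then compile and compute on them once they have all been processed?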

What I've tried/considered

Adding srun to find -exec

find . -maxdepth 1 -name "*.input.txt" \
  -exec srun -n1 -N1 --exclusive ./convert-files.sh {} \;

find -exec waits for each process to finish, and srun is blocking, so this takes exactly as long as the base code.
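One common way around the blocking behaviour is to put each launch in the background and then call wait, so the script only moves on once every conversion has exited. The following is a minimal sketch under assumptions: LAUNCHER and convert_all are hypothetical names introduced here for illustration, and LAUNCHER is empty by default so the function can be tried outside a Slurm allocation.

```shell
#!/bin/bash
# Sketch: start every conversion in the background, then wait for all of them.
# LAUNCHER is a hypothetical variable. Inside a Slurm allocation it could hold
# "srun -n1 -N1 --exclusive"; left empty, the converter runs directly.
LAUNCHER=${LAUNCHER:-}

convert_all() {
  local converter="$1"
  shift
  local file
  for file in "$@"; do
    # $LAUNCHER is deliberately unquoted: it holds a command plus its flags.
    $LAUNCHER "$converter" "$file" &
  done
  # Block until every background conversion has exited.
  wait
}
```

In the batch script, with LAUNCHER set to the srun command line, this might be called as convert_all ./convert-files.sh ./*.input.txt, followed by the compile and compute steps.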

Using sbatch in the submission script

find . -maxdepth 1 -name "*.input.txt" \
  -exec sbatch ./convert-files.sh {} \;

This does not wait for the conversions to finish before starting the computation, which fails as a result.

Using GNU parallel

find . -maxdepth 1 -name "*.input.txt" | \
  parallel ./convert-files.sh

or

find . -maxdepth 1 -name "*.input.txt" | \
  parallel srun -n1 -N1 --exclusive ./convert-files.sh

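parallel can only "see" the number of CPUs on the current node, so it processes only four files at a time. Better, but still not what I'm looking for.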

Using job arrays

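This method sounds promising, but I can't find a way to make it work, since the files I'm processing don't have a sequential number in their names.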
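File names do not need to contain a number for a job array to work: the array index can select a line from a sorted file listing. The sketch below illustrates that idea under assumptions. nth_input is a hypothetical helper, not part of Slurm, and the approach assumes that file names contain no newlines and that the directory contents do not change between array tasks.

```shell
#!/bin/bash
# Sketch: map a numeric array index to a file name via a sorted listing.
# nth_input is a hypothetical helper introduced for illustration.
nth_input() {
  local dir="$1" index="$2"
  # Sort for a stable order, then print line index+1 (array indices start at 0).
  find "$dir" -maxdepth 1 -name "*.input.txt" | sort | sed -n "$((index + 1))p"
}

# Slurm sets SLURM_ARRAY_TASK_ID inside an array job; outside one, do nothing.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
  ./convert-files.sh "$(nth_input . "$SLURM_ARRAY_TASK_ID")"
fi
```

With N input files, a script like this would be submitted with sbatch --array=0-(N-1), so that each array task converts exactly one file.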

Submitting jobs separately using sbatch

At the terminal:

$ find . -maxdepth 1 -name "*.input.txt" \
>  -exec sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
>  ./convert-files.sh {} \;

Five hours later:

$ srun --account=millironx --time=30:00 --cpus-per-task=4 \
>   ./compile-results.sh *.output.txt & \
>   sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
>   ./compute.sh

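This is the best strategy I've come up with so far, but it means I have to remember to check on the progress of the conversion batch and start the computation as soon as the conversions are complete.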

Using sbatch with a dependency

At the terminal:

$ find . -maxdepth 1 -name "*.input.txt" \
>  -exec sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
>  ./convert-files.sh {} \;
Submitted job xxxx01
Submitted job xxxx02
...
Submitted job xxxx45
$ sbatch --account=millironx --time=30:00 --cpus-per-task=4 \
>   --dependency=after:xxxx45 --job-name=compile_results \
>   ./compile-results.sh *.output.txt & \
>   sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
>   --dependency=after:compile_results \
>   ./compute.sh

I haven't dared to try this yet, since I know that the last job submitted is not guaranteed to be the last to finish.
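The ordering concern can be sidestepped by depending on every conversion job rather than only the last one. A dependency of type after is satisfied as soon as the listed jobs have started, whereas afterok waits for them to finish successfully, and a dependency list takes job IDs rather than job names. The following is a sketch under assumptions: dependency_for and submit_pipeline are hypothetical helpers, the account and time flags are omitted for brevity, and a single-cluster setup is assumed, where sbatch --parsable prints just the job ID.

```shell
#!/bin/bash
# Sketch: build an "afterok" dependency string from a list of job IDs.
# dependency_for is a hypothetical helper, not part of Slurm.
dependency_for() {
  local spec="afterok" id
  for id in "$@"; do
    spec="$spec:$id"
  done
  printf '%s\n' "$spec"
}

# submit_pipeline is also hypothetical; it is only defined here, never called.
submit_pipeline() {
  local ids="" file compile_id
  for file in ./*.input.txt; do
    # --parsable makes sbatch print the job ID in a machine-readable form.
    ids="$ids $(sbatch --parsable ./convert-files.sh "$file")"
  done
  # $ids is deliberately unquoted so each ID becomes its own argument.
  # The quoted glob inside --wrap expands when the job runs, not at submission.
  compile_id=$(sbatch --parsable --dependency="$(dependency_for $ids)" \
    --wrap './compile-results.sh *.output.txt')
  sbatch --dependency="afterok:$compile_id" ./compute.sh
}
```

Wrapping the compile step with --wrap also keeps *.output.txt from being expanded on the command line before any output file exists.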

This seems like it should be an easy thing to do, but I haven't figured it out yet.

Answer 1

If your $SLURM_NODELIST contains something like node1,node2,node34, then this might work:

find ... | parallel -S $SLURM_NODELIST convert_files

Answer 2

find . -maxdepth 1 -name "*.input.txt" | parallel srun -n1 -N1 --exclusive ./convert-files.sh is probably the approach to follow. However, it seems that ./convert-files.sh expects the file name as an argument, and you are trying to push it to stdin through the pipe. You need to use xargs, and since xargs can run commands in parallel, you don't need the parallel command.

Try:

find . -maxdepth 1 -name "*.input.txt" | xargs -L1 -P$SLURM_NTASKS srun -n1 -N1 --exclusive ./convert-files.sh

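-L1 splits the output of find by line and feeds each line to ./convert-files.sh, while -P$SLURM_NTASKS spawns at most $SLURM_NTASKS processes at a time. srun -n1 -N1 --exclusive then "sends" each of them to a CPU on the nodes allocated by Slurm.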
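If file names may contain spaces or other unusual characters, a NUL-delimited variant of the same pipeline is more robust. This is a sketch under assumptions rather than part of the original answer: convert_parallel is a hypothetical helper, and it relies on the -print0, -0 and -P extensions found in GNU and BSD find and xargs.

```shell
#!/bin/bash
# Sketch: the same find | xargs idea, NUL-delimited so odd file names survive.
# convert_parallel is a hypothetical helper: the first argument is the
# directory, the second the maximum number of simultaneous processes, and the
# remaining arguments form the command to run once per file.
convert_parallel() {
  local dir="$1" jobs="$2"
  shift 2
  find "$dir" -maxdepth 1 -name "*.input.txt" -print0 |
    xargs -0 -n1 -P "$jobs" "$@"
}
```

Inside the batch script this might be called as convert_parallel . "$SLURM_NTASKS" srun -n1 -N1 --exclusive ./convert-files.sh.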

2 answers

Answer 1
1 2019-05-14 05:11:34

Answer 2
1 (accepted) 2019-05-15 14:28:50
