Parallelizing a python script with a python wrapper
I have a python script heavy_lifting.py that I have parallelized using GNU Parallel, called from a bash wrapper script wrapper.sh. I use this to process fastq-formatted files; see example.fastq below. While this works, it is inelegant to require two interpreters and two sets of dependencies. I would like to rewrite the bash wrapper script in python while achieving the same parallelization.
example.fastq

This is an example of an input file that needs to be processed. The input file is often very long (~500,000,000 lines).
@SRR6750041.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR6750041.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
@SRR6750041.3 3/1
ATCCANAATGATGTGTTGCTCTGGAGGTACAGAGATAACGTCAGCTGGAATAGTTTCCCCTCACAG
+
AAAAA#EE6E6EEEEEE6EEEEAEEEEEEEEEEE//EAEEEEEAAEAEEEAE/EAEEA6/EEA<E/
@SRR6750041.4 4/1
ACACCNAATGCTCTGGCCTCTCAAGCACGTGGATTATGCCAGAGAGGCCAGAGCATTCTTCGTACA
+
/AAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE/E/<//AEA/EA//E//
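As the example shows, a fastq record always spans exactly four lines (header, sequence, `+` separator, quality string), which is why any splitting scheme must keep chunk boundaries on multiples of four lines. A quick illustrative sketch of that record structure (the `read_records` helper and the truncated sample data are hypothetical, not part of the scripts below):

```python
def read_records(lines):
    """Group a fastq file's lines into 4-line records: header, sequence, '+', quality."""
    return [lines[i:i + 4] for i in range(0, len(lines), 4)]

# Two truncated records in the same shape as the example above
sample = [
    "@SRR6750041.1 1/1", "CTGGA", "+", "AAAAA",
    "@SRR6750041.2 2/1", "CTATA", "+", "AAAAA",
]
records = read_records(sample)
print(len(records))      # 2
print(records[1][0])     # @SRR6750041.2 2/1
```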
Below are minimal reproducible examples of the scripts I am starting out with.
heavy_lifting.py
#!/usr/bin/env python
import argparse
# Read in arguments
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--inputFastq', required=True, help='forward .fastq')
parser.add_argument('-o', '--outputFastq', required=True, help='output .fastq')
args = parser.parse_args()
# Iterate through input file and append to output file
with open(args.inputFastq, "r") as infile:
    with open(args.outputFastq, "a") as outfile:
        for line in infile:
            outfile.write("modified" + line)
wrapper.sh
#!/bin/bash
NUMCORES="4"
FASTQ_F="./fastq_F.fastq"
# split the input fastq for parallel processing. One split fastq file will be created for each core available.
split --number="l/$NUMCORES" $FASTQ_F split_fastq_F_
# Feed split fastq files to GNU Parallel to invoke parallel executions of `heavy_lifting.py`
ls split_fastq_F* | awk -F "split_fastq_F" '{print $2}' | parallel "python heavy_lifting.py -i split_fastq_F{} -o output.fastq"
#remove intermediate split fastq files
rm split_fastq_*
To execute these scripts I use the command bash wrapper.sh. You can see that a results file output.fastq is created and contains a modified fastq file.
Below is my attempt to invoke parallel processing using a python wrapper, wrapper.py.
wrapper.py
#!/usr/bin/env python
import heavy_lifting
from joblib import Parallel, delayed
import multiprocessing
numcores = 4
fastq_F = "fastq_F.fastq"
#Create some logic to split the input fastq file into chunks for parallel processing.
# Get input fastq file dimensions
with open(fastq_F, "r") as infile:
    length_fastq = len(infile.readlines())
    print(length_fastq)
    lines = infile.readlines()

split_size = length_fastq / numcores
print(split_size)
# Iterate through input fastq file writing lines to outfile in bins.
counter = 0
split_counter = 0
split_fastq_list = []
with open(fastq_F, "r") as infile:
    for line in infile:
        if counter == 0:
            filename = str("./split_fastq_F_" + str(split_counter))
            split_fastq_list.append(filename)
            outfile = open(filename, "a")
            counter += 1
        elif counter <= split_size:
            outfile.write(line.strip())
            counter += 1
        else:
            counter = 0
            split_counter += 1
            outfile.close()
Parallel(n_jobs=numcores)(delayed(heavy_lifting)(i, "output.fastq") for i in split_fastq_list)
I seem to be most confused about how to properly feed the input arguments into the invocation of Parallel in the python wrapper.py script. Any help is much appreciated!
Parallel expects a function's name, not a file/module name.

So in heavy_lifting you have to put the code in a function (with parameters instead of args):
def my_function(inputFastq, outputFastq):
    with open(inputFastq, "r") as infile:
        with open(outputFastq, "a") as outfile:
            for line in infile:
                outfile.write("modified" + line)
And then you can use:
Parallel(n_jobs=numcores)(delayed(heavy_lifting.my_function)(i, "output.fastq") for i in split_fastq_list)
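To make the pattern concrete, here is a self-contained sketch of the same call shape (requires joblib; `add_suffix` and the input list are made up for the example). `delayed` wraps the function object itself, and the call arguments follow in parentheses:

```python
from joblib import Parallel, delayed

def add_suffix(text):
    # Stand-in for heavy_lifting.my_function: any top-level callable works
    return text + "_done"

inputs = ["a", "b", "c"]
# delayed(add_suffix) wraps the function; (x) supplies its arguments.
# Results come back in the same order as the inputs.
results = Parallel(n_jobs=2)(delayed(add_suffix)(x) for x in inputs)
print(results)  # ['a_done', 'b_done', 'c_done']
```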
This should be a comment, because it does not answer the question, but it is too big for a comment.

All of wrapper.sh can be written as:
parallel -a ./fastq_F.fastq --recstart @SRR --block -1 --pipepart --cat "python heavy_lifting.py -i {} -o output.fastq"
If heavy_lifting.py only reads the file and does not seek, this should work, too, and will require less disk I/O (the temporary file is replaced with a fifo):
parallel -a ./fastq_F.fastq --recstart @SRR --block -1 --pipepart --fifo "python heavy_lifting.py -i {} -o output.fastq"
It will autodetect the number of CPU threads, split the fastq file at lines starting with @SRR into one chunk per CPU thread on the fly, and give each chunk to python.
If heavy_lifting.py reads from stdin when no -i is given, then this should work, too:
parallel -a ./fastq_F.fastq --recstart @SRR --block -1 --pipepart "python heavy_lifting.py -o output.fastq"
If heavy_lifting.py does not append a unique string to output.fastq, it will be overwritten. So it might be better to have GNU Parallel give each job a unique output name like output2.fastq:
parallel -a ./fastq_F.fastq --recstart @SRR --block -1 --pipepart "python heavy_lifting.py -o output{#}.fastq"
For a more general FASTQ parallel wrapper see: https://stackoverflow.com/a/41707920/363028
For reproducibility I implemented the answer provided by furas in the heavy_lifting.py and wrapper.py scripts. Additional edits were needed to make the code run, which is why I am providing the following.
heavy_lifting.py
#!/usr/bin/env python
import argparse

def heavy_lifting_fun(inputFastq, outputFastq):
    # Iterate through the input file and append modified lines to the output file
    outfile = open(outputFastq, "a")
    with open(inputFastq, "r") as infile:
        for line in infile:
            outfile.write("modified" + line.strip() + "\n")
    outfile.close()

if __name__ == '__main__':
    # Read in arguments so the script can still be run standalone
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--inputFastq', required=True, help='forward .fastq')
    parser.add_argument('-o', '--outputFastq', required=True, help='output .fastq')
    args = parser.parse_args()
    heavy_lifting_fun(args.inputFastq, args.outputFastq)
wrapper.py
#!/usr/bin/env python
import heavy_lifting
from joblib import Parallel, delayed
import multiprocessing
numcores = 4
fastq_F = "fastq_F.fastq"
#Create some logic to split the input fastq file into chunks for parallel processing.
# Get input fastq file dimensions
with open(fastq_F, "r") as infile:
    length_fastq = len(infile.readlines())
print(length_fastq)

# Integer division keeps split_size an int; then round it up to a
# multiple of 4 so no fastq record straddles two chunks.
split_size = length_fastq // numcores
while (split_size % 4 != 0):
    split_size += 1
print(split_size)
# Iterate through the input fastq file, writing lines to split files in bins.
counter = 0
split_counter = 0
split_fastq_list = []
with open(fastq_F, "r") as infile:
    for line in infile:
        if counter == 0:
            filename = str("./split_fastq_F_" + str(split_counter))
            split_fastq_list.append(filename)
            outfile = open(filename, "a")
            outfile.write(str(line.strip() + "\n"))
            counter += 1
        elif counter < split_size:
            outfile.write(str(line.strip() + "\n"))
            counter += 1
        else:
            # Chunk is full: start a new split file with the current line
            counter = 0
            split_counter += 1
            outfile.close()
            filename = str("./split_fastq_F_" + str(split_counter))
            split_fastq_list.append(filename)
            outfile = open(filename, "a")
            outfile.write(str(line.strip() + "\n"))
            counter += 1
outfile.close()

Parallel(n_jobs=numcores)(delayed(heavy_lifting.heavy_lifting_fun)(i, "output.fastq") for i in split_fastq_list)
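As a stdlib-only alternative to joblib, the same fan-out can be sketched with `concurrent.futures`. The `process_chunk` helper below is a stand-in for `heavy_lifting_fun`, and each worker writes to its own output file, which avoids interleaved appends when several workers share one file (the same caveat Ole Tange raises above). A thread pool is used here so the demo stays self-contained; for CPU-bound Python work you would switch to `ProcessPoolExecutor` behind an `if __name__ == '__main__'` guard, since threads share the GIL:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def process_chunk(in_path, out_path):
    # Stand-in for heavy_lifting_fun: prefix each line and write it out
    with open(in_path) as infile, open(out_path, "a") as outfile:
        for line in infile:
            outfile.write("modified" + line)
    return out_path

def run_parallel(chunk_paths, ncores=4):
    # One output file per chunk; concatenate afterwards if one file is needed
    jobs = [(c, c + ".out") for c in chunk_paths]
    with ThreadPoolExecutor(max_workers=ncores) as pool:
        # map preserves input order in the returned results
        return list(pool.map(lambda args: process_chunk(*args), jobs))

# Demo on two small temporary chunk files
tmpdir = tempfile.mkdtemp()
chunks = []
for i in range(2):
    path = os.path.join(tmpdir, "chunk_%d" % i)
    with open(path, "w") as f:
        f.write("@SRR6750041.%d\n" % i)
    chunks.append(path)

outputs = run_parallel(chunks)
print(open(outputs[0]).read())  # modified@SRR6750041.0
```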