Parallelizing python script with a python wrapper

I have a Python script heavy_lifting.py that I have parallelized using GNU Parallel, called from a bash wrapper script wrapper.sh. I use this to process fastq-formatted files; see example.fastq below. While this works, it is inelegant to require two interpreters and two sets of dependencies. I would like to rewrite the bash wrapper script in Python while achieving the same parallelization.

example.fastq is an example of the input file that needs to be processed. This input file is often very long (~500,000,000 lines).

@SRR6750041.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR6750041.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
@SRR6750041.3 3/1
ATCCANAATGATGTGTTGCTCTGGAGGTACAGAGATAACGTCAGCTGGAATAGTTTCCCCTCACAG
+
AAAAA#EE6E6EEEEEE6EEEEAEEEEEEEEEEE//EAEEEEEAAEAEEEAE/EAEEA6/EEA<E/
@SRR6750041.4 4/1
ACACCNAATGCTCTGGCCTCTCAAGCACGTGGATTATGCCAGAGAGGCCAGAGCATTCTTCGTACA
+
/AAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE/E/<//AEA/EA//E//

Below are minimal reproducible examples of the scripts I am starting out with.

heavy_lifting.py

#!/usr/bin/env python
import argparse

# Read in arguments
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--inputFastq', required=True, help='forward .fastq')
parser.add_argument('-o', '--outputFastq', required=True, help='output .fastq')
args = parser.parse_args()

# Iterate through input file and append to output file
with open(args.inputFastq, "r") as infile:
    with open(args.outputFastq, "a") as outfile:
        for line in infile:
            outfile.write("modified" + line)
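For reference, run on its own the script is invoked like this; it is the same command that wrapper.sh hands to GNU Parallel, just with the example file names from above:

python heavy_lifting.py -i example.fastq -o output.fastq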

wrapper.sh

#!/bin/bash

NUMCORES="4"
FASTQ_F="./fastq_F.fastq"

# split the input fastq for parallel processing. One split fastq file will be created for each core available.
split --number="l/$NUMCORES" $FASTQ_F split_fastq_F_

# Feed split fastq files to GNU Parallel to invoke parallel executions of `heavy_lifting.py`
ls split_fastq_F* | awk -F "split_fastq_F" '{print $2}' | parallel "python  heavy_lifting.py -i split_fastq_F{} -o output.fastq"

#remove intermediate split fastq files
rm split_fastq_*

To execute these scripts I use the command bash wrapper.sh. You can see that a results file output.fastq is created and contains a modified fastq file.
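For reference, given the example.fastq above, the records written to output.fastq look like the following, each line prefixed with "modified" (the relative order of chunks in the output depends on when each parallel job finishes appending):

modified@SRR6750041.1 1/1
modifiedCTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
modified+
modifiedAAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6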

Below is my attempt to invoke parallel processing using a python wrapper wrapper.py.

wrapper.py

#!/usr/bin/env python

import heavy_lifting
from joblib import Parallel, delayed
import multiprocessing

numcores = 4
fastq_F = "fastq_F.fastq"

#Create some logic to split the input fastq file into chunks for parallel processing.  

# Get input fastq file dimensions
with open(fastq_F, "r") as infile:
    length_fastq = len(infile.readlines())
    print(length_fastq)
    lines = infile.readlines()
    split_size = length_fastq / numcores
    print(split_size)

# Iterate through input fastq file writing lines to outfile in bins.
counter = 0
split_counter = 0
split_fastq_list = []
with open(fastq_F, "r") as infile:
    for line in infile:
        if counter == 0:
            filename = str("./split_fastq_F_" + str(split_counter))
            split_fastq_list.append(filename)
            outfile = open(filename, "a")
            counter += 1
        elif counter <= split_size:
            outfile.write(line.strip())
            counter += 1
        else:
            counter = 0
            split_counter += 1
            outfile.close()


Parallel(n_jobs=numcores)(delayed(heavy_lifting)(i, "output.fastq") for i in split_fastq_list)

EDITED to improve reproducibility of wrapper.py

I seem to be most confused about how to properly feed the input arguments into the invocation of "Parallel" in the python wrapper.py script. Any help is much appreciated!

Parallel expects a function's name, not a file/module name.

So in heavy_lifting you have to put the code in a function (with arguments instead of args):

def my_function(inputFastq, outputFastq):

    with open(inputFastq, "r") as infile:
        with open(outputFastq, "a") as outfile:
            for line in infile:
                outfile.write("modified" + line)

And then you can use:

Parallel(n_jobs=numcores)(delayed(heavy_lifting.my_function)(i, "output.fastq") for i in split_fastq_list)
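As an aside, wrapper.py imports multiprocessing without using it; the same fan-out can also be expressed with the standard library instead of joblib. This is only a sketch, assuming heavy_lifting.my_function as defined above and hypothetical chunk file names:

import multiprocessing

import heavy_lifting

if __name__ == '__main__':
    # Hypothetical list of split fastq chunks produced earlier by the wrapper
    split_fastq_list = ["./split_fastq_F_0", "./split_fastq_F_1"]

    with multiprocessing.Pool(processes=4) as pool:
        # Each worker appends its processed chunk to output.fastq,
        # mirroring the joblib call above
        pool.starmap(heavy_lifting.my_function,
                     [(chunk, "output.fastq") for chunk in split_fastq_list])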

This should be a comment, because it does not answer the question, but it is too big.

All of wrapper.sh can be written as:

parallel -a ./fastq_F.fastq --recstart @SRR --block -1 --pipepart --cat "python  heavy_lifting.py -i {} -o output.fastq"

If heavy_lifting.py only reads the file and does not seek, this should work, too, and will require less disk I/O (the temporary file is replaced with a fifo):

parallel -a ./fastq_F.fastq --recstart @SRR --block -1 --pipepart --fifo "python  heavy_lifting.py -i {} -o output.fastq"

It will autodetect the number of CPU threads, split the fastq file at lines that start with @SRR into one chunk per CPU thread on the fly, and give each chunk to python.

If heavy_lifting.py reads from stdin when no -i is given, then this should work, too:

parallel -a ./fastq_F.fastq --recstart @SRR --block -1 --pipepart "python heavy_lifting.py -o output.fastq"

If heavy_lifting.py does not append a unique string to output.fastq, it will be overwritten. So it might be better to have GNU Parallel give it a unique name like output2.fastq:

parallel -a ./fastq_F.fastq --recstart @SRR --block -1 --pipepart "python heavy_lifting.py -o output{#}.fastq"

For a more general FASTQ parallel wrapper see: https://stackoverflow.com/a/41707920/363028

For reproducibility I implemented the answer provided by furas in the heavy_lifting.py and wrapper.py scripts. Additional edits were needed to make the code run, which is why I am providing the following.

heavy_lifting.py

#!/usr/bin/env python
import argparse

def heavy_lifting_fun(inputFastq, outputFastq):
    # Iterate through input file and append modified lines to the output file
    with open(inputFastq, "r") as infile, open(outputFastq, "a") as outfile:
        for line in infile:
            outfile.write("modified" + line.strip() + "\n")

if __name__ == '__main__':
    # Read in arguments only when run as a stand-alone script, so that
    # importing this module from wrapper.py does not trigger argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--inputFastq', required=True, help='forward .fastq')
    parser.add_argument('-o', '--outputFastq', required=True, help='output .fastq')
    args = parser.parse_args()
    heavy_lifting_fun(args.inputFastq, args.outputFastq)

wrapper.py

#!/usr/bin/env python

import heavy_lifting
from joblib import Parallel, delayed
import multiprocessing

numcores = 4
fastq_F = "fastq_F.fastq"

#Create some logic to split the input fastq file into chunks for parallel processing.  

# Get input fastq file dimensions
with open(fastq_F, "r") as infile:
    length_fastq = len(infile.readlines())
    print(length_fastq)

# Each chunk must hold a whole number of 4-line fastq records
split_size = length_fastq // numcores
while split_size % 4 != 0:
    split_size += 1
print(split_size)

# Iterate through input fastq file writing lines to outfile in bins.
counter = 0
split_counter = 0
split_fastq_list = []
with open(fastq_F, "r") as infile:
    for line in infile:
        print(counter)
        #if counter == 0 and line[0] != "@":
        #    continue
        if counter == 0:
            filename = str("./split_fastq_F_" + str(split_counter))
            split_fastq_list.append(filename)
            outfile = open(filename, "a")
            outfile.write(str(line.strip() + "\n"))
            counter += 1
        elif counter < split_size:
            outfile.write(str(line.strip() + "\n"))
            counter += 1
        else:
            counter = 0
            split_counter += 1
            outfile.close()
            filename = str("./split_fastq_F_" + str(split_counter))
            split_fastq_list.append(filename)
            outfile = open(filename, "a")
            outfile.write(str(line.strip() + "\n"))
            counter += 1
    outfile.close()

Parallel(n_jobs=numcores)(delayed(heavy_lifting.heavy_lifting_fun)(i, "output.fastq") for i in split_fastq_list)
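Unlike wrapper.sh, which removes the intermediate files with rm split_fastq_*, this wrapper leaves the split files behind. A minimal cleanup sketch, assuming the chunks are no longer needed once the Parallel call has returned:

import os

# Delete the intermediate split fastq files created above
for split_file in split_fastq_list:
    os.remove(split_file)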
