
How to run a Python script in parallel for bioinformatics

I wish to use Python to read in a FASTA sequence file and convert it into a pandas DataFrame. I use the following script:

from Bio import SeqIO
import pandas as pd

def fasta2df(infile):
    records = SeqIO.parse(infile, 'fasta')
    seqList = []
    for record in records:
        desp = record.description
        # print(desp)
        # use str(record.seq) rather than the private ._data attribute,
        # which is a bytes object in recent Biopython versions
        seq = list(str(record.seq).upper())
        seqList.append([desp] + seq)
        seq_df = pd.DataFrame(seqList)
        print(seq_df.shape)
        seq_df.columns=['strainName']+list(range(1, seq_df.shape[1]))
    return seq_df


if __name__ == "__main__":
    path = 'path/to/the/fasta/file'
    infile = path + 'GISAIDspikeprot0119.selection.fasta'  # avoid shadowing the builtin 'input'
    df = fasta2df(infile)

The 'GISAIDspikeprot0119.selection.fasta' file can be found at https://drive.google.com/file/d/1F5Ir5S6h9rFsVUQkDdZpomiWo9_bXtaW/view?usp=sharing

The script runs on my Linux workstation using only one CPU core, but is it possible to run it with more cores (multiple processes) so that it finishes much faster? What would the code for that look like?

With many thanks!

Before throwing more CPUs at your problem, you should invest some time in inspecting which parts of your code are slow.
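To make the point concrete, here is a minimal timing sketch (toy data of my own, not your FASTA file; stdlib timing only) contrasting building the DataFrame inside the loop with building it once afterwards:

```python
import time
import pandas as pd

# hypothetical stand-in for parsed FASTA records: 2000 rows of short sequences
rows = [[f"seq{i}"] + list("MFVFLVLLPLVSSQ") for i in range(2000)]

# Variant 1: rebuild the DataFrame on every iteration (what the original loop does)
start = time.perf_counter()
for i in range(1, len(rows) + 1):
    df = pd.DataFrame(rows[:i])  # result is thrown away on the next iteration
inside = time.perf_counter() - start

# Variant 2: collect rows first, build the DataFrame once after the loop
start = time.perf_counter()
collected = []
for row in rows:
    collected.append(row)
df = pd.DataFrame(collected)
outside = time.perf_counter() - start

print(f"rebuild each iteration: {inside:.3f}s, build once: {outside:.4f}s")
```

The in-loop variant does quadratic work overall, so the gap grows quickly with the number of records.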

In your case, you are executing the expensive conversion seq_df = pd.DataFrame(seqList) in every loop iteration. This just wastes CPU time, as the result seq_df is overwritten in the next iteration.

Your code took over 15 minutes on my machine. After moving pd.DataFrame(seqList) and the print statement out of the loop, it is down to ~15 seconds.

def fasta2df(infile):
    records = SeqIO.parse(infile, 'fasta')
    seqList = []
    for record in records:
        desp = record.description
        # str(record.seq) avoids the private ._data attribute (bytes in recent Biopython)
        seq = list(str(record.seq).upper())
        seqList.append([desp] + seq)
    seq_df = pd.DataFrame(seqList)
    seq_df.columns = ['strainName'] + list(range(1, seq_df.shape[1]))
    return seq_df

In fact, almost all the time is spent in the line seq_df = pd.DataFrame(seqList) - about 13 seconds for me. By setting the dtype explicitly to string, we can bring it down to ~7 seconds:

def fasta2df(infile):
    records = SeqIO.parse(infile, 'fasta')
    seqList = []
    for record in records:
        desp = record.description
        # str(record.seq) avoids the private ._data attribute (bytes in recent Biopython)
        seq = list(str(record.seq).upper())
        seqList.append([desp] + seq)
    seq_df = pd.DataFrame(seqList, dtype="string")
    seq_df.columns = ['strainName'] + list(range(1, seq_df.shape[1]))
    return seq_df
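To see what passing dtype="string" changes, here is a small sketch (toy data of my own): with an explicit dtype, pandas can skip per-column type inference and store every column with its string extension dtype.

```python
import pandas as pd

# two hypothetical FASTA records, one character per column
data = [["strainA"] + list("MFVF"),
        ["strainB"] + list("MFVL")]

df_inferred = pd.DataFrame(data)                 # dtypes inferred -> object columns
df_string = pd.DataFrame(data, dtype="string")   # explicit pandas string dtype

print(df_inferred.dtypes.iloc[0])  # object
print(df_string.dtypes.iloc[0])    # string
```

Skipping inference is where the ~13s to ~7s improvement comes from on the full file.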

With this new performance, I highly doubt that you can improve the speed any further by parallel processing.

