
How to run a Python script in parallel for bioinformatics

I wish to use Python to read in a FASTA sequence file and convert it into a pandas DataFrame. I use the following script:

from Bio import SeqIO
import pandas as pd

def fasta2df(infile):
    records = SeqIO.parse(infile, 'fasta')
    seqList = []
    for record in records:
        desp = record.description
        # print(desp)
        seq = list(record.seq._data.upper())
        seqList.append([desp] + seq)
        seq_df = pd.DataFrame(seqList)
        print(seq_df.shape)
        seq_df.columns=['strainName']+list(range(1, seq_df.shape[1]))
    return seq_df


if __name__ == "__main__":
    path = 'path/to/the/fasta/file'
    input = path + 'GISAIDspikeprot0119.selection.fasta'
    df = fasta2df(input)

The 'GISAIDspikeprot0119.selection.fasta' file can be found at https://drive.google.com/file/d/1F5Ir5S6h9rFsVUQkDdZpomiWo9_bXtaW/view?usp=sharing

The script runs on my Linux workstation with only one CPU core. Is it possible to run it with more cores (multiple processes) so that it finishes much faster? What would the code for that look like?

With many thanks!

Before throwing more CPUs at your problem, you should invest some time in inspecting which parts of your code are slow.
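For example, the standard library's cProfile module can show where the time goes. The function below is a toy stand-in (not your actual code) that repeats expensive work inside a loop, much like the repeated DataFrame construction:

```python
import cProfile
import io
import pstats

def slow_build(n):
    # Toy stand-in for the FASTA loop: redoing expensive work
    # on every iteration, like the repeated pd.DataFrame call.
    rows = []
    for i in range(n):
        rows.append(i)
        _ = list(rows)  # wasted work, redone each iteration
    return rows

profiler = cProfile.Profile()
profiler.enable()
slow_build(3000)
profiler.disable()

# Print the five most expensive calls by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

The report makes it obvious when one line dominates the runtime, which is exactly the situation here.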

In your case, you are executing the expensive conversion seq_df = pd.DataFrame(seqList) in every loop iteration. This wastes CPU time, since the result seq_df is overwritten in the next iteration.

Your code took over 15 minutes on my machine. After moving pd.DataFrame(seqList) and the print statement out of the loop, it is down to ~15 seconds.

def fasta2df(infile):
    records = SeqIO.parse(infile, 'fasta')
    seqList = []
    for record in records:
        desp = record.description
        seq = list(str(record.seq).upper())  # str() avoids the private ._data attribute
        seqList.append([desp] + seq)
    seq_df = pd.DataFrame(seqList)
    seq_df.columns = ['strainName'] + list(range(1, seq_df.shape[1]))
    return seq_df

In fact, almost all of the time is spent in the line seq_df = pd.DataFrame(seqList) - about 13 seconds for me. By setting the dtype explicitly to string, we can bring it down to ~7 seconds:

def fasta2df(infile):
    records = SeqIO.parse(infile, 'fasta')
    seqList = []
    for record in records:
        desp = record.description
        seq = list(str(record.seq).upper())  # str() avoids the private ._data attribute
        seqList.append([desp] + seq)
    seq_df = pd.DataFrame(seqList, dtype="string")
    seq_df.columns = ['strainName'] + list(range(1, seq_df.shape[1]))
    return seq_df

With this new performance, I highly doubt that parallel processing would improve the speed any further.
