
Grouping a dataframe and performing operations on the resulting matrix in a parallelized manner using Python/Dask/multiprocessing?

I am working on a project where I need to group molecules in a database by their ID and perform operations on the resulting matrix. I am using Python and I want to improve performance by parallelizing the process.

I am currently loading the molecules from an SDF file and storing them in a Pandas dataframe. Each molecule has an ID, a unique Pose ID, and a unique Structure. My goal is to group the dataframe by ID and create a matrix for each ID group. The rows and columns of the matrix would correspond to the unique Pose IDs of the molecules in that ID group. Then, I can calculate values for each cell in the matrix, such as the similarity between the molecules that define that cell. However, the specific operations on the molecules are not important for this question. I am primarily asking for advice on how to set up such a system for parallelized computing using Dask or Multiprocessing, or if there are other better options.

Here is a gist of the version without any parallelisation (note that I have heavily modified it to make my question clearer; the code below produces the desired output, but in reality the cell values are calculated from the molecules, not from the Pose IDs): https://gist.github.com/Tonylac77/abfd54b1ceef39f0d161fb6b21950edb

# Generate a sample dataframe

import pandas as pd

ids = ['ID' + str(i) for i in range(1, 6)]
pose_ids = ['Pose ' + str(i) for i in range(1, 11)]
# For each ID, add 10 rows to the dataframe with the corresponding Pose IDs
df_list = []
for i in ids:
    temp_df = pd.DataFrame({'ID': [i] * 10, 'Pose ID': pose_ids})
    df_list.append(temp_df)
df = pd.concat(df_list)
print(df)

################

from tqdm import tqdm
import itertools
import numpy as np
from IPython.display import display

def full_matrix_calculation(df):
    # Here I am using string concatenation as an example calculation;
    # in reality I am calling external functions on the molecules.
    def matrix_calculation(df, id_list):
        matrices = {}
        for id in tqdm(id_list):
            df_name = df[df['ID'] == id]
            df_name.index = range(len(df_name['Pose ID']))
            matrix = pd.DataFrame(0.0, index=df_name['Pose ID'], columns=df_name['Pose ID'])
            for subset in itertools.combinations(df_name['Pose ID'], 2):
                result = subset[0] + subset[1]
                matrix.iloc[df_name[df_name['Pose ID'] == subset[0]].index.values, df_name[df_name['Pose ID'] == subset[1]].index.values] = result
                matrix.iloc[df_name[df_name['Pose ID'] == subset[1]].index.values, df_name[df_name['Pose ID'] == subset[0]].index.values] = result
            matrices[id] = matrix
        return matrices
    id_list = np.unique(np.array(df['ID']))
    calculated_dfs = matrix_calculation(df, id_list)
    return calculated_dfs

calculated_dfs = full_matrix_calculation(df)
display(calculated_dfs)

I have tried using multiprocessing; however, my implementation appears to be slower than the non-parallelised version: https://gist.github.com/Tonylac77/b4bbada97ee2bab7c37d4a29079af574

import multiprocessing

def function(pair):
    return pair[0] + pair[1]

def full_matrix_calculation(df):
    # Here I am using string concatenation as an example calculation;
    # in reality I am calling external functions on the molecules.
    def matrix_calculation(df, id_list):
        matrices = {}
        for id in tqdm(id_list):
            df_name = df[df['ID'] == id]
            df_name.index = range(len(df_name['Pose ID']))
            matrix = pd.DataFrame(0.0, index=df_name['Pose ID'], columns=df_name['Pose ID'])
            # Materialise the combinations once so the same pairs are mapped and zipped
            pairs = list(itertools.combinations(df_name['Pose ID'], 2))
            with multiprocessing.Pool() as p:
                try:
                    results = p.map(function, pairs)
                except KeyError:
                    print('Incorrect clustering method selected')
                    return
            for subset, result in zip(pairs, results):
                matrix.iloc[df_name[df_name['Pose ID'] == subset[0]].index.values, df_name[df_name['Pose ID'] == subset[1]].index.values] = result
                matrix.iloc[df_name[df_name['Pose ID'] == subset[1]].index.values, df_name[df_name['Pose ID'] == subset[0]].index.values] = result
            matrices[id] = matrix
        return matrices
    id_list = np.unique(np.array(df['ID']))
    calculated_dfs = matrix_calculation(df, id_list)
    return calculated_dfs

calculated_dfs = full_matrix_calculation(df)
display(calculated_dfs)
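(A possible explanation I am considering: each pair is a tiny unit of work, so the pickling and inter-process overhead of p.map dominates. Below is an untested sketch that parallelises over whole ID groups instead, so each worker gets a meaningful chunk of work; pairwise_calc is a placeholder for the real molecule comparison and must be a picklable, top-level function.)

import itertools
import multiprocessing

import pandas as pd

def pairwise_calc(a, b):
    # Placeholder for the real molecule similarity calculation
    return a + b

def matrix_for_group(group):
    # Build the full pairwise matrix for one ID group inside a single worker
    group = group.reset_index(drop=True)
    poses = group['Pose ID']
    matrix = pd.DataFrame(index=poses, columns=poses)
    for a, b in itertools.combinations(poses, 2):
        result = pairwise_calc(a, b)
        matrix.loc[a, b] = result
        matrix.loc[b, a] = result
    return matrix

def full_matrix_calculation_by_group(df):
    # One task per ID group, large enough to amortise process start-up costs
    ids, groups = zip(*df.groupby('ID'))
    with multiprocessing.Pool() as pool:
        results = pool.map(matrix_for_group, groups)
    return dict(zip(ids, results))

# Needs to run under `if __name__ == '__main__':` when used as a script
calculated_dfs = full_matrix_calculation_by_group(df)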

I have also started playing around with Dask. The main issue I'm facing is that all of the rows for one ID need to end up in the same Dask partition, otherwise I will get incomplete matrices (if I understand correctly, at least). I have tried to find a solution to this (such as chunking into x partitions) but so far to no avail. I will update this thread if anything changes.
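In the meantime, here is the co-partitioning step as I understand it from the Dask docs (an untested sketch; npartitions=8 is an arbitrary choice): set_index shuffles the data so that every row of a given ID ends up in exactly one partition.

import dask.dataframe as dd

# Shuffle so that all rows sharing an ID land in the same partition
ddf = dd.from_pandas(df, npartitions=8)
ddf = ddf.set_index('ID')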

Any advice on speeding up these calculations is welcome. For reference, the actual datasets I'm working with contain ~10,000 unique IDs and ~300,000 Pose IDs. With the calculations I'm running on the molecules, some of these take 40 hours to complete.

This should be pretty straightforward using Dask DataFrame and groupby:

import itertools
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=4)  # your dataframe as a Dask DataFrame

def matrix_calculation(df):
    # groupby().apply() passes each group as a single dataframe argument
    df = df.reset_index(drop=True)
    matrix = pd.DataFrame(0.0, index=df['Pose ID'], columns=df['Pose ID'])
    for subset in itertools.combinations(df['Pose ID'], 2):
        result = subset[0] + subset[1]
        matrix.iloc[df[df['Pose ID'] == subset[0]].index.values, df[df['Pose ID'] == subset[1]].index.values] = result
        matrix.iloc[df[df['Pose ID'] == subset[1]].index.values, df[df['Pose ID'] == subset[0]].index.values] = result
    return matrix

ddf.groupby('ID').apply(matrix_calculation).compute()

See https://examples.dask.org/dataframes/02-groupby.html#Groupby-Apply.

This will parallelize the work for each ID.
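If concatenating the ragged per-group matrices from groupby-apply is awkward for your use case (each matrix has its own row/column labels), a variant with dask.delayed keeps them as a plain dict instead (a sketch, assuming matrix_calculation as defined above and df as the original pandas dataframe):

import dask

# One delayed task per ID; computing returns {ID: matrix}
delayed_matrices = {
    group_id: dask.delayed(matrix_calculation)(group)
    for group_id, group in df.groupby('ID')
}
matrices = dask.compute(delayed_matrices)[0]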

You might then want to look at https://docs.dask.org/en/stable/scheduling.html to choose the scheduler that suits your needs (the default for DataFrame is the threaded scheduler, which might not be efficient depending on your code).
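For example, for CPU-bound code that holds the GIL you can request the multiprocessing scheduler per call or globally (a sketch):

import dask

# Per call: override the default threaded scheduler
result = ddf.groupby('ID').apply(matrix_calculation).compute(scheduler='processes')

# Or globally for the whole session
dask.config.set(scheduler='processes')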
