
How to parallelize pairwise column comparisons on a pandas DataFrame?

from itertools import combinations

result = []
for i, j in combinations(df.columns, 2):
    result.append([i, j, compare_func(df[i], df[j])])

This code works, but it is extremely slow because it only uses a single core.

I tried Dask for parallelization, but Dask only supports row-based parallelization, which is not what I want.

joblib is also slow; I guess it copies the DataFrame many times, once per core.

Can someone recommend a good way to use all cores?

Try:

import pandas as pd
import numpy as np
import multiprocessing as mp
from itertools import combinations
import string

def compare_func(sr1, sr2):
    """Your comparison function."""
    return sr1 > sr2

def compare_func_proxy(df):
    """Proxy function to convert sequential to parallel execution."""
    return (df.columns[0], df.columns[1], compare_func(df.iloc[:, 0], df.iloc[:, 1]))

if __name__ == '__main__':  # Don't remove this guard: worker processes re-import this module
    # Setup a MRE
    rng = np.random.default_rng(2022)
    cols = list(string.ascii_uppercase)
    rows = range(10000)
    data = rng.integers(1, 100, (len(rows), len(cols)))
    df = pd.DataFrame(data, rows, cols)

    # Parallel execution
    with mp.Pool(mp.cpu_count() - 1) as p:
        results = p.map(compare_func_proxy,
                        [df[list(c)] for c in combinations(df, 2)])
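As an aside (not part of the answer above): when `compare_func` is a simple elementwise comparison like `sr1 > sr2`, NumPy broadcasting can compute every column pair in a single vectorized pass, with no pickling of DataFrame slices between processes. A minimal sketch, assuming the same kind of random integer data as the example above:

```python
import numpy as np
import pandas as pd

# Small example frame (stand-in for the 26-column DataFrame above)
rng = np.random.default_rng(2022)
df = pd.DataFrame(rng.integers(1, 100, (1000, 4)), columns=list("ABCD"))

a = df.to_numpy()
# greater[:, i, j] is True where column i > column j, for all pairs at once:
# (rows, cols, 1) compared against (rows, 1, cols) broadcasts to (rows, cols, cols)
greater = a[:, :, None] > a[:, None, :]

# The slice for one pair matches the per-pair comparison
pair_ab = greater[:, 0, 1]  # same as (df["A"] > df["B"]).to_numpy()
```

The trade-off is memory: the result array is `rows × cols × cols` booleans, so for many columns the pairwise `Pool.map` approach above may still be preferable.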
