from itertools import combinations

result = []
for i, j in combinations(df.columns, 2):
    result.append([i, j, compare_func(df[i], df[j])])
This code works correctly, but it is extremely slow because it only uses a single core.
I tried Dask for parallelization, but Dask only supports row-based parallelization, which is not what I want.
joblib is also slow; I suspect it copies the DataFrame once for every core.
Can someone recommend a good way to use all cores?
Try:
import pandas as pd
import numpy as np
import multiprocessing as mp
from itertools import combinations
import string

def compare_func(sr1, sr2):
    """Your comparison function."""
    return sr1 > sr2

def compare_func_proxy(df):
    """Proxy function to convert sequential to parallel execution."""
    return (df.columns[0], df.columns[1], compare_func(df.iloc[:, 0], df.iloc[:, 1]))

if __name__ == '__main__':  # Don't remove this line!
    # Set up an MRE
    rng = np.random.default_rng(2022)
    cols = list(string.ascii_uppercase)
    rows = range(10000)
    data = rng.integers(1, 100, (len(rows), len(cols)))
    df = pd.DataFrame(data, rows, cols)

    # Parallel execution
    with mp.Pool(mp.cpu_count() - 1) as p:
        results = p.map(compare_func_proxy,
                        [df[list(c)] for c in combinations(df, 2)])