[英]Aggregating results in Pandas after using mpi4py
Finally, this is my first question on StackOverflow:
As a university project, I am trying to write the code for KMeans from scratch, and then use mpi4py to run repetitions with different random starting centers in parallel.
Here is the code:
#!/usr/bin/env python
# coding: utf-8

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from mpi4py import MPI
# import statistics as stat

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
print('no of processors is', size)
print('this is the process #', rank)

df = pd.read_csv('data.dat',
                 sep=' ',
                 header=None,
                 index_col=0, engine='python')

n_mus = [1, 2, 4, 12]  # 100]#, 1000]
cost_k = []
k_vals = range(1, 5, 2)
# k_vals = range(1, 30, 6)
for orig_n_mu in n_mus:
    # spread the orig_n_mu repetitions across the ranks
    n_mu = orig_n_mu // size
    if rank in range(orig_n_mu % size):
        n_mu += 1
    for k in k_vals:
        cost_n = []
        for n in range(1, n_mu + 1):
            # random initial centers, seeded per repetition
            np.random.seed(n * k + k)
            kx = np.random.uniform(df[1].min(), df[1].max(), k)
            np.random.seed(n * k + k + 1)
            ky = np.random.uniform(df[2].min(), df[2].max(), k)
            # Manhattan distance of every point to every center
            manh = pd.DataFrame()
            for c in range(k):
                manh[c] = abs(df[1] - kx[c]) + abs(df[2] - ky[c])
            df['center'] = manh.idxmin(axis='columns')
            kx = df.groupby('center').mean()[1]
            ky = df.groupby('center').mean()[2]
            if df.center.unique().shape[0] != k:
                print('not all centers took up clusters at the number', n,
                      'repetition')
                print('the current number of clusters is:',
                      df.center.unique().shape[0], 'instead of', k)
            # iterate until the cost stops improving
            diff = 10
            while diff > 1e-4:
                cost = manh.min(axis=1).mean()
                for c in df.center.unique():
                    manh[c] = abs(df[1] - kx[c]) + abs(df[2] - ky[c])
                df['center'] = manh.idxmin(axis='columns')
                kx = df.groupby('center').mean()[1]
                ky = df.groupby('center').mean()[2]
                new_cost = manh.min(axis=1).mean()
                diff = cost - new_cost
            cost_n.append(new_cost)
        cost_k.append([k, rank, n_mu, orig_n_mu, cost_n])
print('process #', rank, 'is done here')

all_cost = comm.gather(cost_k, root=0)
if rank == 0:
    print('check point #1')
    all_cost = np.reshape(all_cost, newshape=(-1, len(cost_k[0])))
    print('the shape of all cost is', all_cost.shape)
    res = pd.DataFrame(all_cost,
                       columns=['k_val', 'rank', 'n_mu', 'orig_n_mu', 'cost_res'])
    noruns = (res.n_mu == 0)
    res = res[~noruns].copy()
    res.reset_index(inplace=True, drop=True)
    print('check point #2')
    cost_funcs = pd.DataFrame(res.cost_res.to_list())
    print('check point #3')
    km_df = pd.merge(res, cost_funcs, how='outer',
                     left_index=True, right_index=True)
    print('check point #4')
    km_df.drop(columns='cost_res', inplace=True)
    km_df['avg_final_cost'] = cost_funcs.apply(np.nanmean, axis=1)
    km_df['std_final_cost'] = cost_funcs.apply(np.nanstd, axis=1)
    km_df['min_final_cost'] = cost_funcs.apply(min, axis=1)
    km_df['max_final_cost'] = cost_funcs.apply(max, axis=1)
    km_df.to_csv('km_df_test_para.csv')
    # km_df
The resulting csv looks like this: example csv screenshot
Here n is the number of runs on each core, and orig_n is the total number of runs over which I should do the analysis, record timings, compute the standard deviation, mean, and so on. The columns 0, 1, 2, ... hold the cost from each run, and the column name is the run index on a single core.
Now I need to group all these runs by n_orig, but I don't know how to tell pandas to put all the values that share the same n_orig and k on the same row. As you can tell, I am also very new to MPI and don't know how Gather collects my data; the Gather and Gatherv commands keep throwing the error "0".
I would appreciate any help you can give :)
I don't fully understand what you mean, but maybe it would be enough to sort the rows by n_orig and k_val?
km_df.sort_values(by=['n_orig', 'k_val'], inplace=True)
Alternatively, maybe you can split km_df into smaller DataFrames with constant n_orig and k_val:
from itertools import product

groups = []
for n_orig, k_val in product(km_df['n_orig'].unique(), km_df['k_val'].unique()):
    sel = (km_df['n_orig'] == n_orig) & (km_df['k_val'] == k_val)
    groups.append(km_df[sel])
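If the goal is to collapse all runs that share the same n_orig and k_val into one set of statistics, a melt + groupby sketch along these lines might also work. The toy frame below, including its column names and values, is made up purely for illustration (the question's code actually names the column orig_n_mu), so adapt it to the real km_df:

```python
import pandas as pd

# Toy frame mimicking km_df's layout (made-up values): one row per
# (rank, n_orig, k_val) combination, with the per-run costs spread
# across the numbered columns 0, 1, ...
km_df = pd.DataFrame({
    'k_val':  [1, 1, 3, 3],
    'rank':   [0, 1, 0, 1],
    'n_orig': [4, 4, 4, 4],
    0: [2.0, 2.2, 1.0, 1.1],
    1: [2.4, None, 1.2, None],  # NaN where a rank ran fewer repetitions
})

# Stack the numbered run columns into one long 'cost' column, then
# aggregate across all ranks that share (n_orig, k_val).
run_cols = [c for c in km_df.columns if isinstance(c, int)]
long = km_df.melt(id_vars=['n_orig', 'k_val'], value_vars=run_cols,
                  value_name='cost').dropna(subset=['cost'])
stats = long.groupby(['n_orig', 'k_val'])['cost'].agg(
    ['mean', 'std', 'min', 'max'])
print(stats)
```

This gives one row per (n_orig, k_val) pair with the statistics computed over every run from every rank, which avoids having to line the runs up on a single wide row first.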