Kmeans with groupby in dataframe and get cluster in python
I'm working with a DataFrame like this:
df=pd.DataFrame({'ID':['12345','55689','56964','49649','89645','0001',
'033','03330','064963','306193','03661','1666'],
'Culture':['A','A','A','A','A','A','B','B','B','B','B','B'],
'H': [30,42,25,32,12,10,4,6,5,10,24,21],
'S':[10,76,100,23,65,94,67,24,67,54,87,81],
'mean': [23,78,95,52,60,76,68,92,34,76,34,12]})
First I selected a single group with df_1=df.loc[(df['Culture']=='A')]
to run kmeans on it like this:
m=df_1.loc[:,['H','mean','mean']].to_numpy()
km = KMeans(n_clusters=3, init='random', max_iter=100, n_init=1, verbose=1).fit(m)
kmeans_predict = km.predict(m)
array([0, 2, 1, 1, 0, 0], dtype=int32)
clusters = {}
n = 0
for item in kmeans_predict:
    if item in clusters:
        clusters[item].append(list_x1[n])
    else:
        clusters[item] = [list_x1[n]]
    n += 1
After some more code I ended up with something like this:
ID Culture S mean Cluster
12345 A 10 23 0
55689 A 76 78 2
56964 A 100 95 1
49649 A 23 52 1
89645 A 65 60 0
00001 A 94 92 0
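My goal is to run kmeans for every group in this dataframe, but I don't want to do all of this one group at a time (the groups are the Culture values, and there are more than 75 of them). I tried something like: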
def cluster(X):
    k_means = KMeans(n_clusters=3).fit(m).groupby('CUL')
    X['cluster'] = k_means.labels_
    return X
df= cities_e.groupby('CUL').apply(cluster)
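I'm trying to run this clustering within each 'Culture' group and get the predicted cluster for every row back in the DataFrame.
You can simply wrap your code in a function and use groupby.apply. However, to keep the index so the labels line up with the original rows, return a Series rather than an array: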
from sklearn.cluster import KMeans
def get_cluster(df_1):
    m = df_1.loc[:, ['H', 'mean', 'mean']].to_numpy()
    km = KMeans(n_clusters=3, init='random', max_iter=100, n_init=1, verbose=0).fit(m)
    kmeans_predict = km.predict(m)
    return pd.Series(kmeans_predict, index=df_1.index)
df['Cluster'] = df.groupby('Culture').apply(get_cluster).droplevel(0)
Output:
ID Culture H S mean Cluster
0 12345 A 30 10 23 2
1 55689 A 42 76 78 0
2 56964 A 25 100 95 1
3 49649 A 32 23 52 2
4 89645 A 12 65 60 2
5 0001 A 10 94 76 1
6 033 B 4 67 68 1
7 03330 B 6 24 92 0
8 064963 B 5 67 34 2
9 306193 B 10 54 76 0
10 03661 B 24 87 34 2
11 1666 B 21 81 12 2
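As a variation on the same idea, here is a minimal sketch that loops over the groups explicitly and writes the labels back by index, which sidesteps the shape of the groupby.apply result. Several things in it are assumptions rather than part of the question: the helper name label_groups, the feature list ['H', 'S', 'mean'] (the code above selects ['H', 'mean', 'mean']), the fixed random_state, and the choice to mark groups that have fewer rows than n_clusters with -1, since KMeans refuses to fit when there are fewer samples than clusters.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    'ID': ['12345', '55689', '56964', '49649', '89645', '0001',
           '033', '03330', '064963', '306193', '03661', '1666'],
    'Culture': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'H': [30, 42, 25, 32, 12, 10, 4, 6, 5, 10, 24, 21],
    'S': [10, 76, 100, 23, 65, 94, 67, 24, 67, 54, 87, 81],
    'mean': [23, 78, 95, 52, 60, 76, 68, 92, 34, 76, 34, 12],
})

# Assumption: these are the intended feature columns.
FEATURES = ['H', 'S', 'mean']


def label_groups(frame, key='Culture', n_clusters=3, seed=0):
    """Return per-group KMeans labels as a Series aligned to frame.index."""
    # -1 marks rows whose group was too small to cluster.
    labels = pd.Series(-1, index=frame.index, dtype='int64')
    for _, group in frame.groupby(key):
        if len(group) < n_clusters:
            continue  # KMeans needs at least n_clusters samples
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        labels.loc[group.index] = km.fit_predict(group[FEATURES].to_numpy())
    return labels


df['Cluster'] = label_groups(df)
```

The labels are only meaningful within a group: cluster 0 in Culture A has nothing to do with cluster 0 in Culture B, and the numbering itself can change between runs unless the seed is fixed.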
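If you want the cluster numbers to be distinct across the different Cultures, we can assign a group number to each Culture and use it to offset the cluster numbers: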
def get_cluster(df_1):
    m = df_1.loc[:, ['H', 'mean', 'mean']].to_numpy()
    km = KMeans(n_clusters=3, init='random', max_iter=100, n_init=1, verbose=0).fit(m)
    kmeans_predict = km.predict(m) + 3 * df_1['Culture_id'].iat[0]
    return pd.Series(kmeans_predict, index=df_1.index)
g = df.groupby('Culture')
df['Culture_id'] = g.ngroup()
df['Cluster'] = g.apply(get_cluster).droplevel(0)
df = df.drop(columns=['Culture_id'])
Output:
ID Culture H S mean Cluster
0 12345 A 30 10 23 0
1 55689 A 42 76 78 1
2 56964 A 25 100 95 1
3 49649 A 32 23 52 0
4 89645 A 12 65 60 2
5 0001 A 10 94 76 2
6 033 B 4 67 68 3
7 03330 B 6 24 92 5
8 064963 B 5 67 34 4
9 306193 B 10 54 76 3
10 03661 B 24 87 34 4
11 1666 B 21 81 12 4
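The offset arithmetic can also be applied after the fact, without a temporary Culture_id column. The sketch below is only an illustration: it takes the per-group labels from the first output table as given, assumes every group was clustered with the same number of clusters (3), and writes the result to a hypothetical GlobalCluster column.

```python
import pandas as pd

N_CLUSTERS = 3  # assumption: every group uses the same number of clusters

# Per-group labels copied from the first output table above.
labelled = pd.DataFrame({
    'Culture': ['A'] * 6 + ['B'] * 6,
    'Cluster': [2, 0, 1, 2, 2, 1, 1, 0, 2, 0, 2, 2],
})

# ngroup() numbers the groups 0, 1, ... in sorted key order (A -> 0, B -> 1).
group_id = labelled.groupby('Culture').ngroup()

# Shift each group's labels into its own block of N_CLUSTERS ids.
labelled['GlobalCluster'] = labelled['Cluster'] + N_CLUSTERS * group_id

print(labelled['GlobalCluster'].tolist())
# [2, 0, 1, 2, 2, 1, 4, 3, 5, 3, 5, 5]
```

Culture A keeps ids 0 to 2 and Culture B moves to 3 to 5, so no two Cultures share a cluster id.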