KMeans with groupby in a DataFrame and get the cluster in Python
I am working with a DataFrame like this:
df = pd.DataFrame({'ID': ['12345','55689','56964','49649','89645','0001',
                          '033','03330','064963','306193','03661','1666'],
                   'Culture': ['A','A','A','A','A','A','B','B','B','B','B','B'],
                   'H': [30,42,25,32,12,10,4,6,5,10,24,21],
                   'S': [10,76,100,23,65,94,67,24,67,54,87,81],
                   'mean': [23,78,95,52,60,76,68,92,34,76,34,12]})
First I selected just one group with df_1=df.loc[(df['Culture']=='A')] to do k-means like this:
m = df_1.loc[:, ['H','mean','mean']].to_numpy()
# The model has to be fitted before predict() can be called
km = KMeans(n_clusters=3, init='random', max_iter=100, n_init=1, verbose=1).fit(m)
kmeans_predict = km.predict(m)
array([0, 2, 1, 1, 0, 0], dtype=int32)
clusters = {}
n = 0
for item in kmeans_predict:
    if item in clusters:
        clusters[item].append(list_x1[n])
    else:
        clusters[item] = [list_x1[n]]
    n += 1
After some more code, I got something like this:
ID Culture S mean Cluster
12345 A 10 23 0
55689 A 76 78 2
56964 A 100 95 1
49649 A 23 52 1
89645 A 65 60 0
00001 A 94 92 0
My goal is to run k-means on every group in this DataFrame, but I do not want to do all this group by group, because there are more than 75 groups (one per Culture). I tried something like:
def cluster(X):
    k_means = KMeans(n_clusters=3).fit(m).groupby('CUL')
    X['cluster'] = k_means.labels_
    return X

df = cities_e.groupby('CUL').apply(cluster)
I am trying to run this clustering within each 'Culture' group and get each row's predicted cluster in the DataFrame.
You could simply wrap your code in a function and use groupby.apply. However, to keep the original row index, return a Series instead of an array:
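import pandas as pd
from sklearn.cluster import KMeans

def get_cluster(df_1):
    m = df_1.loc[:, ['H','mean','mean']].to_numpy()
    km = KMeans(n_clusters=3, init='random', max_iter=100, n_init=1, verbose=0).fit(m)
    kmeans_predict = km.predict(m)
    # Reuse the group's index so the labels align with the original rows
    return pd.Series(kmeans_predict, index=df_1.index)

# droplevel(0) removes the 'Culture' level that groupby.apply adds to the index
df['Cluster'] = df.groupby('Culture').apply(get_cluster).droplevel(0)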
Output:
ID Culture H S mean Cluster
0 12345 A 30 10 23 2
1 55689 A 42 76 78 0
2 56964 A 25 100 95 1
3 49649 A 32 23 52 2
4 89645 A 12 65 60 2
5 0001 A 10 94 76 1
6 033 B 4 67 68 1
7 03330 B 6 24 92 0
8 064963 B 5 67 34 2
9 306193 B 10 54 76 0
10 03661 B 24 87 34 2
11 1666 B 21 81 12 2
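The same per-group idea can also be written as an explicit loop over the groups, which avoids reshaping the index after groupby.apply. This is only a minimal sketch under some assumptions that are not in the original post: the helper name cluster_group, the feature columns ['H', 'S', 'mean'], the fixed random_state (so that repeated runs give the same labels), and capping the number of clusters for groups with fewer rows than n_clusters.

```python
import pandas as pd
from sklearn.cluster import KMeans

def cluster_group(group, feature_cols, n_clusters=3, seed=0):
    """Hypothetical helper: one cluster label per row, aligned to group.index."""
    # KMeans needs at least as many rows as clusters, so cap k for small groups
    k = min(n_clusters, len(group))
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = km.fit_predict(group[feature_cols].to_numpy())
    return pd.Series(labels, index=group.index)

df = pd.DataFrame({'ID': ['12345','55689','56964','49649','89645','0001',
                          '033','03330','064963','306193','03661','1666'],
                   'Culture': ['A','A','A','A','A','A','B','B','B','B','B','B'],
                   'H': [30,42,25,32,12,10,4,6,5,10,24,21],
                   'S': [10,76,100,23,65,94,67,24,67,54,87,81],
                   'mean': [23,78,95,52,60,76,68,92,34,76,34,12]})

# Feature columns are an assumption; use whichever columns you want to cluster on
features = ['H', 'S', 'mean']

# Cluster each Culture separately, then stitch the labelled pieces back together
parts = [cluster_group(g, features) for _, g in df.groupby('Culture')]
df['Cluster'] = pd.concat(parts)  # assignment aligns on the original row index
```

Because every piece keeps the index of its group, the assignment lines the labels up with the right rows even if the DataFrame is not sorted by Culture. Cluster numbers are arbitrary and only meaningful within one Culture.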
If you want distinct cluster numbers across different Cultures, you could assign a group number to each Culture, then use it to offset the cluster numbers:
def get_cluster(df_1):
    m = df_1.loc[:, ['H','mean','mean']].to_numpy()
    km = KMeans(n_clusters=3, init='random', max_iter=100, n_init=1, verbose=0).fit(m)
    # Shift the labels by 3 (the number of clusters) per group so they never overlap
    kmeans_predict = km.predict(m) + 3 * df_1['Culture_id'].iat[0]
    return pd.Series(kmeans_predict, index=df_1.index)

g = df.groupby('Culture')
df['Culture_id'] = g.ngroup()  # 0 for the first Culture, 1 for the second, ...
df['Cluster'] = g.apply(get_cluster).droplevel(0)
df = df.drop(columns=['Culture_id'])
Output:
ID Culture H S mean Cluster
0 12345 A 30 10 23 0
1 55689 A 42 76 78 1
2 56964 A 25 100 95 1
3 49649 A 32 23 52 0
4 89645 A 12 65 60 2
5 0001 A 10 94 76 2
6 033 B 4 67 68 3
7 03330 B 6 24 92 5
8 064963 B 5 67 34 4
9 306193 B 10 54 76 3
10 03661 B 24 87 34 4
11 1666 B 21 81 12 4
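The offset is plain arithmetic: with 3 clusters per group, group number g gets the labels 3*g, 3*g + 1 and 3*g + 2, so integer division by 3 recovers the group and the remainder recovers the cluster within it. A minimal sketch of that, assuming a loop with enumerate instead of a temporary Culture_id column, feature columns ['H', 'mean'], and a fixed random_state (none of these choices come from the original answer):

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({'Culture': ['A','A','A','A','A','A','B','B','B','B','B','B'],
                   'H': [30,42,25,32,12,10,4,6,5,10,24,21],
                   'S': [10,76,100,23,65,94,67,24,67,54,87,81],
                   'mean': [23,78,95,52,60,76,68,92,34,76,34,12]})

n_clusters = 3
features = ['H', 'mean']  # assumed feature columns, adjust as needed

parts = []
# groupby sorts the keys by default, so 'A' is group 0 and 'B' is group 1
for group_no, (_, g) in enumerate(df.groupby('Culture')):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(g[features].to_numpy())
    # Offset by n_clusters per group so labels are unique across Cultures
    parts.append(pd.Series(labels + n_clusters * group_no, index=g.index))

df['Cluster'] = pd.concat(parts)

# The global label can be decoded again if needed
df['group_no'] = df['Cluster'] // n_clusters      # which Culture (0, 1, ...)
df['local_cluster'] = df['Cluster'] % n_clusters  # cluster within that Culture
```

This relies on every group using the same n_clusters; if groups can have different numbers of clusters, a running offset would be needed instead of a fixed multiple.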