I would like to cluster the dataframe below separately for each month, using column X3. How can I do that?
import pandas as pd
df = pd.DataFrame({'Month': [1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3],
                   'X1': [10, 15, 24, 32, 8, 6, 10, 23, 24, 56, 45, 10, 56],
                   'X2': [12, 90, 20, 40, 10, 15, 30, 40, 60, 42, 2, 4, 10],
                   'X3': [34, 65, 34, 87, 100, 65, 78, 67, 34, 98, 96, 46, 76]})
Below is what I tried, but it is not working:
from sklearn.cluster import KMeans

cols = df.columns[3]

def cluster(X):
    k_means = KMeans(n_clusters=3).fit(X)
    return X.assign(clusters=k_means.labels_)

df['cluster_id'] = df.groupby('Month')[cols].apply(cluster)
Please help. Thank you.
KMeans from sklearn expects the features as a 2-D array of shape (n_samples, n_features), not the 1-D vector you passed, so you need to reshape your X into a 2-D array. Besides, if you want to rely on the group-by-combine mechanism, why not do the column indexing inside the applied function, since assigning the result of such an operation back to the dataframe is cumbersome?
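cols = df.columns[3]  # 'X3'

def cluster(X):
    X = X.copy()  # work on a copy so the group passed in by pandas is not modified in place
    # KMeans needs a 2-D array of shape (n_samples, n_features)
    feature = X[cols].to_numpy().reshape((len(X), 1))
    k_means = KMeans(n_clusters=3).fit(feature)
    X['cluster'] = k_means.labels_
    return X

# Depending on the pandas version, the result may carry 'Month' as an extra index level
df = df.groupby('Month').apply(cluster)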
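The shape requirement can be checked in isolation. The sketch below is only an illustration: the names x3 and as_column are made up here, and random_state=0 and n_init=10 are added just to make the run repeatable. It shows that a 1-D vector is rejected, while the same values reshaped to (n_samples, 1) are accepted.

```python
import numpy as np
from sklearn.cluster import KMeans

# X3 values for Month 1 from the question, as a 1-D vector
x3 = np.array([34, 65, 34, 87, 100, 65])

# A 1-D vector is rejected by scikit-learn's input validation
try:
    KMeans(n_clusters=3, n_init=10, random_state=0).fit(x3)
    rejected_1d = False
except ValueError:
    rejected_1d = True

# Reshape to a single column: shape (n_samples, 1)
as_column = x3.reshape(-1, 1)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit(as_column).labels_

print(rejected_1d)      # True
print(as_column.shape)  # (6, 1)
print(labels)           # one label per row; the label numbers are arbitrary
```

Identical X3 values always end up in the same cluster, but which number a cluster gets is not meaningful.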
You can use GroupBy.transform to form the cluster labels. The changes to your function are:
- reshaping the values to (n_samples, 1) so that sklearn is happy
- not assigning k_means.labels_ to anything directly in the function, but returning it for transform
So:
def cluster(X, n_clusters):
    k_means = KMeans(n_clusters=n_clusters).fit(X.values.reshape(-1, 1))
    return k_means.labels_

cols = pd.Index(["X3"])
df[cols + "_cluster_id"] = df.groupby("Month")[cols].transform(cluster, n_clusters=3)
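Here pd.Index is used instead of a Python list so that the string "_cluster_id" can be added to each element of cols.
This gives the following (the label numbers themselves are arbitrary and can differ between runs, since KMeans is initialized randomly):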
    Month  X1  X2   X3  X3_cluster_id
0       1  10  12   34              1
1       1  15  90   65              0
2       1  24  20   34              1
3       1  32  40   87              2
4       1   8  10  100              2
5       1   6  15   65              0
6       3  10  30   78              2
7       3  23  40   67              2
8       3  24  60   34              0
9       3  56  42   98              1
10      3  45   2   96              1
11      3  10   4   46              0
12      3  56  10   76              2
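The pd.Index behaviour used for naming the new column can be seen on its own. This is a minimal sketch, with a second column name "X1" added purely for illustration: adding a string to a pd.Index appends it to every element, whereas a plain Python list does not support the same operation.

```python
import pandas as pd

cols = pd.Index(["X3", "X1"])  # "X1" is included only for illustration

# Adding a string to an Index is applied element-wise
new_names = list(cols + "_cluster_id")
print(new_names)  # ['X3_cluster_id', 'X1_cluster_id']

# The same operation on a plain list raises TypeError
try:
    ["X3", "X1"] + "_cluster_id"
    list_raises = False
except TypeError:
    list_raises = True
print(list_raises)  # True
```

With a list you would have to build the names explicitly, for example with a list comprehension, which is why pd.Index is the shorter option here.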