Group by KMeans cluster in pandas dataframe

Question

I would like to cluster the dataframe below on column X3, separately for each month. How can I do that?

df = pd.DataFrame({'Month': [1,1,1,1,1,1,3,3,3,3,3,3,3], 'X1': [10,15,24,32,8,6,10,23,24,56,45,10,56],
                   'X2': [12,90,20,40,10,15,30,40,60,42,2,4,10], 'X3': [34,65,34,87,100,65,78,67,34,98,96,46,76]})

Below is what I tried, but it is not working:

cols=df.columns[3]

def cluster(X):
    k_means = KMeans(n_clusters=3).fit(X)
    return X.assign(clusters=k_means.labels_)

df['cluster_id'] = df.groupby('Month')[cols].apply(cluster)

Please help. Thank you.

Answer 1

sklearn's KMeans expects the features as a 2-D array of shape (n_samples, n_features), not the 1-D vector you passed, so you need to reshape X. Also, if you want to rely on the group-by-combine mechanism, it is simpler to index the column inside the applied function and return the whole group, because assigning the result of such an operation back to a single column is cumbersome.

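from sklearn.cluster import KMeans

cols = df.columns[3]  # 'X3'

def cluster(X):
    # KMeans needs shape (n_samples, 1), not a 1-D vector
    feature = X[cols].to_numpy().reshape((len(X), 1))
    k_means = KMeans(n_clusters=3).fit(feature)
    # assign returns a copy, so the group passed in by apply is not mutated
    return X.assign(cluster=k_means.labels_)

# group_keys=False keeps the original row index on the result
df = df.groupby('Month', group_keys=False).apply(cluster)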
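
To make the shape requirement concrete, here is a minimal sketch using only the Month 1 values of X3 from the question. The `n_init` and `random_state` arguments are assumptions added here so that repeated runs behave the same; they are not part of the answer above.

```python
import numpy as np
from sklearn.cluster import KMeans

x3 = np.array([34, 65, 34, 87, 100, 65])  # Month 1 values of X3

# A 1-D vector of shape (6,) is rejected by KMeans.fit with a ValueError.
one_d_failed = False
try:
    KMeans(n_clusters=3, n_init=10, random_state=0).fit(x3)
except ValueError:
    one_d_failed = True

# Reshape to one row per sample and a single feature column: shape (6, 1).
features = x3.reshape(-1, 1)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features).labels_

print(one_d_failed, features.shape, labels)
```

Rows holding the same value (the two 34s, the two 65s) end up in the same cluster, but the label numbers themselves are arbitrary.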

Answer 2

You can use GroupBy.transform to form the cluster labels. Changes to your function are:

- Reshaping the incoming column values to (n_samples, 1) so that sklearn is happy
- Not assigning the resultant k_means.labels_ to anything directly in the function, but returning it for transform

So

def cluster(X, n_clusters):
    k_means = KMeans(n_clusters=n_clusters).fit(X.values.reshape(-1, 1))
    return k_means.labels_

cols = pd.Index(["X3"])
df[cols + "_cluster_id"] = df.groupby("Month")[cols].transform(cluster, n_clusters=3)

where we are using pd.Index instead of a Python list to ease the addition of the string "_cluster_id" to each element of cols.

to get

    Month  X1  X2   X3  X3_cluster_id
0       1  10  12   34              1
1       1  15  90   65              0
2       1  24  20   34              1
3       1  32  40   87              2
4       1   8  10  100              2
5       1   6  15   65              0
6       3  10  30   78              2
7       3  23  40   67              2
8       3  24  60   34              0
9       3  56  42   98              1
10      3  45   2   96              1
11      3  10   4   46              0
12      3  56  10   76              2
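
KMeans starts from a random initialization, so the label numbers in a given run may differ from the table above even when the groupings are the same. The following self-contained sketch of the transform approach fixes `random_state` and `n_init` (both assumptions added here, not part of the answer) and checks that each month is split into three clusters. It uses a plain column name instead of pd.Index, since only one column is clustered.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "Month": [1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3],
    "X3": [34, 65, 34, 87, 100, 65, 78, 67, 34, 98, 96, 46, 76],
})

def cluster(X, n_clusters):
    # X is one month's X3 values as a Series; reshape to (n_samples, 1).
    # n_init and random_state are assumptions added for repeatable runs.
    k_means = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return k_means.fit(X.to_numpy().reshape(-1, 1)).labels_

# transform calls cluster once per month and aligns the labels to the rows.
df["X3_cluster_id"] = df.groupby("Month")["X3"].transform(cluster, n_clusters=3)

# Each month is clustered independently, so each has its own labels 0..2.
clusters_per_month = df.groupby("Month")["X3_cluster_id"].nunique()
print(df)
print(clusters_per_month)
```

Because the months are fitted separately, label 0 in month 1 and label 0 in month 3 are unrelated clusters.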

solution1
1 2021-05-24 07:40:47

solution2
1 ACCEPTED 2021-05-24 08:20:09
