I'm using KMeans to cluster records in a data set based on one column, cards, which is an int. However, the cluster labels are returned in a non-intuitive order (which is expected, since it's an unsupervised algorithm).
To make the output more intelligible for my colleagues, I would like to remap the labels to the order of the cluster_centers_.
I created a DataFrame where index is the KMeans-generated label and set_size is the intended new label (the rows having been sorted on the min column), but I'm stuck on the last leg of the puzzle.
How do I remap the cluster_df['set_size'] values to all_sets_df['set_size'] where all_sets_df['cluster'] == cluster_df['index']?
I've tried variations on apply, lambda, map, using a dict, but for some reason I get np.nan in 2/3 of the results (if it works at all). I feel like this is really obvious, but for some reason I can't get it to work.
import pandas as pd
from sklearn.cluster import KMeans

# Assign optimal clusters to all_sets_df.set_size column
print('Assigning sets to clusters...', end='')
# KMeans expects a 2-D array, so reshape the single column to (n_samples, 1)
X = all_sets_df.cards.values.reshape(-1, 1)
n_clusters = 3
km = KMeans(n_clusters=n_clusters, init='k-means++', n_init=10)
all_sets_df['cluster'] = km.fit_predict(X)
# One row per KMeans label, sorted by the smallest cards value in each cluster
cluster_df = pd.DataFrame.from_dict(
    {_i: {'set_size': _i,
          'min': all_sets_df.cards[all_sets_df.cluster == _i].min(),
          'max': all_sets_df.cards[all_sets_df.cluster == _i].max()}
     for _i in range(n_clusters)},
    orient='index').sort_values(by='min').reset_index()
# After sorting, the row position is the intended new label
cluster_df['set_size'] = range(len(cluster_df.set_size))
print('done.\n')
print(cluster_df.loc[:, ['index', 'set_size', 'min', 'max']].to_string(index=False))
Output:
Assigning sets to clusters...done.
 index  set_size  min  max
     2         0    1  100
     0         1  113  230
     1         2  244  449
Thank you for your help.
I needed to change the line
_f = lambda x: cluster_df['set_size'][cluster_df.index == x].values
to
_f = lambda x: cluster_df['set_size'][cluster_df['index'] == x].values[0]
because the original version compared x against the DataFrame's own row index (0, 1, 2 after reset_index) rather than the 'index' column that holds the KMeans label, so every cluster label was simply mapped back to itself. Also, .values returns an array of length 1 rather than a scalar, so [0] needed to be added to the end of the function.
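To make the difference between the two lambdas concrete, here is a minimal sketch that rebuilds cluster_df by hand from the output printed in the question. The names kmeans_label, by_row_index and by_column are made up for illustration and do not appear in the original code.

```python
import pandas as pd

# cluster_df rebuilt by hand from the output printed in the question
cluster_df = pd.DataFrame({
    'index': [2, 0, 1],      # label assigned by KMeans
    'set_size': [0, 1, 2],   # intended label, ordered by 'min'
    'min': [1, 113, 244],
    'max': [100, 230, 449],
})

kmeans_label = 2  # hypothetical label to look up

# Compares against the row index (0, 1, 2), so label 2 selects row 2
by_row_index = cluster_df['set_size'][cluster_df.index == kmeans_label].values

# Compares against the 'index' column, so label 2 selects the row for cluster 2
by_column = cluster_df['set_size'][cluster_df['index'] == kmeans_label].values

print(by_row_index)  # [2]
print(by_column)     # [0]
```

Both lookups return a one-element array, which is why the trailing [0] is needed to get a plain value out.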
Here is the final code block that ended up working for me.
_f = lambda x: cluster_df['set_size'][cluster_df['index'] == x].values[0]
all_sets_df['set_size'] = all_sets_df['cluster'].map(_f)
all_sets_df = all_sets_df.drop('cluster', axis=1)
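As a possible alternative that skips the intermediate cluster_df, the new labels can be derived directly from the order of the cluster centres. This is only a sketch: centers and labels below are hypothetical stand-ins for km.cluster_centers_ and the km.fit_predict(X) result, chosen to match the min/max table in the question, and order and rank are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for km.cluster_centers_ and km.fit_predict(X)
centers = np.array([[170.0], [340.0], [50.0]])
labels = np.array([2, 0, 1, 1, 2, 0])

# order[k] is the KMeans label of the k-th smallest centre
order = np.argsort(centers.ravel())

# rank[old_label] is the new label, i.e. the inverse permutation of order
rank = np.empty_like(order)
rank[order] = np.arange(len(order))

all_sets_df = pd.DataFrame({'cluster': labels})
# Index the lookup array with the old labels to get the new ones
all_sets_df['set_size'] = rank[all_sets_df['cluster'].to_numpy()]

print(all_sets_df['set_size'].tolist())  # [0, 1, 2, 2, 0, 1]
```

With a single feature, sorting by centre gives the same order as sorting by each cluster's minimum, so this should agree with the min-based table above.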