Remap kmeans labels_ based on sorted cluster_centers_

I'm using KMeans to cluster records in a data set based on a single column, cards, which is an int. However, the cluster labels come back in a non-intuitive order (which is expected, since it's an unsupervised algorithm).

To make the output more intelligible for my colleagues, I would like to remap the labels so that they follow the sorted order of cluster_centers_.

I created a DataFrame, cluster_df, in which the index column holds the KMeans-generated label and set_size holds the intended new label (assigned after sorting on the min column), but I'm stuck on the last piece of the puzzle.

How do I copy the cluster_df['set_size'] values into all_sets_df['set_size'] wherever all_sets_df['cluster'] == cluster_df['index']?

I've tried variations on apply, lambda, map, and a dict, but for some reason I get np.nan in two-thirds of the results (if it works at all). I feel like this should be obvious, but I can't get it to work.

# Assign optimal clusters to the all_sets_df.set_size column
import pandas as pd
from sklearn.cluster import KMeans

print('Assigning sets to clusters...', end='')
# Series.reshape is not available in current pandas; reshape the underlying array
X = all_sets_df.cards.values.reshape(-1, 1)

n_clusters = 3
km = KMeans(n_clusters=n_clusters, init='k-means++', n_init=10)
all_sets_df['cluster'] = km.fit_predict(X)

# One row per KMeans label, sorted by the smallest cards value in each cluster
cluster_df = pd.DataFrame.from_dict(
    {_i: {'set_size': _i,
          'min': all_sets_df.cards[all_sets_df.cluster == _i].min(),
          'max': all_sets_df.cards[all_sets_df.cluster == _i].max()}
     for _i in range(n_clusters)},
    orient='index').sort_values(by='min').reset_index()

# After sorting, the row position is the intended new label
cluster_df['set_size'] = range(len(cluster_df))

print('done.\n')
# .ix has been removed from pandas; select the columns directly
print(cluster_df[['index', 'set_size', 'min', 'max']].to_string(index=False))

Output:

Assigning sets to clusters...done.

index  set_size  min  max
    2         0    1  100
    0         1  113  230
    1         2  244  449

Thank you for your help.

I needed to change the line

_f = lambda x: cluster_df['set_size'][cluster_df.index == x].values

to

_f = lambda x: cluster_df['set_size'][cluster_df['index'] == x].values[0]

The original lambda compared x against the DataFrame's actual index (the row positions 0 to 2 left behind by reset_index()) rather than against the index column, which holds the KMeans cluster labels. Also, .values returns a NumPy array of length 1, so [0] had to be appended to the expression to extract the scalar.
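To make the difference concrete, here is a minimal sketch that rebuilds cluster_df from the table printed in the question and runs both lambdas over a short, made-up series of KMeans labels. The cluster series and the names by_position and by_label are assumptions for illustration only; they are not part of the original code.

```python
import pandas as pd

# cluster_df rebuilt from the output table in the question.
cluster_df = pd.DataFrame({
    'index': [2, 0, 1],      # labels assigned by KMeans
    'set_size': [0, 1, 2],   # intended labels, ordered by 'min'
    'min': [1, 113, 244],
    'max': [100, 230, 449],
})

# Hypothetical KMeans labels for five records.
cluster = pd.Series([2, 0, 1, 2, 0])

# Comparing against cluster_df.index matches row positions (0, 1, 2),
# so every label is mapped back to itself.
by_position = cluster.map(
    lambda x: cluster_df['set_size'][cluster_df.index == x].values[0])

# Comparing against the 'index' column matches the stored KMeans labels.
by_label = cluster.map(
    lambda x: cluster_df['set_size'][cluster_df['index'] == x].values[0])

print(by_position.tolist())  # [2, 0, 1, 2, 0]
print(by_label.tolist())     # [0, 1, 2, 0, 1]
```

With the positional comparison the labels come out unchanged, which is why the first version appeared to run but did not reorder anything.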

Here is the final code block that ended up working for me.

_f = lambda x: cluster_df['set_size'][cluster_df['index'] == x].values[0]
all_sets_df['set_size'] = all_sets_df['cluster'].map(_f)
all_sets_df = all_sets_df.drop('cluster', axis=1)
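As a possible alternative that skips the intermediate cluster_df, the new labels could be derived directly from the sorted cluster centers with NumPy. This is only a sketch under stated assumptions: centers and labels below are made-up stand-ins for km.cluster_centers_ and km.labels_, and lookup is a hypothetical helper array, not something KMeans provides.

```python
import numpy as np

# Hypothetical stand-ins for km.cluster_centers_ (shape (3, 1)) and km.labels_.
centers = np.array([[170.0], [340.0], [50.0]])
labels = np.array([2, 0, 1, 2, 0])

# order[i] is the KMeans label of the cluster with the i-th smallest center.
order = np.argsort(centers.ravel())

# Invert the permutation so that lookup[old_label] gives the new label.
lookup = np.empty_like(order)
lookup[order] = np.arange(len(order))

# Fancy indexing remaps every record in one step.
remapped = lookup[labels]
print(remapped.tolist())  # [0, 1, 2, 0, 1]
```

Applied to the code above, this would presumably look like all_sets_df['set_size'] = lookup[all_sets_df['cluster'].values], with lookup built from km.cluster_centers_. Note that it orders clusters by their centers rather than by the min column; for a single feature the two orderings should agree.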
