
How to make a scatter plot for KMeans clusters and print the outliers

I'm working with the Scikit-Learn KMeans model.

This is the code I have implemented, where I have created 3 clusters (0, 1, 2):

import pandas as pd
from scipy.spatial.distance import euclidean
from sklearn.cluster import KMeans

df = pd.read_csv(r'1.csv', index_col=None)
dummies = pd.get_dummies(data=df)
# Remember the feature columns before any derived columns are added.
feature_cols = list(dummies.columns)
km = KMeans(n_clusters=3).fit(dummies)
dummies['cluster_id'] = km.labels_

def distance_to_centroid(row, centroid):
    # Use only the features KMeans was fitted on, not cluster_id or the distance columns.
    return euclidean(row[feature_cols].astype(float), centroid)

# One distance column per centroid: distance_to_center0, distance_to_center1, distance_to_center2.
for i, centroid in enumerate(km.cluster_centers_):
    dummies[f'distance_to_center{i}'] = dummies.apply(
        lambda r: distance_to_centroid(r, centroid), axis=1)

dummies.head()

This is a sample of the data set that I am using:

    id,product,store,revenue,store_capacity,state
    1,Ball,AB,222,1000,CA
    1,Pen,AB,234,1452,WD
    2,Books,CD,543,888,MA
    2,Ink,EF,123,9865,NY

  • How can I create a scatter plot for the clusters?
  • How can I get and print the outliers (the points away from the cluster)?

To create a scatter plot for the clusters, you just need to color each point by its cluster. Take the following code as an example:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns

# Ten random 2-D points stand in for real data.
df = pd.DataFrame(np.random.rand(10, 2), columns=["A", "B"])
km = KMeans(n_clusters=3).fit(df)
df['cluster_id'] = km.labels_
# Map each cluster label to a color.
dic = {0: "Blue", 1: "Red", 2: "Green"}
sns.scatterplot(x="A", y="B", data=df, hue="cluster_id", palette=dic)
plt.show()

Output (the data is random, so your plot will differ):

[image: scatter plot of the points colored by cluster]

hue colors the points by their 'cluster_id' value, which here is the cluster each point belongs to. palette only controls which color each cluster gets; it uses the dic mapping defined one line earlier.

Your data has more than two features, and a 6-dimensional scatter plot cannot be drawn directly. You can do one of the following:

  1. Select only 2 features and show them (feature selection)
  2. Reduce the dimensions with PCA, t-SNE, or another algorithm and use the new features for the scatter plot (feature extraction)
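
As a rough illustration of the second option, the sketch below clusters on all features and then projects them to two dimensions with PCA purely for plotting. The synthetic six-column frame and the names features and coords are assumptions made for this example; they stand in for your encoded data.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Assumption: a synthetic frame with 6 numeric features stands in for the real data.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.random((60, 6)), columns=list("ABCDEF"))

# Cluster on all six features.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)

# Project the features onto the first two principal components (for plotting only).
coords = PCA(n_components=2).fit_transform(features)

# Scatter plot of the projected points, colored by cluster label.
fig, ax = plt.subplots()
points = ax.scatter(coords[:, 0], coords[:, 1], c=km.labels_, cmap="viridis")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.legend(*points.legend_elements(), title="cluster_id")
```

Call plt.show() or fig.savefig(...) afterwards to see the figure. Clusters that are well separated in the full feature space can overlap in the 2-D projection, so treat the plot as a visual aid only.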

As for your second question, it depends on how you define "outliers". There is no single definition; it depends on the case. After running KMeans, every point is assigned to a cluster. KMeans never tells you "I'm not sure about that point, it's probably an outlier". Once you decide on a definition of an outlier (e.g. "distance from center > 3"), you check each point against it and print the ones that match.
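
Here is a minimal sketch of one such definition, assuming an outlier is any point whose distance to its own cluster center is more than two standard deviations above the mean distance. That rule, the synthetic data, and the column name distance_to_own_center are all assumptions for illustration; swap in whatever definition fits your case.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Assumption: random 2-D points stand in for the real feature columns.
rng = np.random.default_rng(0)
feature_cols = ["A", "B"]
df = pd.DataFrame(rng.random((50, 2)), columns=feature_cols)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df[feature_cols])
df["cluster_id"] = km.labels_

# km.transform gives the distance from every point to every centroid,
# as an array of shape (n_samples, n_clusters).
all_distances = km.transform(df[feature_cols])

# Keep only the distance to the centroid each point was assigned to.
df["distance_to_own_center"] = all_distances[np.arange(len(df)), km.labels_]

# Hypothetical rule: flag points farther than mean + 2 * std of those distances.
threshold = (df["distance_to_own_center"].mean()
             + 2 * df["distance_to_own_center"].std())
outliers = df[df["distance_to_own_center"] > threshold]
print(outliers)
```

A fixed cutoff such as "distance > 3" works the same way: replace threshold with the constant. Whether any rows get flagged depends entirely on the data and the rule you pick.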

If I misunderstood either of your questions, please clarify. The more precise you are about what you're trying to do, the easier it is for the community to help you.
