
How to scatter plot for Kmeans and print the outliers

I'm working with the Scikit-Learn KMeans model.

This is the code I have implemented, where I have created 3 clusters (0, 1, 2):

import pandas as pd
from scipy.spatial.distance import euclidean  # assumed source of euclidean()
from sklearn.cluster import KMeans

df = pd.read_csv(r'1.csv', index_col=None)
dummies = pd.get_dummies(data=df)
km = KMeans(n_clusters=3).fit(dummies)
dummies['cluster_id'] = km.labels_

def distance_to_centroid(row, centroid):
    # Keep only the columns the model was fitted on (drops cluster_id and
    # any distance columns added later).
    row = row[['id', 'product', 'store', 'revenue','store_capacity', 'state_AL', 'state_CA', 'state_CH',
       'state_WD', 'country_India', 'country_Japan', 'country_USA']]
    return euclidean(row, centroid)

# Distance from every row to each of the three centroids.
dummies['distance_to_center0'] = dummies.apply(lambda r: distance_to_centroid(r,
    km.cluster_centers_[0]), axis=1)

dummies['distance_to_center1'] = dummies.apply(lambda r: distance_to_centroid(r,
    km.cluster_centers_[1]), axis=1)

dummies['distance_to_center2'] = dummies.apply(lambda r: distance_to_centroid(r,
    km.cluster_centers_[2]), axis=1)

dummies.head()

This is a sample of the data set that I am using:

    id,product,store,revenue,store_capacity,state
    1,Ball,AB,222,1000,CA
    1,Pen,AB,234,1452,WD
    2,Books,CD,543,888,MA
    2,Ink,EF,123,9865,NY
  • How can I create a scatter plot for the clusters?
  • How can I get and print the outliers (the points far away from their cluster)?

To create a scatter plot for the clusters, you just need to color each point by its cluster. Take the following code as an example:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns

# Ten random 2-D points, clustered into three groups.
df = pd.DataFrame(np.random.rand(10,2), columns=["A", "B"])
km = KMeans(n_clusters=3).fit(df[["A", "B"]])
df['cluster_id'] = km.labels_

# Map each cluster label to a color and let seaborn color the points by label.
dic = {0:"Blue", 1:"Red", 2:"Green"}
sns.scatterplot(x="A", y="B", data=df, hue="cluster_id", palette = dic)
plt.show()

Output (the points are random, so your plot will differ):

[image: scatter plot of the points, colored by cluster]

hue splits the points by their 'cluster_id' value, which in our case means by cluster. palette only controls the colors, using the dic mapping defined one line earlier.

Your data has more than two features. As you know, we cannot draw a 6-dimensional scatter plot. You can do one of the following:

  1. Select only 2 features and plot them (feature selection)
  2. Reduce the dimensions with PCA, t-SNE or another algorithm and plot the new features (feature extraction)
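
Option 2 could look roughly like the sketch below. It is only an illustration under assumptions: the `features` frame is random stand-in data rather than your CSV, the column names `f0`..`f5` are made up, and scaling the features before clustering is my own choice, not something your code does.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the one-hot encoded frame: 60 rows, 6 numeric features.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.random((60, 6)), columns=[f"f{i}" for i in range(6)])

# Assumption: scale first so a large-valued column (e.g. revenue) does not dominate.
scaled = StandardScaler().fit_transform(features)

# Cluster in the full 6-dimensional space.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)

# Project the same scaled data onto its first two principal components,
# purely for plotting.
coords = PCA(n_components=2).fit_transform(scaled)

fig, ax = plt.subplots()
points = ax.scatter(coords[:, 0], coords[:, 1], c=km.labels_, cmap="viridis")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.legend(*points.legend_elements(), title="cluster_id")
# Call plt.show() to display the figure when running this as a script.
```

The clusters are computed on all features and PCA is used only for display, so clusters that overlap in the 2-D picture may still be well separated in the original space.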

As for your second question, it depends on how you define "outliers". There is no single definition; it depends on the case. After running KMeans, every point is assigned to a cluster. KMeans never tells you "well, I'm not sure about that point, it's probably an outlier". Once you decide on a definition of an outlier (e.g. "distance from center > 3"), you just check whether each point meets it and print the ones that do.
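
As a minimal sketch of that idea, assuming the outlier rule is "distance to the point's own centroid is above the 95th percentile of all such distances" (the rule, the random data and the names `feature_cols`, `distance_to_center` and `threshold` are all just assumptions for illustration; substitute your own definition):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data: 50 random 2-D points.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((50, 2)), columns=["A", "B"])
feature_cols = ["A", "B"]

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df[feature_cols])
df["cluster_id"] = km.labels_

# Look up each point's assigned centroid, then take the Euclidean distance to it.
own_centers = km.cluster_centers_[km.labels_]
df["distance_to_center"] = np.linalg.norm(
    df[feature_cols].to_numpy() - own_centers, axis=1
)

# Assumed outlier rule: farther from its centroid than 95% of all points.
threshold = df["distance_to_center"].quantile(0.95)
outliers = df[df["distance_to_center"] > threshold]

print(outliers)
```

A fixed cutoff such as `threshold = 3` works the same way; only the line that sets `threshold` changes.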

If I misunderstood any of your questions, please clarify. The more precise you are about what you're trying to do, the easier it is for the community to help you.
