
Adding labels to Cluster

I'm new to R and am attempting to cluster some data based on industry. I have learned that k-means cannot handle factors and categorical data. I have removed the factor called 'Industry' (67 distinct observations) from my dataset, but I would like to assign each observation its label once the model is finished. Essentially, I would like my end result to look like the sample US Crime dataset. Any assistance would be greatly appreciated.

My results:

[image]

My ideal result:

[image]

Code:

library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
library(ggplot2) ## used for plotting
library(gridExtra) ## used for plotting
library(robustbase)

###Read in dataset
df <- read.csv('my_data')
df2 <- scale(df)

### Subset of Data -- looking at percentage closed won and total opportunities
dat = df2[,c(1,3)]

# initial cluster split
k2 <- kmeans(dat, centers = 2, nstart = 25)
str(k2)
k2
fviz_cluster(k2, data = dat)

### Additional Plots
k3 <- kmeans(dat, centers = 3, nstart = 25)
k4 <- kmeans(dat, centers = 4, nstart = 25)
k5 <- kmeans(dat, centers = 5, nstart = 25)

# comparing plots
p1 <- fviz_cluster(k2, geom = "point", data = dat) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = dat) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = dat) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = dat) + ggtitle("k = 5")

grid.arrange(p1, p2, p3, p4, nrow = 2)

## Computing gap statistics
set.seed(123)
# use the scaled subset `dat` so the gap statistic is computed on the same data that is clustered
gap_stat <- clusGap(dat, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)

## Visualization
fviz_gap_stat(gap_stat)

# Compute k-means clustering with k = 4
set.seed(123)
final <- kmeans(dat, 4, nstart = 25)
print(final)

## final visualization
fviz_cluster(final, data = dat)

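I think all you need to do is:

rownames(df) <- df$Industry

Do this while the Industry column is still in the data frame, then drop that column, scale, and subset. scale() and column subsetting both keep the row names, so the industry name will be on the cluster plot instead of row numbers.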
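A minimal sketch of the row-name approach follows. The data frame here is made up for illustration (the column names `pct_closed_won`, `avg_deal_size`, and `total_opps` are assumptions, not the asker's actual columns); only `Industry` comes from the question. It also shows one way to get the labels back as a table, relying on the fact that `kmeans()` names its cluster vector after the row names of its input.

```r
# Hypothetical stand-in for the asker's data: one label column plus numeric columns.
set.seed(123)
df <- data.frame(
  Industry       = paste("Industry", 1:12),
  pct_closed_won = runif(12),
  avg_deal_size  = rnorm(12, mean = 50, sd = 10),
  total_opps     = rpois(12, lambda = 40)
)

# Use the label column as row names, then drop it so only numeric columns remain.
rownames(df) <- df$Industry
num <- df[, setdiff(names(df), "Industry")]

# scale() returns a matrix that keeps the row names; so does column subsetting.
dat <- scale(num)[, c(1, 3)]

km <- kmeans(dat, centers = 3, nstart = 25)

# The cluster vector is named by industry, so labels can be attached afterwards.
labelled <- data.frame(
  Industry  = names(km$cluster),
  cluster   = unname(km$cluster),
  row.names = NULL
)
head(labelled)

# With factoextra installed, the plot labels points by row name:
# factoextra::fviz_cluster(km, data = dat)
```

Row names must be unique, so this works as long as each industry appears only once; with repeated industries, keeping `Industry` as a separate column and binding `km$cluster` to it afterwards may be the safer route.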
