How to create a hierarchical cluster using categorical and numerical data in R?
I want to create a hierarchical cluster to show types of careers and the bank account balances of the people in those careers. I have a dataset with two variables, job and balance:
job balance
1 unemployed 1787
2 services 4789
3 management 1350
4 management 1476
5 blue-collar 0
6 management 747
7 self-employed 307
8 technician 147
9 entrepreneur 221
10 services -88
I want the result to look like this (image not preserved),
where A, B, C, etc. are the job categories.
Can anyone help me get started?
I have no idea how to begin.
Thanks!
You can start by using the dist and hclust functions.
df <- read.table(text = " job balance
1 unemployed 1787
2 services 4789
3 management 1350
4 management 1476
5 blue-collar 0
6 management 747
7 self-employed 307
8 technician 147
9 entrepreneur 221
10 services -88")
dist computes the distance between each pair of elements (by default, the Euclidean distance):
distances <- dist(df$balance)
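To see what dist returns, here is a minimal sketch using only the first three balances from the question. The variable names balances and pairwise are illustrative and not part of the original answer.

```r
# First three balances from the question (unemployed, services, management)
balances <- c(1787, 4789, 1350)

# dist() returns the lower triangle of the pairwise distance matrix;
# with a single numeric variable, each distance is the absolute difference
pairwise <- dist(balances)

# Convert to a full symmetric matrix to inspect individual entries
m <- as.matrix(pairwise)
m[2, 1]  # |4789 - 1787| = 3002
m[3, 1]  # |1350 - 1787| = 437
m[3, 2]  # |1350 - 4789| = 3439
```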
You can then cluster your values using the distance matrix generated above:
clusters <- hclust(distances)
By default, hclust applies complete-linkage clustering to your data. Finally, you can plot your results as a tree:
plot(clusters, labels = df$job)
Here, we clustered every row in your data frame, which is why some jobs appear more than once. If you want a single value per job, you can, for example, take the mean balance for each job using tapply:
means <- tapply(df$balance, df$job, mean)
And then cluster the jobs:
distances <- dist(means)
clusters <- hclust(distances)
plot(clusters)
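If you also want the group memberships rather than only the picture, one option is cutree, which cuts the tree into a chosen number of groups. The sketch below rebuilds the per-job means from the question's data so that it runs on its own. The choice of k = 2 is an arbitrary assumption for illustration, not something the question specifies.

```r
# Data from the question, entered as plain vectors so the sketch is self-contained
job <- c("unemployed", "services", "management", "management", "blue-collar",
         "management", "self-employed", "technician", "entrepreneur", "services")
balance <- c(1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88)

# One mean balance per job, then complete-linkage clustering (the hclust default)
means <- tapply(balance, job, mean)
clusters <- hclust(dist(means))

# Cut the tree into k groups; k = 2 is an illustrative choice
groups <- cutree(clusters, k = 2)

# Jobs listed by group
split(names(groups), groups)
```

With these ten rows, the low-balance jobs (blue-collar, technician, entrepreneur, self-employed) end up in one group and the remaining jobs in the other. Given how small the sample is, treat the grouping as a demonstration rather than a finding.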
You can then try other distance measures or other clustering algorithms (see help(dist) and help(hclust) for other methods).
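As a rough sketch of what trying other methods can look like, the code below uses the method arguments of dist and hclust. One caveat: with a single numeric variable, the Euclidean and Manhattan distances are both just the absolute difference, so changing the linkage method is what actually alters the tree here. The variable names are illustrative.

```r
# Data from the question, entered as plain vectors so the sketch is self-contained
job <- c("unemployed", "services", "management", "management", "blue-collar",
         "management", "self-employed", "technician", "entrepreneur", "services")
balance <- c(1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88)
means <- tapply(balance, job, mean)

# Two distance measures; for one variable they coincide
d_euc <- dist(means)                        # default: method = "euclidean"
d_man <- dist(means, method = "manhattan")
same_distances <- isTRUE(all.equal(as.vector(d_euc), as.vector(d_man)))

# Two alternative linkage methods (the default is "complete")
avg <- hclust(d_euc, method = "average")
sgl <- hclust(d_euc, method = "single")

# Draw the resulting trees for comparison
plot(avg, main = "Average linkage")
plot(sgl, main = "Single linkage")
```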