
How to create a hierarchical cluster using categorical and numerical data in R?

I want to create a hierarchical cluster that shows types of careers and the bank account balances of the people in those careers. I have a dataset with two variables, job and balance:

              job balance
1       unemployed    1787
2         services    4789
3       management    1350
4       management    1476
5      blue-collar       0
6       management     747
7    self-employed     307
8       technician     147
9     entrepreneur     221
10        services     -88

I want the result to look like this:

[image]

where A, B, C, etc. are the job categories.

Can anyone help me get started on this?

I have no idea how to begin.

Thanks!

You can start by using the dist and hclust functions.

df <- read.table(text = "              job balance
1       unemployed    1787
2         services    4789
3       management    1350
4       management    1476
5      blue-collar       0
6       management     747
7    self-employed     307
8       technician     147
9     entrepreneur     221
10        services     -88")

dist computes the distance between each pair of elements (by default, the Euclidean distance):

distances <- dist(df$balance)
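To make the output of dist concrete, here is a minimal sketch on the first three balances from the question. The variable names balance, d and m are only illustrative. With a single numeric variable, the Euclidean distance reduces to the absolute difference between two values.

```r
# First three balances from the question's data
balance <- c(1787, 4789, 1350)

# dist() returns a compact "dist" object that stores only the lower triangle
d <- dist(balance)

# as.matrix() expands it into a full symmetric matrix for inspection
m <- as.matrix(d)
print(m)

# With one numeric variable, the Euclidean distance is |x_i - x_j|
stopifnot(m[2, 1] == abs(4789 - 1787))
```

The full matrix is only useful for inspection; hclust expects the compact dist object itself.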

You can then cluster your values using the distance matrix generated above:

clusters <- hclust(distances)

By default, hclust applies complete-linkage clustering to your data. Finally, you can plot the result as a tree:

plot(clusters, labels = df$job)
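If you need explicit group memberships in addition to the tree, cutree from the stats package cuts the dendrogram into a chosen number of groups. The sketch below rebuilds the data frame so that it runs on its own. The choice of k = 3 is an arbitrary assumption for illustration and is not something the data dictates.

```r
# Rebuild the question's data so the example is self-contained
df <- data.frame(
  job = c("unemployed", "services", "management", "management", "blue-collar",
          "management", "self-employed", "technician", "entrepreneur", "services"),
  balance = c(1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88)
)

# Complete-linkage clustering on the balances, as in the steps above
clusters <- hclust(dist(df$balance))

# Cut the tree into k groups (k = 3 is an arbitrary illustrative choice)
groups <- cutree(clusters, k = 3)

# Show each row next to the group it was assigned to
print(data.frame(job = df$job, balance = df$balance, group = groups))
```

You can also cut at a given height instead of a group count by passing h to cutree.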

Here, we clustered every entry in your data frame, which is why some jobs appear more than once. If you want a single value per job, you can, for example, take the mean balance for each job using tapply:

means <- tapply(df$balance, df$job, mean)

And then cluster the jobs:

distances <- dist(means)
clusters <- hclust(distances)
plot(clusters)
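Put together, the per-job version can be sketched as below. Since tapply returns a vector named by job, those names carry through dist and hclust and become the leaf labels, so plot needs no labels argument. The horizontal layout and the margin values are assumptions chosen so that long job names fit; adjust them to taste.

```r
# Rebuild the question's data so the example is self-contained
df <- data.frame(
  job = c("unemployed", "services", "management", "management", "blue-collar",
          "management", "self-employed", "technician", "entrepreneur", "services"),
  balance = c(1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88)
)

# One mean balance per job; the result is named by job
means <- tapply(df$balance, df$job, mean)
print(means)

# Cluster the per-job means; the job names become the leaf labels
clusters <- hclust(dist(means))
print(clusters$labels)

# Horizontal dendrogram with a wider right margin for the labels
# (margin values are an illustrative guess)
par(mar = c(4, 2, 1, 8))
plot(as.dendrogram(clusters), horiz = TRUE)
```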

You can then try other distance measures or other clustering algorithms (see help(dist) and help(hclust) for other methods).
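As a rough sketch of such a comparison, the code below fits three of the linkage methods listed in help(hclust) to the same distances and compares them by cophenetic correlation, that is, how closely each tree's merge heights track the original distances. Using the cophenetic correlation as the yardstick is an assumption made here for illustration, not part of the approach above. Note that with a single numeric variable the "euclidean" and "manhattan" distances coincide, so changing the dist method only matters once you cluster on several variables.

```r
# Balances from the question's data
balance <- c(1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88)

d <- dist(balance)  # Euclidean by default

# With one variable, Manhattan and Euclidean distances are identical
stopifnot(isTRUE(all.equal(as.vector(d),
                           as.vector(dist(balance, method = "manhattan")))))

# Fit several linkage methods to the same distances
methods <- c("complete", "average", "single")
fits <- lapply(methods, function(m) hclust(d, method = m))
names(fits) <- methods

# Cophenetic correlation: agreement between tree heights and the distances
coph <- sapply(fits, function(h) cor(d, cophenetic(h)))
print(round(coph, 3))
```

A higher value means the tree preserves the pairwise distances more faithfully, but it says nothing about whether the resulting groups are meaningful for your question.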
