Clustering a binary matrix in R

Question

I have a binary matrix relating two variables (persons and hobbies). Is there a way to cluster a binary matrix in R? If so, which algorithm should I use?

The matrix looks like this:

        hobby1  hobby2  hobby3  hobby4
person1   1       0       0       1
person2   0       1       0       1
person3   1       1       1       0
person4   0       1       1       1

I want to cluster the persons by the hobbies they have in common. What is the best method to do this?

Thanks

Answer 1

How about crossprod() and reshape2::melt():

# Create a random 20 x 10 binary matrix (rows = persons, columns = hobbies)
m.h <- matrix(sample(0:1, 200, replace = TRUE), nrow = 20)

# Cross-product of the rows: entry [i, j] is the number of hobbies
# shared by persons i and j (same result as m.h %*% t(m.h))
m.cross <- crossprod(t(m.h))

# Use reshape2 to melt/flatten the matrix into long format (Var1, Var2, value)
library(reshape2)
m.long <- melt(m.cross)
m.long[order(m.long$value, m.long$Var2, m.long$Var1), ]

# Plot the shared-hobby counts as a heat map
library(ggplot2)
person.labels <- paste("Person", seq_len(nrow(m.h)))
ggplot(m.long) +
  geom_tile(aes(Var1, Var2, fill = value)) +
  geom_text(aes(Var1, Var2, label = value)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_gradient(low = "yellow", high = "red") +
  # Var1 and Var2 are integer indices, so continuous scales are used here
  scale_x_continuous(breaks = seq_len(nrow(m.h)), labels = person.labels) +
  scale_y_continuous(breaks = seq_len(nrow(m.h)), labels = person.labels) +
  coord_cartesian(xlim = c(0, nrow(m.h) + 1), ylim = c(0, nrow(m.h) + 1))
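To see what the cross-product measures, here is a small sketch that applies the same idea to the 4 x 4 matrix from the question instead of random data. The object names (m, shared) are illustrative only; tcrossprod(m) is base R shorthand for m %*% t(m).

```r
# The example matrix from the question: rows are persons, columns are hobbies
m <- matrix(c(1, 0, 0, 1,
              0, 1, 0, 1,
              1, 1, 1, 0,
              0, 1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("person", 1:4), paste0("hobby", 1:4)))

# Entry [i, j] counts the hobbies that persons i and j share;
# the diagonal holds each person's own number of hobbies
shared <- tcrossprod(m)
print(shared)
```

Note that these are raw counts, so a person with many hobbies will tend to look similar to everyone; a normalised measure such as the Jaccard index corrects for that.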


Answer 2

Are you wondering what a useful similarity/dissimilarity metric for clustering binary data would be? There is the Jaccard index (or Jaccard coefficient), which is

(size of intersection) / (size of union)

aka (# of shared 1's) / (# of columns where at least one of the two rows has a 1). The corresponding Jaccard distance would be 1 - the Jaccard index. There is also the simple matching coefficient, which is

(# of columns where the two rows agree, counting shared 1's and shared 0's) / (length of vectors)

I'm sure there are other distance metrics proposed for binary data. This really is a statistics question, so you should consult a book on that subject.
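As a rough illustration, both coefficients can be written in a few lines of base R. The helper names jaccard and smc are hypothetical (they are not from any package), and the sketch assumes the usual definition of the simple matching coefficient, which counts every position where the two vectors agree, shared 0's included.

```r
# Hypothetical helper: Jaccard index of two 0/1 vectors
jaccard <- function(a, b) {
  a <- a != 0
  b <- b != 0
  n_union <- sum(a | b)                  # positions where at least one is 1
  if (n_union == 0) return(NA_real_)     # undefined when both are all zeros
  sum(a & b) / n_union                   # shared 1's / union
}

# Hypothetical helper: simple matching coefficient (agreements / length)
smc <- function(a, b) mean((a != 0) == (b != 0))

person2 <- c(0, 1, 0, 1)
person4 <- c(0, 1, 1, 1)

jaccard(person2, person4)  # 2 shared hobbies out of 3 in the union
smc(person2, person4)      # 3 agreeing positions out of 4
```

The two measures differ in how they treat shared 0's: the Jaccard index ignores hobbies that neither person has, while the simple matching coefficient counts them as agreement.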

In R specifically, you can use dist(x, method="binary"), which computes the Jaccard distance (1 - the Jaccard index) between rows. You then pass the resulting distance object (call it dist.obj) to the clustering algorithm of your choice (e.g. hclust).
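Put together, a minimal sketch on the matrix from the question might look like this. The choice of average linkage and of k = 2 clusters is an assumption made purely for illustration; neither is prescribed by the answer.

```r
# The example matrix from the question: rows are persons, columns are hobbies
m <- matrix(c(1, 0, 0, 1,
              0, 1, 0, 1,
              1, 1, 1, 0,
              0, 1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("person", 1:4), paste0("hobby", 1:4)))

# Pairwise distances between rows; method = "binary" is 1 - Jaccard index
dist.obj <- dist(m, method = "binary")
print(dist.obj)

# Hierarchical clustering on the distance matrix (linkage chosen arbitrarily)
hc <- hclust(dist.obj, method = "average")

# Cut the tree into k groups (k = 2 is an arbitrary choice here)
groups <- cutree(hc, k = 2)
print(groups)

# plot(hc) draws the dendrogram, which helps when choosing k
```

The smallest distance here is between person2 and person4, so they are merged first and end up in the same cluster for any cut of the tree.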

Answer 1
1 2013-12-12 04:12:30

Answer 2
0 2015-06-03 01:56:37