Weighted observation frequency clustering using hclust in R

I have a large matrix of 500K observations to cluster using hierarchical clustering. Because of its size, I do not have the computing power to calculate the distance matrix.

To overcome this problem, I aggregated my matrix by merging identical observations, which reduced it to about 10K distinct rows. I have the frequency of each row in this aggregated matrix. I now need to incorporate this frequency as a weight in my hierarchical clustering.
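For concreteness, the aggregation step could look something like the sketch below. The data frame `df_full` and its columns `x` and `g` are made-up stand-ins for the real data, and this is only one of several ways to count duplicate rows in base R. Note that the formula interface of `aggregate()` drops rows containing `NA` by default, so it assumes complete data.

```r
# Toy stand-in for the full data set (hypothetical column names).
df_full <- data.frame(
  x = c(1, 1, 2, 2, 2, 5),
  g = factor(c("a", "a", "b", "b", "b", "a"))
)

# Add a column of ones and sum it within each group of identical rows.
df_full$freq <- 1L
df <- aggregate(freq ~ ., data = df_full, FUN = sum)

# df now has one row per distinct (x, g) combination,
# and df$freq holds how many times that row occurred.
print(df)
```

The frequencies sum back to the original number of rows, which is a quick sanity check that nothing was lost in the aggregation.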

The data is a mixture of numerical and categorical variables for the 500K observations, so I have used the daisy function from the cluster package to calculate the Gower dissimilarity for my aggregated dataset. I want to use hclust from the stats package on the aggregated dataset, but I want to take the frequency of each observation into account. From the help page for hclust, the arguments are as follows:

    hclust(d, method = "complete", members = NULL)

The information for the members argument is: "NULL or a vector with length size of d. See the 'Details' section."

When you look at the Details section you get:

"If members != NULL, then d is taken to be a dissimilarity matrix between clusters instead of dissimilarities between singletons and members gives the number of observations per cluster. This way the hierarchical cluster algorithm can be 'started in the middle of the dendrogram', e.g., in order to reconstruct the part of the tree above a cut (see examples). Dissimilarities between clusters can be efficiently computed (i.e., without hclust itself) only for a limited number of distance/linkage combinations, the simplest one being squared Euclidean distance and centroid linkage. In this case the dissimilarities between the clusters are the squared Euclidean distances between cluster means."

From the above description, I am unsure whether I can pass my frequency weights to the members argument, as it is not clear whether this is what the argument is for. I would like to use it like this:

    hclust(d, method = "complete", members = df$freq)

Here df$freq is the frequency of each row in the aggregated matrix, so if a row occurred 10 times in the original data this value would be 10.
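Spelled out on toy data, the intended pipeline might look like the sketch below. The data frame and its column names are invented for illustration. The one point worth noting is that the `freq` column is left out of the dissimilarity computation, so that it acts only as a cluster size and not as a feature.

```r
library(cluster)  # daisy() lives in the cluster package

# Hypothetical aggregated data: one row per distinct observation plus its count.
df <- data.frame(
  x    = c(1, 2, 5),
  g    = factor(c("a", "b", "a")),
  freq = c(2L, 3L, 1L)
)

# Leave the frequency column out of the dissimilarity computation.
vars <- setdiff(names(df), "freq")
d <- daisy(df[, vars], metric = "gower")

# Pass the row frequencies as the initial cluster sizes.
hc <- hclust(d, method = "complete", members = df$freq)
```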

If anyone can help me, that would be great.

Thanks

Yes, this should work fine for most linkages, in particular single, group average, and complete linkage. For Ward etc., you need to take the weights into account correctly yourself.
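A quick way to convince yourself is to compare, on a tiny example, the tree built from the full data with the tree built from the aggregated data plus `members`. The sketch below uses made-up one-dimensional data and plain Euclidean `dist()` rather than Gower, purely to keep it small. Duplicates merge at height 0 in the full tree, so only the merge heights above 0 are compared.

```r
# Full data with duplicates: 0 occurs 3 times, 1 once, 4 twice (toy values).
x_full <- c(0, 0, 0, 1, 4, 4)

# Aggregated version: distinct values and their frequencies.
agg <- data.frame(x = c(0, 1, 4), freq = c(3L, 1L, 2L))

hc_full     <- hclust(dist(x_full), method = "average")
hc_weighted <- hclust(dist(agg$x), method = "average", members = agg$freq)
hc_plain    <- hclust(dist(agg$x), method = "average")  # ignores frequencies

# Merge heights above zero in the full tree (zero heights are the duplicates).
h_full <- hc_full$height[hc_full$height > 1e-12]

# TRUE if the weighted aggregated tree reproduces the full tree's heights.
print(isTRUE(all.equal(h_full, hc_weighted$height)))

# The unweighted aggregated tree generally differs for average linkage.
print(isTRUE(all.equal(h_full, hc_plain$height)))
```

For single and complete linkage the cluster sizes never enter the update, so the weighted and unweighted aggregated trees coincide there. For average linkage the sizes do matter, which is why `members` is needed.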

Even the Ward case is not hard. Just make sure to use the cluster sizes, because the matrix you pass must hold distances between clusters, not between single points: each entry should be the distance between a cluster of n1 points at location x and a cluster of n2 points at location y. For min/max/mean (single, complete, and average linkage) the counts n1 and n2 disappear or cancel out. For Ward, you should get a sum-of-squares (SSQ)-like formula.
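As an illustration of what such an SSQ-like formula can look like, here is a sketch for `method = "ward.D2"` with Euclidean distances. It assumes the initial dissimilarity between n1 copies of x and n2 copies of y is sqrt(2 * n1 * n2 / (n1 + n2)) times the Euclidean distance between x and y, which reduces to the plain distance when n1 = n2 = 1. The data are made up, and Ward's criterion is defined for Euclidean distances, so carrying this over to Gower dissimilarities would be a heuristic rather than an exact equivalence.

```r
# Full data with duplicates and its aggregated form (toy values).
x_full <- c(0, 0, 0, 1, 4, 4)
agg <- data.frame(x = c(0, 1, 4), freq = c(3L, 1L, 2L))

# Size-dependent scaling factor for every pair of aggregated rows
# (assumed form, see the text above).
n <- agg$freq
w <- outer(n, n, function(a, b) sqrt(2 * a * b / (a + b)))

# Scale the point-to-point distances into cluster-to-cluster dissimilarities.
d_ward <- as.dist(as.matrix(dist(agg$x)) * w)

hc_weighted <- hclust(d_ward, method = "ward.D2", members = n)
hc_full     <- hclust(dist(x_full), method = "ward.D2")

# Compare against the non-zero merge heights of the full tree.
h_full <- hc_full$height[hc_full$height > 1e-8]
print(isTRUE(all.equal(h_full, hc_weighted$height)))
```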
