
How to Cluster Sequential Categorical Data in R

Consider a data set where users can choose among 3 activities, and we have each user's first 10 activity choices. Example data:

for (i in 1:10) 
{
  # sample from list of 3 strings using a set probability
  x <- sample( c("A", "B", "C"), 1000, replace=TRUE, prob=c(0.5, 0.3, 0.2) )
  # assign to variable created on the fly
  assign( paste("cat", i, sep=""), x )
}

first10 <- data.frame(cat1, cat2, cat3, cat4, cat5, cat6, cat7, cat8, cat9, cat10)

What's the best approach in R to cluster users according to their activity sequence?

I've looked around on Stack Overflow, and the most similar questions ask how to cluster categorical data in R (which is part of the analysis), but that by itself doesn't account for the sequential nature of the data. Are there R packages that are well-suited for this analysis?

Look into frequent itemset mining instead of clustering.

Most clustering methods are designed for continuous numerical data and assume some vector space. They take every position into account.
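To make "every position counts" concrete, here is a minimal base-R sketch of position-by-position clustering. It is not something the answer prescribes: the Hamming distance, average linkage, the 200-user subset, and `k = 4` are all assumptions chosen for illustration, and the data is re-simulated in the same shape as the question's `first10`.

```r
# Simulate data in the same shape as the question: 1000 users x 10 choices.
set.seed(42)  # assumption: seed added only so the sketch is reproducible
first10 <- as.data.frame(
  replicate(10, sample(c("A", "B", "C"), 1000, replace = TRUE,
                       prob = c(0.5, 0.3, 0.2))),
  stringsAsFactors = FALSE
)
names(first10) <- paste0("cat", 1:10)

# Work on a subset so the pairwise distance matrix stays small.
m <- as.matrix(first10[1:200, ])

# Hamming distance: the number of positions at which two users differ.
n <- nrow(m)
hamming <- matrix(0, n, n)
for (j in seq_len(ncol(m))) {
  hamming <- hamming + outer(m[, j], m[, j], FUN = "!=")
}

# Hierarchical clustering on the distance matrix, cut into k groups.
hc <- hclust(as.dist(hamming), method = "average")
clusters <- cutree(hc, k = 4)  # k = 4 is an arbitrary illustrative choice
table(clusters)
```

Because the distance compares position 1 with position 1, position 2 with position 2, and so on, two users who perform the same run of activities shifted by one step can still end up far apart.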

A frequent pattern, however, may be only part of a sequence; a sequence may exhibit several of these patterns (or none); and patterns may have gaps in between. All of these properties are usually desirable.
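As a rough illustration of the pattern view, the base-R sketch below counts how many users exhibit each contiguous length-3 pattern, then checks one pattern that allows gaps. It is a hand-rolled approximation, not a full frequent-pattern miner; the pattern length of 3 and the gapped pattern "A, later B, later C" are assumptions picked for the example, and the data is re-simulated in the question's shape.

```r
# Simulate data in the same shape as the question: 1000 users x 10 choices.
set.seed(42)  # assumption: seed added only so the sketch is reproducible
first10 <- as.data.frame(
  replicate(10, sample(c("A", "B", "C"), 1000, replace = TRUE,
                       prob = c(0.5, 0.3, 0.2))),
  stringsAsFactors = FALSE
)
names(first10) <- paste0("cat", 1:10)

# Collapse each user's row into a single string such as "ABACABBACA".
seqs <- apply(first10, 1, paste, collapse = "")

# Contiguous patterns of length 3: list the distinct ones each user shows,
# so that a user contributes at most once to a pattern's support.
trigrams <- unlist(lapply(seqs, function(s) {
  unique(substring(s, 1:8, 3:10))
}))

# Support = share of users whose sequence contains the pattern.
support <- sort(table(trigrams) / length(seqs), decreasing = TRUE)
head(support, 5)

# A pattern with gaps: an A, followed at some later point by a B, then a C.
gapped_support <- mean(grepl("A.*B.*C", seqs))
gapped_support
```

Here a pattern can match anywhere in the sequence, one user can match many patterns or none, and the regular expression shows how gaps can be tolerated.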
