
R: Clustering - how to predict new cases?

2015-11-09 13:42:49 | Tags: r, machine-learning, r-caret, supervised-learning, unsupervised-learning

I have 4000 (continuous) predictor variables in a set of 150 patients. First, the variables which are associated with survival should be identified. I therefore use the multiple testing procedures function ( http://svitsrv25.epfl.ch/R-doc/library/multtest/html/MTP.html ) with the t-statistic for tests of regression coefficients in Cox proportional hazards survival models to identify significant predictors. This analysis identifies 60 parameters which are significantly associated with survival. I then perform unsupervised k-means clustering with the ConsensusClusterPlus package ( https://www.bioconductor.org/packages/release/bioc/html/ConsensusClusterPlus.html ), which identifies 3 clusters as the optimal solution based on the CDF curve and progression graph. If I then perform a Kaplan-Meier survival analysis, I see that each of the three clusters is associated with a distinct survival pattern (short / intermediate / long survival).

The question that I now have is the following: Let's assume that I have another set of 50 patients, and I'd like to predict which of the three clusters each patient most likely belongs to. How can I achieve this? Do I need to train a classifier (e.g. with the caret package (topepo.github.io/caret/bytag.html), where the 150 patients with the 60 significant parameters form the training set and the algorithm knows which patient was allocated to which of the three clusters) and validate the classifier in the 50 new patients? And then perform a Kaplan-Meier survival analysis to see whether the predicted clusters in the validation set (n=50) are again associated with a distinct survival pattern?

Thanks for your help.

2 Answers

The answer is much simpler. You already have your k-means solution with 3 clusters. Each cluster is identified by its centroid (a point in your 60-dimensional space). To "classify" a new point, you just measure the Euclidean distance to each of these three centroids and select the cluster whose centroid is closest. That's all. This follows directly from the fact that k-means gives you a partitioning of the whole space, not just of your training set.
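The nearest-centroid rule described above could be sketched in base R roughly as follows. This is only an illustration on simulated data: `train_x`, `train_labels`, `new_x` and the helper `assign_cluster` are hypothetical names, not part of any package. As far as I know, ConsensusClusterPlus reports class labels rather than centroids, so the sketch assumes the centroids are recomputed as per-cluster means of the training patients.

```r
# Simulated stand-in for the real data (assumption): 150 training patients
# x 60 selected variables, with a cluster label (1, 2 or 3) per patient.
set.seed(42)
true_centers <- rbind(rep(-3, 60), rep(0, 60), rep(3, 60))
train_labels <- rep(1:3, each = 50)
train_x <- true_centers[train_labels, ] + matrix(rnorm(150 * 60), nrow = 150)

# Centroid of each cluster = mean of the training rows carrying that label.
# rowsum() orders its rows by sorted label, matching table().
centroids <- rowsum(train_x, train_labels) / as.vector(table(train_labels))

# Hypothetical helper: assign each row of new_x to the nearest centroid.
assign_cluster <- function(new_x, centroids) {
  # Squared Euclidean distance from every new patient to every centroid.
  d2 <- sapply(seq_len(nrow(centroids)), function(k) {
    rowSums(sweep(new_x, 2, centroids[k, ])^2)
  })
  d2 <- matrix(d2, nrow = nrow(new_x))  # keep matrix shape for a single row
  max.col(-d2, ties.method = "first")   # column index of the smallest distance
}

# Simulated stand-in for the 50 new patients.
new_truth <- rep(1:3, times = c(20, 15, 15))
new_x <- true_centers[new_truth, ] + matrix(rnorm(50 * 60), nrow = 50)

new_cluster <- assign_cluster(new_x, centroids)
table(new_cluster)
```

If the training data were centered or scaled before clustering, the new patients have to be transformed with the same training-set means and standard deviations before the distances are computed.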

My advice is to create a predictive model, such as a random forest, using the cluster number as the outcome. It will lead to better results than predicting from the distances to the cluster centroids.

There are several reasons, but the main one is that a predictive model is specialized for exactly this task: for example, it will keep and rely on the informative variables, whereas in the clustering every variable carries the same weight.
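A minimal sketch of this approach, assuming the randomForest package is installed, might look like the following. The data are simulated and all object names are hypothetical. To mirror the point about variable weighting, only the first 5 of 20 simulated variables carry any cluster signal.

```r
library(randomForest)  # assumption: the randomForest package is available

# Simulated stand-in for the training set (assumption): 150 patients,
# 20 variables, cluster label as a factor.
set.seed(1)
train_labels <- factor(rep(1:3, each = 50))
train_x <- matrix(rnorm(150 * 20), nrow = 150)
shift <- c(0, 5, 10)[as.integer(train_labels)]
train_x[, 1:5] <- train_x[, 1:5] + shift  # only variables 1-5 are informative
colnames(train_x) <- paste0("v", 1:20)

# Fit a random forest with the cluster label as the outcome.
fit <- randomForest(x = train_x, y = train_labels, ntree = 500)

# Simulated stand-in for the 50 new patients.
new_truth <- factor(rep(1:3, times = c(20, 15, 15)))
new_x <- matrix(rnorm(50 * 20), nrow = 50)
new_x[, 1:5] <- new_x[, 1:5] + c(0, 5, 10)[as.integer(new_truth)]
colnames(new_x) <- colnames(train_x)

# Predict the cluster of each new patient.
pred <- predict(fit, newdata = new_x)
table(predicted = pred, simulated = new_truth)

# Variable importance (mean decrease in Gini) shows which variables
# the forest actually relies on.
head(importance(fit, type = 2))
```

In the real analysis the true cluster of a new patient is unknown, so the predicted labels would instead be checked the way the question proposes, for example by fitting Kaplan-Meier curves per predicted cluster in the 50 new patients.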