Hierarchical clustering on continuous heterogeneous variables with different ranges/scales in R

I would like to use R to perform hierarchical clustering with two groups of variables describing the same samples. One group is microarray gene expression data (for specific genes) that has been normalized and corrected for batch effects. The other group consists of quantitative clinical parameters measured on the same samples. However, these clinical variables have not been normalized or transformed in any way (i.e. they are raw continuous values).

For example, one of these variables could range from 2 to 35, whereas another ranges from 0.1 to 0.9, etc.

My ultimate goal is to run hierarchical clustering on both groups simultaneously (merged into a single matrix/data frame), in order to inspect which of the clinical variables cluster with specific genes. So:

1) Is an initial transformation of the clinical variables necessary before merging them with the genes and performing the clustering? For example, a log2 transformation, which has also been applied to part of my gene expression data.

2) Or would row scaling (the rows being all the features in the input data) account for this discrepancy?

3) For a similar analysis, such as constructing a correlation plot of all the variables above, would simple scaling be sufficient?

Without having seen your gene expression data, I can only offer some general suggestions based on your description, in the context of the 3 questions you asked:

1) You should definitely check the distribution of each group. In R, you can use one or more of the following functions to visualize the distribution:

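# expression_data is assumed to be a numeric vector or matrix (use as.matrix() on a data frame)
hist(expression_data)                             # histogram
plot(density(expression_data))                    # density plot; alternative to histogram
qqnorm(expression_data); qqline(expression_data)  # normal Q-Q plot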

Since my understanding is that one of your groups is log2 transformed, that particular group should be roughly normally distributed (i.e. a bell-shaped histogram and an approximately straight line in the Q-Q plot), although you should confirm this rather than assume it. Whether to transform the group that has not yet been transformed depends on what you want to do with the data. For instance, if you want to use a t-test to compare the two groups, then a transformation is likely needed for clearly skewed data, as a t-test carries a normality assumption. With regard to hierarchical clustering, if you decide to use both groups in a single clustering analysis, it is hard to justify keeping one group transformed and the other not.
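As a minimal sketch of the point above, assuming simulated clinical variables (the names `marker_a` and `ratio_b`, the `skew` helper, and all values are hypothetical and not taken from the question), one could check each variable for skew and log2-transform only the strictly positive, right-skewed one:

```r
set.seed(1)

# Hypothetical clinical variables for 20 samples (names and values are made up)
clinical <- data.frame(
  marker_a = rlnorm(20, meanlog = 2, sdlog = 0.6),  # strictly positive, tends to be right-skewed
  ratio_b  = runif(20, min = 0.1, max = 0.9)        # bounded between 0.1 and 0.9
)

# Simple moment-based skewness estimate (hypothetical helper, not a base R function)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Inspect each raw variable before deciding on a transformation
raw_skew <- sapply(clinical, skew)

# log2 is only defined for positive values, so check before transforming
stopifnot(all(clinical$marker_a > 0))
clinical$marker_a_log2 <- log2(clinical$marker_a)

log_skew <- skew(clinical$marker_a_log2)
print(raw_skew)
print(log_skew)
```

A skewness closer to 0 after the transformation would suggest a more symmetric distribution; the histogram and Q-Q plot remain the more informative check.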

2) Scaling by features is a reasonable approach. Here is a clustering lecture from a Utah State Univ. stats course, with an example. If you decide to use the heatmap function in R, its scale argument (scale = "row" or scale = "column", whichever dimension holds your features) is an option for you.
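Scaling by features followed by hierarchical clustering of the features could look roughly like the sketch below. The data are simulated, the names (`genes`, `clinical`, `merged`) are hypothetical, and the choices of Euclidean distance, average linkage, and `k = 2` are assumptions for illustration only, not recommendations from the answer:

```r
set.seed(42)
n <- 30

# Hypothetical data: rows are samples, columns are features
genes <- matrix(rnorm(n * 4, mean = 8, sd = 1), nrow = n,
                dimnames = list(paste0("s", 1:n), paste0("gene", 1:4)))
clinical <- cbind(clin_a = runif(n, 2, 35),     # range roughly 2 to 35
                  clin_b = runif(n, 0.1, 0.9))  # range roughly 0.1 to 0.9
rownames(clinical) <- rownames(genes)

merged <- cbind(genes, clinical)  # samples x features
scaled <- scale(merged)           # center each feature to mean 0 and scale to sd 1

# Cluster the features: transpose so that each feature becomes a row
d      <- dist(t(scaled))              # Euclidean distance between standardized features
hc     <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)            # cut the tree into 2 groups of features
print(groups)
```

Calling `plot(hc)` draws the dendrogram, which shows which clinical variables join which genes and at what height.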

3) I don't think there is a definitive answer to your third question. It depends on how many features you have available and what analyses you will be doing downstream. Similar to question 1, I would argue that simple scaling may be sufficient for visualizing your data by hierarchical clustering. However, keep in mind that if you decide to fit a linear model (which is very common with microarray gene expression data), you might want to consider more sophisticated data scaling.
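One detail worth checking for the correlation plot specifically: Pearson correlation is unchanged by centering and scaling each variable, so standardizing alone does not alter the correlation matrix, whereas a nonlinear transformation such as log2 can. A small sketch on simulated data (all names and values are hypothetical):

```r
set.seed(7)
n <- 25

# Hypothetical samples x features matrix with very different ranges
x <- cbind(gene1  = rnorm(n, mean = 8, sd = 1),
           clin_a = runif(n, 2, 35),
           clin_b = runif(n, 0.1, 0.9))

r_raw    <- cor(x)         # Pearson correlation on the raw values
r_scaled <- cor(scale(x))  # Pearson correlation after standardizing each column

# The two matrices agree up to floating-point error
print(all.equal(r_raw, r_scaled))

# A nonlinear transformation, by contrast, generally changes the correlation
r_log <- cor(x[, "gene1"], log2(x[, "clin_a"]))
print(c(raw = r_raw["gene1", "clin_a"], log2 = r_log))
```

So for a correlation plot, the decision that matters is whether to transform the clinical variables, not whether to scale them.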
