Principal component analysis: labels of components?

I have a data frame with 17 columns (one column per gene) and 34 rows (one row per patient):

Patient EXO1 MLH1 MSH2 MSH3 MSH6 PCNA PMS1 PMS2 POLE POLE2 POLE3 POLH RFC2 
1651109    0    0    1    1    1    1    1    1    1     0     1    0    0      
1651648    0    1    1    1    1    0    1    0    1     0     0    1    1  
........

The name of the data frame is, say, testdb. Then I run

res <- princomp(testdb)
summary(res)

and that shows

Importance of components:  
                          Comp.1    Comp.2    Comp.3     Comp.4     Comp.5  
Standard deviation     0.6577676 0.4757815 0.4138278 0.39002636 0.37679135  
Proportion of Variance 0.2822533 0.1476757 0.1117206 0.09923892 0.09261812  
Cumulative Proportion  0.2822533 0.4299290 0.5416497 0.64088859 0.73350672  
....

It is stupid that the names are Comp.1, Comp.2, Comp.3, and so on. How can I map these names back to gene names? I know biplot(res) will print some of the genes on the output graph, but that is obviously not the correct way to get the gene names.

Although most of this has already been stated in the comments, I'm turning it into an answer.

The components of a principal component analysis are linear combinations of your original variables, so there is no one-to-one mapping between components and genes. Except in special cases, every component describes multiple genes: some contribute positively and some negatively, some with large and some with small absolute values. You can see these contributions in the loadings matrix: enter loadings(res) and you will see the composition of each component.
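To make "linear combination" concrete, here is a minimal sketch on simulated data. The data frame toy and its five gene columns are hypothetical stand-ins for testdb, not the asker's data. It checks that the scores princomp reports are simply the centered data multiplied by the loadings matrix.

```r
# Hypothetical stand-in for testdb: 34 patients x 5 binary gene columns.
set.seed(1)
genes <- c("EXO1", "MLH1", "MSH2", "MSH3", "MSH6")
toy <- as.data.frame(matrix(rbinom(34 * 5, 1, 0.5), nrow = 34,
                            dimnames = list(NULL, genes)))

res_toy <- princomp(toy)
L <- unclass(loadings(res_toy))   # plain matrix: genes in rows, components in columns

# Each component score is the centered data times that component's loading column.
centered <- sweep(as.matrix(toy), 2, res_toy$center)
manual_scores <- centered %*% L

# Largest discrepancy between the manual product and princomp's own scores.
max_diff <- max(abs(manual_scores - res_toy$scores))
print(round(L[, 1:2], 3))
print(max_diff)
```

Every row of the printed loadings is a gene and every column a component, so reading down one column shows how much each gene contributes to that component.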

For a specific component, you can find the gene with the maximum absolute value in that component's column of the loadings matrix. That way you could identify something like a “primary contributor” to each component. But unless that loading is very close to one in absolute value, treating the component as a synonym for the gene would be misleading at best. If you want your analysis in terms of individual genes, PCA is not the right tool.

If you are sure you want the “main contributor” despite the above warnings, the following code finds it:

l <- loadings(res)
# For each component (column), name the gene (row) with the largest absolute loading
rownames(l)[apply(l, 2, function(x) which.max(abs(x)))]
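To judge whether a “main contributor” really dominates its component, it may help to print the loading next to the gene name. The sketch below again uses simulated data: toy and its gene columns are hypothetical, and the share column is my own addition. Each loadings column has unit length, so the squared loading is the fraction of the component's weight carried by that gene.

```r
# Hypothetical stand-in for testdb: 34 patients x 5 binary gene columns.
set.seed(2)
genes <- c("PCNA", "PMS1", "PMS2", "POLE", "POLH")
toy <- as.data.frame(matrix(rbinom(34 * 5, 1, 0.5), nrow = 34,
                            dimnames = list(NULL, genes)))
l <- unclass(loadings(princomp(toy)))   # plain matrix of loadings

# Row index of the largest absolute loading in each component (column).
top <- apply(l, 2, function(x) which.max(abs(x)))
top_loading <- l[cbind(top, seq_len(ncol(l)))]

top_table <- data.frame(
  component = colnames(l),
  gene      = rownames(l)[top],
  loading   = top_loading,
  share     = top_loading^2   # squared loadings in a column sum to 1
)
print(top_table)
```

A share well below one means the component mixes several genes, which is the situation the warning above describes.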


 