
Principal component analysis: how to label the components?

I have a data frame with 17 columns (one column per gene) and 34 rows (one row per patient):

Patient EXO1 MLH1 MSH2 MSH3 MSH6 PCNA PMS1 PMS2 POLE POLE2 POLE3 POLH RFC2 
1651109    0    0    1    1    1    1    1    1    1     0     1    0    0      
1651648    0    1    1    1    1    0    1    0    1     0     0    1    1  
........

The data frame is called, say, testdb. Then I run

res <- princomp(testdb)
summary(res)

and that shows

Importance of components:  
                          Comp.1    Comp.2    Comp.3     Comp.4     Comp.5  
Standard deviation     0.6577676 0.4757815 0.4138278 0.39002636 0.37679135  
Proportion of Variance 0.2822533 0.1476757 0.1117206 0.09923892 0.09261812  
Cumulative Proportion  0.2822533 0.4299290 0.5416497 0.64088859 0.73350672  
....

The components are only labelled Comp.1, Comp.2, Comp.3, and so on, which is not helpful. How can I map these names back to gene names? I know biplot(res) prints some of the genes on the output graph, but that is obviously not the right way to get the gene names.

Although most of this has already been stated in comments, I'm turning this into an answer.

The components of a principal component analysis are linear combinations of your original variables, so there is no one-to-one mapping between components and genes. Except in special cases, every component involves multiple genes: some with a positive and some with a negative contribution, some with large and some with small absolute values. You can see these contributions in the loadings matrix: enter loadings(res) and you will see the composition of each component, with one row per gene and one column per component.
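To make the "linear combination" point concrete, here is a minimal sketch. It assumes a made-up data frame `toy` of random 0/1 values as a stand-in for testdb (not the real patient data) and only uses a handful of the gene names for illustration. It shows that each column of the loadings matrix is a weight vector over all genes, and that the scores of a component are just the centred data multiplied by those weights.

```r
# Hypothetical stand-in for testdb: random 0/1 values, NOT the real patient data.
set.seed(1)
genes <- c("EXO1", "MLH1", "MSH2", "MSH3", "MSH6")
toy <- as.data.frame(matrix(rbinom(34 * 5, 1, 0.5), nrow = 34,
                            dimnames = list(NULL, genes)))

toy_res <- princomp(toy)

# Rows are genes, columns are components. unclass() gives a plain matrix,
# so every entry is printed (the default loadings print method blanks small ones).
L <- unclass(loadings(toy_res))
print(round(L, 2))

# Each column is a unit-length weight vector over all genes.
col_norms <- colSums(L^2)

# The Comp.1 scores are the centred data times the first column of L.
centred <- scale(toy, center = toy_res$center, scale = FALSE)
manual_scores <- as.vector(centred %*% L[, 1])
```

Comparing `manual_scores` with `toy_res$scores[, 1]` should show the same numbers, which is why a component cannot be renamed to a single gene: every gene with a non-zero weight feeds into it.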

You can find the gene with the maximum absolute value in the column for a specific component in the loadings matrix. That way you can identify something like a “primary contributor” to each component. But unless that loading is very close to one in absolute value, treating the component as a synonym for the gene would be misleading at best. If you want your analysis in terms of individual genes, PCA is not the right tool.

If you are sure you want the “main contributor” despite the above warnings, the following code does that:

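# loadings matrix: one row per gene, one column per component
l <- loadings(res)
# for each component (column), the gene with the largest absolute loading
rownames(l)[apply(l, 2, function(x) which.max(abs(x)))]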
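Since the gene name alone hides how dominant that gene actually is, it may help to report the loading next to it. The following is a sketch under assumptions: `main_contributor` is a hypothetical helper name, and `toy` is made-up random 0/1 data standing in for testdb.

```r
# Hypothetical stand-in for testdb: random 0/1 values, NOT the real patient data.
set.seed(1)
genes <- c("EXO1", "MLH1", "MSH2", "MSH3", "MSH6")
toy <- as.data.frame(matrix(rbinom(34 * 5, 1, 0.5), nrow = 34,
                            dimnames = list(NULL, genes)))
toy_res <- princomp(toy)

# Hypothetical helper: for each component, the gene with the largest
# absolute loading, together with that (signed) loading.
main_contributor <- function(fit) {
  l <- unclass(loadings(fit))
  idx <- unname(apply(l, 2, function(x) which.max(abs(x))))
  data.frame(component = colnames(l),
             gene      = rownames(l)[idx],
             loading   = l[cbind(idx, seq_along(idx))],
             stringsAsFactors = FALSE)
}

contrib <- main_contributor(toy_res)
print(contrib)
```

A loading far from 1 in absolute value in this table is a sign that the component is a genuine mixture of genes and should not be labelled with one gene name.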
