How to reduce dimension of gene expression matrix by calculating correlation coefficients?
How to correctly filter out gene expression matrix by correlation value?
I have preprocessed Affymetrix microarray gene expression data (32830 probesets in rows, 735 RNA samples in columns). Here is my expression matrix:
> exprs_mat[1:6, 1:4]
Tarca_001_P1A01 Tarca_003_P1A03 Tarca_004_P1A04 Tarca_005_P1A05
1_at 6.062215 6.125023 5.875502 6.126131
10_at 3.796484 3.805305 3.450245 3.628411
100_at 5.849338 6.191562 6.550525 6.421877
1000_at 3.567779 3.452524 3.316134 3.432451
10000_at 6.166815 5.678373 6.185059 5.633757
100009613_at 4.443027 4.773199 4.393488 4.623783
I also have the phenotype data for this Affymetrix expression set (RNA sample identifiers in rows, sample descriptions in columns):
> pheno[1:6, 1:4]
SampleID GA Batch Set
Tarca_001_P1A01 Tarca_001_P1A01 11.0 1 PRB_HTA
Tarca_013_P1B01 Tarca_013_P1B01 15.3 1 PRB_HTA
Tarca_025_P1C01 Tarca_025_P1C01 21.7 1 PRB_HTA
Tarca_037_P1D01 Tarca_037_P1D01 26.7 1 PRB_HTA
Tarca_049_P1E01 Tarca_049_P1E01 31.3 1 PRB_HTA
Tarca_061_P1F01 Tarca_061_P1F01 32.1 1 PRB_HTA
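Since the sample identifiers are in the rows of the phenodata, I need a way to match the SampleID values in the phenodata with the sample columns of the expression matrix exprs_mat.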
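The matching step described above could be sketched as follows. This is a minimal illustration, not part of the original post: the toy values are rounded or made up, and the names `common`, `exprs_sub` and `pheno_sub` are hypothetical. It assumes the expression matrix has sample IDs as column names and the phenodata has a `SampleID` column.

```r
# Toy data shaped like the objects in the question (illustrative values only).
exprs_mat <- matrix(
  c(6.06, 3.80, 5.85,
    6.13, 3.81, 6.19,
    5.88, 3.45, 6.55),
  nrow = 3,
  dimnames = list(c("1_at", "10_at", "100_at"),
                  c("Tarca_001_P1A01", "Tarca_003_P1A03", "Tarca_004_P1A04"))
)
pheno <- data.frame(
  SampleID = c("Tarca_004_P1A04", "Tarca_001_P1A01", "Tarca_013_P1B01"),
  GA = c(21.7, 11.0, 15.3),
  stringsAsFactors = FALSE
)

# Samples present in both objects, in the column order of the matrix.
common <- intersect(colnames(exprs_mat), pheno$SampleID)

# Subset both objects with the same ID vector so they line up one-to-one.
exprs_sub <- exprs_mat[, common, drop = FALSE]
pheno_sub <- pheno[match(common, pheno$SampleID), , drop = FALSE]

# Sanity check: column i of exprs_sub and row i of pheno_sub are the same sample.
stopifnot(identical(colnames(exprs_sub), pheno_sub$SampleID))
```

Using one shared ID vector for both subsets avoids relying on the two objects happening to be in the same order.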
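Goal:
I want to filter the genes in the expression matrix by measuring the correlation between each gene and the target profile data in phenodata. This is my initial attempt, but I am not sure whether it is correct:
Update: my implementation in R:
I intend to see how the genes in each sample correlate with the GA value of the corresponding sample in the annotation data. Here is my simple function for computing this correlation in R: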
getPCC <- function(expr_mat, anno_mat, verbose=FALSE){
  stopifnot(class(expr_mat)=="matrix")
  stopifnot(class(anno_mat)=="matrix")
  stopifnot(ncol(expr_mat)==nrow(anno_mat))
  final_df <- as.data.frame()
  lapply(colnames(expr_mat), function(x){
    lapply(x, rownames(y){
      if(colnames(x) %in% rownames(anno_mat)){
        cor_mat <- stats::cor(y, anno_mat$GA, method = "pearson")
        ncor <- ncol(cor_mat)
        cmatt <- col(cor_mat)
        ord <- order(-cmat, cor_mat, decreasing = TRUE)- (ncor*cmatt - ncor)
        colnames(ord) <- colnames(cor_mat)
        res <- cbind(ID=c(cold(ord), ID2=c(ord)))
        res <- as.data.frame(cbind(out, cor=cor_mat[res]))
        final_df <- cbind(res, cor=cor_mat[out])
      }
    })
  })
  return(final_df)
}
However, the script above does not return the output I expect. Any ideas on how to implement this correctly?
Does something like this help:
library(tidyverse)
# Expression values: one row per probeset, one column per sample
x <- data.frame(stringsAsFactors=FALSE,
  Levels = c("1_at", "10_at", "100_at", "1000_at", "10000_at", "100009613_at"),
  Tarca_001_P1A01 = c(6.062215, 3.796484, 5.849338, 3.567779, 6.166815,
                      4.443027),
  Tarca_003_P1A03 = c(6.125023, 3.805305, 6.191562, 3.452524, 5.678373,
                      4.773199),
  Tarca_004_P1A04 = c(5.875502, 3.450245, 6.550525, 3.316134, 6.185059,
                      4.393488),
  Tarca_005_P1A05 = c(6.126131, 3.628411, 6.421877, 3.432451, 5.633757,
                      4.623783)
)
# Phenotype data: one row per sample
y <- data.frame(stringsAsFactors=FALSE,
  gene = c("Tarca_001_P1A01", "Tarca_013_P1B01", "Tarca_025_P1C01",
           "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01"),
  SampleID = c("Tarca_001_P1A01", "Tarca_013_P1B01", "Tarca_025_P1C01",
               "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01"),
  GA = c(11, 15.3, 21.7, 26.7, 31.3, 32.1),
  Batch = c(1, 1, 1, 1, 1, 1),
  Set = c("PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA")
)
# Reshape to long format, attach the phenotype columns by SampleID,
# keep the highest-expressed probeset in each sample, then reshape back to wide
x %>% gather(SampleID, value, -Levels) %>%
  left_join(., y, by = "SampleID") %>%
  group_by(SampleID) %>%
  filter(value == max(value)) %>%
  spread(SampleID, value)
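The pipeline above joins the two tables by sample ID but does not itself compute a correlation. A base-R sketch of the correlation filter described in the question might look like the following. The helper name `cor_with_ga`, the toy data, and the `cutoff` value are all assumptions made for illustration; it is one possible approach rather than the asker's or answerer's method.

```r
# Hypothetical helper: Pearson correlation of every probeset with GA.
# expr_mat: numeric matrix, probesets in rows, samples in columns.
# pheno:    data frame with SampleID and GA columns.
cor_with_ga <- function(expr_mat, pheno) {
  stopifnot(is.matrix(expr_mat), is.numeric(expr_mat), is.data.frame(pheno),
            all(c("SampleID", "GA") %in% names(pheno)))
  # Align samples: keep only IDs found in both objects, in the same order.
  common <- intersect(colnames(expr_mat), pheno$SampleID)
  stopifnot(length(common) >= 3)
  ga <- pheno$GA[match(common, pheno$SampleID)]
  # cor() correlates columns, so transpose to put probesets in columns.
  r <- stats::cor(t(expr_mat[, common, drop = FALSE]), ga, method = "pearson")
  out <- data.frame(probeset = rownames(expr_mat), cor = as.numeric(r),
                    stringsAsFactors = FALSE)
  # Strongest absolute correlation first.
  out <- out[order(-abs(out$cor)), ]
  rownames(out) <- NULL
  out
}

# Toy data (made-up values): 3 probesets x 4 samples.
expr_mat <- rbind(
  up_at   = c(1, 2, 3, 4),   # rises linearly with GA
  down_at = c(8, 6, 4, 2),   # falls linearly with GA
  flat_at = c(5, 7, 7, 5)    # no linear trend with GA
)
colnames(expr_mat) <- paste0("S", 1:4)
# Phenotype rows deliberately in a different order than the matrix columns.
pheno <- data.frame(SampleID = paste0("S", 4:1), GA = c(40, 30, 20, 10),
                    stringsAsFactors = FALSE)

res <- cor_with_ga(expr_mat, pheno)
cutoff <- 0.5  # arbitrary threshold, chosen only for illustration
keep <- res$probeset[abs(res$cor) >= cutoff]
filtered <- expr_mat[keep, , drop = FALSE]
print(res)
```

On real data, probesets with zero variance across the matched samples give `NA` from `cor()` (with a warning), so they would need to be dropped before applying the cutoff.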