How to correctly filter a gene expression matrix by correlation value?

I have preprocessed Affymetrix microarray gene expression data (32830 probesets in rows, 735 RNA samples in columns). Here is what my expression matrix looks like:

> exprs_mat[1:6, 1:4]
             Tarca_001_P1A01 Tarca_003_P1A03 Tarca_004_P1A04 Tarca_005_P1A05
1_at                6.062215        6.125023        5.875502        6.126131
10_at               3.796484        3.805305        3.450245        3.628411
100_at              5.849338        6.191562        6.550525        6.421877
1000_at             3.567779        3.452524        3.316134        3.432451
10000_at            6.166815        5.678373        6.185059        5.633757
100009613_at        4.443027        4.773199        4.393488        4.623783

I also have the phenodata for this Affymetrix expression set (RNA sample identifiers in rows, sample descriptions in columns):

 > pheno[1:6, 1:4]
                       SampleID   GA Batch     Set
Tarca_001_P1A01 Tarca_001_P1A01 11.0     1 PRB_HTA
Tarca_013_P1B01 Tarca_013_P1B01 15.3     1 PRB_HTA
Tarca_025_P1C01 Tarca_025_P1C01 21.7     1 PRB_HTA
Tarca_037_P1D01 Tarca_037_P1D01 26.7     1 PRB_HTA
Tarca_049_P1E01 Tarca_049_P1E01 31.3     1 PRB_HTA
Tarca_061_P1F01 Tarca_061_P1F01 32.1     1 PRB_HTA

Since the sample identifiers are in the rows of the phenodata, I need a way to match each SampleID in the phenodata with the corresponding sample (column name) in the expression matrix exprs_mat.
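One way to line the two objects up is to intersect the column names of the matrix with the row names of the phenodata and subset both with the shared IDs. The following is a minimal sketch, not a definitive implementation; exprs_toy and pheno_toy are small made-up stand-ins for exprs_mat and pheno, and all values in them are invented for illustration:

```r
# Toy stand-ins for exprs_mat and pheno; all values are made up for illustration.
exprs_toy <- matrix(
  c(6.06, 6.13, 5.88,
    3.80, 3.81, 3.45),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("1_at", "10_at"),
                  c("Tarca_001_P1A01", "Tarca_003_P1A03", "Tarca_004_P1A04"))
)
pheno_toy <- data.frame(
  SampleID = c("Tarca_004_P1A04", "Tarca_001_P1A01", "Tarca_013_P1B01"),
  GA = c(21.7, 11.0, 15.3),
  row.names = c("Tarca_004_P1A04", "Tarca_001_P1A01", "Tarca_013_P1B01")
)

# Samples present in both objects, in the column order of the matrix.
common <- intersect(colnames(exprs_toy), rownames(pheno_toy))

# Subset and reorder both objects so that column i of the matrix
# describes the same sample as row i of the phenodata.
exprs_matched <- exprs_toy[, common, drop = FALSE]
pheno_matched <- pheno_toy[common, , drop = FALSE]

# Sanity check: the alignment must be exact before any per-sample computation.
stopifnot(identical(colnames(exprs_matched), rownames(pheno_matched)))
```

Samples that appear in only one of the two objects are dropped silently here, so it may be worth comparing length(common) with ncol(exprs_mat) and nrow(pheno) on the real data.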

OBJECTIVE:

I want to filter the genes in the expression matrix by measuring the correlation between each gene and the target profile data in the phenodata. Here is my initial attempt, but I am not sure it is correct:

Update: my implementation in R:

I intend to see how the expression of each gene across the samples correlates with the GA value of the corresponding samples in the annotation data. Here is my simple function to find this correlation in R:

getPCC <- function(expr_mat, anno_mat, verbose=FALSE){
stopifnot(class(expr_mat)=="matrix")
stopifnot(class(anno_mat)=="matrix")
stopifnot(ncol(expr_mat)==nrow(anno_mat))
final_df <- as.data.frame()
lapply(colnames(expr_mat), function(x){
    lapply(x, rownames(y){
        if(colnames(x) %in% rownames(anno_mat)){
            cor_mat <- stats::cor(y, anno_mat$GA, method = "pearson")
            ncor <- ncol(cor_mat)
            cmatt <- col(cor_mat)
            ord <- order(-cmat, cor_mat, decreasing = TRUE)- (ncor*cmatt - ncor)
            colnames(ord) <- colnames(cor_mat)
            res <- cbind(ID=c(cold(ord), ID2=c(ord)))
            res <- as.data.frame(cbind(out, cor=cor_mat[res]))
            final_df <- cbind(res, cor=cor_mat[out])
        }
    })
})
return(final_df)

} }

But the script above does not return the output I expect. Any idea how to make this work correctly?
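For comparison, the stated objective (correlate each probeset with GA across matched samples, then filter) can be sketched in base R without nested lapply calls. This is only a sketch under assumptions: filter_by_cor(), its arguments, and the 0.5 cutoff are hypothetical choices that do not come from the question, and the toy data are invented so the expected correlations are obvious.

```r
# Sketch only: filter_by_cor() and its arguments are hypothetical names.
filter_by_cor <- function(expr_mat, pheno_df, target = "GA", cutoff = 0.5) {
  stopifnot(is.matrix(expr_mat), is.numeric(expr_mat))
  stopifnot(is.data.frame(pheno_df), target %in% colnames(pheno_df))

  # Align samples: matrix columns against phenodata row names.
  common <- intersect(colnames(expr_mat), rownames(pheno_df))
  stopifnot(length(common) >= 3)
  expr_mat <- expr_mat[, common, drop = FALSE]
  profile <- pheno_df[common, target]

  # cor() correlates columns, so transpose to put probesets in columns.
  pcc <- stats::cor(t(expr_mat), profile, method = "pearson")[, 1]

  # Keep probesets whose absolute correlation reaches the (assumed) cutoff.
  keep <- !is.na(pcc) & abs(pcc) >= cutoff
  list(cor = pcc, filtered = expr_mat[keep, , drop = FALSE])
}

# Toy data, invented for illustration: one rising, one falling, one flat probeset.
toy_expr <- matrix(
  c(1, 2, 3, 4,
    4, 3, 2, 1,
    2, 1, 1, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("up_at", "down_at", "flat_at"), c("S1", "S2", "S3", "S4"))
)
toy_pheno <- data.frame(GA = c(10, 20, 30, 40),
                        row.names = c("S1", "S2", "S3", "S4"))

res <- filter_by_cor(toy_expr, toy_pheno)
res$cor       # named vector of Pearson correlations, one per probeset
res$filtered  # rows of the matrix that pass the cutoff
```

Filtering on the absolute value keeps both positively and negatively correlated probesets; dropping abs() would keep only the positive ones. The note in the earlier function that class(x) == "matrix" is a fragile check also applies here, which is why the sketch uses is.matrix() and treats the phenodata as a data frame.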

Does something like this help?

library(tidyverse)

x <- data.frame(stringsAsFactors=FALSE,
     Levels = c("1_at", "10_at", "100_at", "1000_at", "10000_at", "100009613_at"),
     Tarca_001_P1A01 = c(6.062215, 3.796484, 5.849338, 3.567779, 6.166815,
                           4.443027),
     Tarca_003_P1A03 = c(6.125023, 3.805305, 6.191562, 3.452524, 5.678373,
                           4.773199),
     Tarca_004_P1A04 = c(5.875502, 3.450245, 6.550525, 3.316134, 6.185059,
                           4.393488),
     Tarca_005_P1A05 = c(6.126131, 3.628411, 6.421877, 3.432451, 5.633757,
                           4.623783)
     )


y <- data.frame(stringsAsFactors=FALSE,
     gene = c("Tarca_001_P1A01", "Tarca_013_P1B01", "Tarca_025_P1C01",
              "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01"),
     SampleID = c("Tarca_001_P1A01", "Tarca_013_P1B01", "Tarca_025_P1C01",
                    "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01"),
     GA = c(11, 15.3, 21.7, 26.7, 31.3, 32.1),
     Batch = c(1, 1, 1, 1, 1, 1),
     Set = c("PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA")
     )



x %>% gather(SampleID, value, -Levels) %>% 
  left_join(., y, by = "SampleID") %>% 
  group_by(SampleID) %>% 
  filter(value == max(value)) %>% 
  spread(SampleID, value)
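If the goal is a per-probeset correlation with GA rather than the largest value per sample, the same long-format idea can be sketched as below. This is an illustrative variant, not the answer's method: it assumes the dplyr and tidyr packages are installed, uses pivot_longer() in place of gather(), and runs on invented toy data in which every sample ID appears in both tables.

```r
library(dplyr)
library(tidyr)

# Toy data, invented for illustration; sample IDs S1..S4 occur in both tables.
expr_df <- data.frame(
  Levels = c("up_at", "down_at"),
  S1 = c(1, 4), S2 = c(2, 3), S3 = c(3, 2), S4 = c(4, 1),
  stringsAsFactors = FALSE
)
pheno_df <- data.frame(
  SampleID = c("S1", "S2", "S3", "S4"),
  GA = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

cor_tbl <- expr_df %>%
  # One row per probeset/sample pair.
  pivot_longer(-Levels, names_to = "SampleID", values_to = "value") %>%
  # Attach GA; inner_join drops samples missing from either table.
  inner_join(pheno_df, by = "SampleID") %>%
  # Pearson correlation between expression and GA for each probeset.
  group_by(Levels) %>%
  summarise(pcc = stats::cor(value, GA), .groups = "drop")

# Keep probesets above an assumed absolute-correlation cutoff of 0.5.
kept <- dplyr::filter(cor_tbl, abs(pcc) >= 0.5)
```

On a matrix with tens of thousands of probesets and hundreds of samples, the long table becomes large, so a matrix-based cor() call is likely to be faster than this grouped version.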
