How to correctly filter a gene expression matrix by correlation values?

Question

I have preprocessed Affymetrix microarray gene expression data (32830 probesets in rows, 735 RNA samples in columns). Here is what my expression matrix looks like:

> exprs_mat[1:6, 1:4]
             Tarca_001_P1A01 Tarca_003_P1A03 Tarca_004_P1A04 Tarca_005_P1A05
1_at                6.062215        6.125023        5.875502        6.126131
10_at               3.796484        3.805305        3.450245        3.628411
100_at              5.849338        6.191562        6.550525        6.421877
1000_at             3.567779        3.452524        3.316134        3.432451
10000_at            6.166815        5.678373        6.185059        5.633757
100009613_at        4.443027        4.773199        4.393488        4.623783

I also have phenodata for this Affymetrix expression data (RNA sample identifiers in rows, sample descriptions in columns):

 > pheno[1:6, 1:4]
                       SampleID   GA Batch     Set
Tarca_001_P1A01 Tarca_001_P1A01 11.0     1 PRB_HTA
Tarca_013_P1B01 Tarca_013_P1B01 15.3     1 PRB_HTA
Tarca_025_P1C01 Tarca_025_P1C01 21.7     1 PRB_HTA
Tarca_037_P1D01 Tarca_037_P1D01 26.7     1 PRB_HTA
Tarca_049_P1E01 Tarca_049_P1E01 31.3     1 PRB_HTA
Tarca_061_P1F01 Tarca_061_P1F01 32.1     1 PRB_HTA

Since the sample identifiers are in the rows of the phenodata, I need to find a way to match the SampleID in phenodata with the sample IDs in the columns of the expression matrix exprs_mat.

OBJECTIVE:

I want to filter the genes in the expression matrix by measuring the correlation between each gene and the target profile data in phenodata. Here is my initial attempt, but I am not sure it is correct:

Update: my implementation in R:

I intend to see how the expression of each gene across samples correlates with the GA value of the corresponding samples in the annotation data. Here is my simple function to find this correlation in R:

getPCC <- function(expr_mat, anno_mat, verbose=FALSE){
stopifnot(class(expr_mat)=="matrix")
stopifnot(class(anno_mat)=="matrix")
stopifnot(ncol(expr_mat)==nrow(anno_mat))
final_df <- as.data.frame()
lapply(colnames(expr_mat), function(x){
    lapply(x, rownames(y){
        if(colnames(x) %in% rownames(anno_mat)){
            cor_mat <- stats::cor(y, anno_mat$GA, method = "pearson")
            ncor <- ncol(cor_mat)
            cmatt <- col(cor_mat)
            ord <- order(-cmat, cor_mat, decreasing = TRUE)- (ncor*cmatt - ncor)
            colnames(ord) <- colnames(cor_mat)
            res <- cbind(ID=c(cold(ord), ID2=c(ord)))
            res <- as.data.frame(cbind(out, cor=cor_mat[res]))
            final_df <- cbind(res, cor=cor_mat[out])
        }
    })
})
return(final_df)

} }

But the script above didn't return the output I am expecting. Any idea how to make this work correctly?

Answer 1

Does something like this help?

library(tidyverse)

x <- data.frame(stringsAsFactors=FALSE,
     Levels = c("1_at", "10_at", "100_at", "1000_at", "10000_at", "100009613_at"),
     Tarca_001_P1A01 = c(6.062215, 3.796484, 5.849338, 3.567779, 6.166815,
                           4.443027),
     Tarca_003_P1A03 = c(6.125023, 3.805305, 6.191562, 3.452524, 5.678373,
                           4.773199),
     Tarca_004_P1A04 = c(5.875502, 3.450245, 6.550525, 3.316134, 6.185059,
                           4.393488),
     Tarca_005_P1A05 = c(6.126131, 3.628411, 6.421877, 3.432451, 5.633757,
                           4.623783)
     )


y <- data.frame(stringsAsFactors=FALSE,
     gene = c("Tarca_001_P1A01", "Tarca_013_P1B01", "Tarca_025_P1C01",
              "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01"),
     SampleID = c("Tarca_001_P1A01", "Tarca_013_P1B01", "Tarca_025_P1C01",
                    "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01"),
     GA = c(11, 15.3, 21.7, 26.7, 31.3, 32.1),
     Batch = c(1, 1, 1, 1, 1, 1),
     Set = c("PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA")
     )



x %>% gather(SampleID, value, -Levels) %>% 
  left_join(., y, by = "SampleID") %>% 
  group_by(SampleID) %>% 
  filter(value == max(value)) %>% 
  spread(SampleID, value)
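
For the correlation-based filtering the question describes, a minimal base R sketch might look like the following. It is not the asker's or the answerer's method, only one possible approach under stated assumptions: sample IDs are the column names of the expression matrix and the row names of the phenodata, the target variable is the GA column, and probesets are kept when the absolute Pearson correlation reaches a cutoff. The function name filter_by_cor, the toy data, and the cutoff of 0.5 are all hypothetical and chosen purely for illustration.

```r
# Hypothetical helper: correlate each probeset (row) with a phenodata column
# and keep the probesets whose |Pearson r| is at least `cutoff`.
filter_by_cor <- function(expr_mat, pheno, target = "GA", cutoff = 0.5) {
  stopifnot(is.matrix(expr_mat), is.data.frame(pheno))
  # Keep only samples present in both objects, in one shared order.
  common <- intersect(colnames(expr_mat), rownames(pheno))
  stopifnot(length(common) >= 3)
  expr_sub <- expr_mat[, common, drop = FALSE]
  y <- pheno[common, target]
  # cor() works column-wise, so transpose to put probesets in columns.
  r <- stats::cor(t(expr_sub), y, method = "pearson")[, 1]
  keep <- !is.na(r) & abs(r) >= cutoff
  list(cor = r, filtered = expr_sub[keep, , drop = FALSE])
}

# Toy data (made up): 3 probesets, 6 samples, GA values as in the question.
samples <- paste0("S", 1:6)
pheno <- data.frame(SampleID = samples,
                    GA = c(11, 15.3, 21.7, 26.7, 31.3, 32.1),
                    row.names = samples, stringsAsFactors = FALSE)
exprs_mat <- rbind(probe1 = 2 + 0.1 * pheno$GA,    # rises with GA
                   probe2 = 10 - 0.2 * pheno$GA,   # falls with GA
                   probe3 = c(5, 6, 5, 6, 5, 6))   # weakly related to GA
colnames(exprs_mat) <- samples
exprs_mat <- exprs_mat[, rev(samples)]  # column order differs from pheno

res <- filter_by_cor(exprs_mat, pheno, cutoff = 0.5)
res$cor                  # per-probeset correlation with GA
rownames(res$filtered)   # probe1 and probe2 pass the cutoff in this toy data
```

Indexing both objects by the shared sample IDs keeps each expression column paired with its own GA value, even when the two tables are ordered differently or one contains samples the other lacks. On the real data the call would be filter_by_cor(exprs_mat, pheno) with a cutoff chosen for the analysis at hand.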

Solution 1 (2019-06-21 00:12:34)
