
How to correctly filter a gene expression matrix by correlation value?

I have preprocessed Affymetrix microarray gene expression data (32830 probesets in rows, 735 RNA samples in columns). Here is what my expression matrix looks like:

> exprs_mat[1:6, 1:4]
             Tarca_001_P1A01 Tarca_003_P1A03 Tarca_004_P1A04 Tarca_005_P1A05
1_at                6.062215        6.125023        5.875502        6.126131
10_at               3.796484        3.805305        3.450245        3.628411
100_at              5.849338        6.191562        6.550525        6.421877
1000_at             3.567779        3.452524        3.316134        3.432451
10000_at            6.166815        5.678373        6.185059        5.633757
100009613_at        4.443027        4.773199        4.393488        4.623783

I also have phenodata for this Affymetrix expression data (RNA sample identifiers in rows, sample descriptions in columns):

> pheno[1:6, 1:4]
                       SampleID   GA Batch     Set
Tarca_001_P1A01 Tarca_001_P1A01 11.0     1 PRB_HTA
Tarca_013_P1B01 Tarca_013_P1B01 15.3     1 PRB_HTA
Tarca_025_P1C01 Tarca_025_P1C01 21.7     1 PRB_HTA
Tarca_037_P1D01 Tarca_037_P1D01 26.7     1 PRB_HTA
Tarca_049_P1E01 Tarca_049_P1E01 31.3     1 PRB_HTA
Tarca_061_P1F01 Tarca_061_P1F01 32.1     1 PRB_HTA

Since the sample identifiers are in the rows of the phenodata, I need a way to match the SampleID values in the phenodata with the sample IDs (column names) of the expression matrix exprs_mat.
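One way to do this matching, sketched on toy stand-ins for exprs_mat and pheno (the values and the GA assigned to Tarca_004_P1A04 are illustrative, not taken from the real data), is to intersect the column names of the matrix with the row names of the phenodata and subset both objects by name:

```r
# Toy stand-ins for the objects in the question (values are illustrative).
exprs_mat <- matrix(
  c(6.06, 3.80, 5.85,
    6.13, 3.81, 6.19,
    5.88, 3.45, 6.55),
  nrow = 3,
  dimnames = list(c("1_at", "10_at", "100_at"),
                  c("Tarca_001_P1A01", "Tarca_003_P1A03", "Tarca_004_P1A04"))
)
pheno <- data.frame(
  SampleID = c("Tarca_004_P1A04", "Tarca_001_P1A01", "Tarca_013_P1B01"),
  GA = c(21.7, 11.0, 15.3),
  stringsAsFactors = FALSE
)
rownames(pheno) <- pheno$SampleID

# Samples present in both objects, in the column order of the expression matrix.
common <- intersect(colnames(exprs_mat), rownames(pheno))

# Subset both by name so that column i of exprs_sub corresponds to row i of pheno_sub.
exprs_sub <- exprs_mat[, common, drop = FALSE]
pheno_sub <- pheno[common, , drop = FALSE]

# Guard against silent misalignment before any downstream analysis.
stopifnot(identical(colnames(exprs_sub), rownames(pheno_sub)))
```

Subsetting by name rather than by position means the two objects stay aligned even if the phenodata rows are in a different order or contain samples that are missing from the matrix.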

OBJECTIVE:

I want to filter the genes in the expression matrix by measuring the correlation between each gene and the target profile data in the phenodata. Here is my initial attempt, but I am not sure it is correct:

Update: my implementation in R:

I want to see how each gene's expression across samples correlates with the GA value of the corresponding samples in the annotation data. Here is my simple function to compute this correlation in R:

getPCC <- function(expr_mat, anno_mat, verbose=FALSE){
stopifnot(class(expr_mat)=="matrix")
stopifnot(class(anno_mat)=="matrix")
stopifnot(ncol(expr_mat)==nrow(anno_mat))
final_df <- as.data.frame()
lapply(colnames(expr_mat), function(x){
    lapply(x, rownames(y){
        if(colnames(x) %in% rownames(anno_mat)){
            cor_mat <- stats::cor(y, anno_mat$GA, method = "pearson")
            ncor <- ncol(cor_mat)
            cmatt <- col(cor_mat)
            ord <- order(-cmat, cor_mat, decreasing = TRUE)- (ncor*cmatt - ncor)
            colnames(ord) <- colnames(cor_mat)
            res <- cbind(ID=c(cold(ord), ID2=c(ord)))
            res <- as.data.frame(cbind(out, cor=cor_mat[res]))
            final_df <- cbind(res, cor=cor_mat[out])
        }
    })
})
return(final_df)

}

However, the script above does not return the output I expect. How can I make this work correctly? Any thoughts?
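For reference, a minimal base R sketch of what the function appears to be aiming for. It is not the asker's method: the name get_pcc, the arguments target and min_abs_cor, and the 0.5 cutoff are all assumptions, and the phenodata is assumed to be a data frame with sample IDs as row names (the `$GA` access in the attempt would not work on a matrix). Because `cor()` correlates columns, the expression matrix is transposed so that each probeset becomes a column.

```r
# Hypothetical rewrite; function name, arguments and threshold are assumptions.
get_pcc <- function(expr_mat, pheno, target = "GA", min_abs_cor = 0.5) {
  stopifnot(is.matrix(expr_mat), is.data.frame(pheno), target %in% names(pheno))

  # Keep only samples found in both inputs, in the same order.
  common <- intersect(colnames(expr_mat), rownames(pheno))
  stopifnot(length(common) >= 3)
  expr_sub <- expr_mat[, common, drop = FALSE]
  target_vec <- pheno[common, target]

  # Transpose so samples are rows and probesets are columns;
  # the result is a (probesets x 1) matrix of Pearson correlations.
  pcc <- stats::cor(t(expr_sub), target_vec, method = "pearson")

  res <- data.frame(probe = rownames(pcc), cor = as.vector(pcc),
                    stringsAsFactors = FALSE)
  # Drop undefined correlations (e.g. constant probesets) and weak ones.
  res <- res[!is.na(res$cor) & abs(res$cor) >= min_abs_cor, , drop = FALSE]
  res[order(-abs(res$cor)), , drop = FALSE]
}

# Toy data: one probeset rising with GA, one falling, one unrelated.
ga <- c(11, 15, 21, 27, 31)
samples <- paste0("S", 1:5)
expr_toy <- rbind(up_at = 2 + 0.1 * ga,
                  down_at = 8 - 0.1 * ga,
                  flat_at = c(5, 6, 5, 6, 5))
colnames(expr_toy) <- samples
pheno_toy <- data.frame(GA = ga, row.names = samples)
pheno_toy <- pheno_toy[rev(samples), , drop = FALSE]  # deliberately shuffled

hits <- get_pcc(expr_toy, pheno_toy, min_abs_cor = 0.5)
```

Filtering on the absolute value keeps both positively and negatively correlated probesets; whether that is wanted, and which cutoff is sensible, depends on the analysis.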

Does something like this help?

library(tidyverse)

x <- data.frame(stringsAsFactors=FALSE,
     Levels = c("1_at", "10_at", "100_at", "1000_at", "10000_at", "100009613_at"),
     Tarca_001_P1A01 = c(6.062215, 3.796484, 5.849338, 3.567779, 6.166815,
                           4.443027),
     Tarca_003_P1A03 = c(6.125023, 3.805305, 6.191562, 3.452524, 5.678373,
                           4.773199),
     Tarca_004_P1A04 = c(5.875502, 3.450245, 6.550525, 3.316134, 6.185059,
                           4.393488),
     Tarca_005_P1A05 = c(6.126131, 3.628411, 6.421877, 3.432451, 5.633757,
                           4.623783)
     )


y <- data.frame(stringsAsFactors=FALSE,
     gene = c("Tarca_001_P1A01", "Tarca_013_P1B01", "Tarca_025_P1C01",
              "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01"),
     SampleID = c("Tarca_001_P1A01", "Tarca_013_P1B01", "Tarca_025_P1C01",
                    "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01"),
     GA = c(11, 15.3, 21.7, 26.7, 31.3, 32.1),
     Batch = c(1, 1, 1, 1, 1, 1),
     Set = c("PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA")
     )



x %>% gather(SampleID, value, -Levels) %>% 
  left_join(., y, by = "SampleID") %>% 
  group_by(SampleID) %>% 
  filter(value == max(value)) %>% 
  spread(SampleID, value)
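The pipeline above keeps, for each sample, the probeset with the highest expression value; it does not compute a correlation. If the goal is to filter probesets by their correlation with GA, the same long-format approach could be adapted roughly as follows. This is only a sketch under assumptions: the toy data frames x_toy and y_toy, the 0.5 cutoff, and the use of `pivot_longer()` (the newer tidyr replacement for `gather()`) are not from the original post.

```r
library(dplyr)
library(tidyr)

# Toy inputs in the same shape as x and y above (values are illustrative).
x_toy <- data.frame(
  Levels = c("up_at", "down_at", "flat_at"),
  S1 = c(3.1, 6.9, 5), S2 = c(3.5, 6.5, 6), S3 = c(4.1, 5.9, 5),
  S4 = c(4.7, 5.3, 6), S5 = c(5.1, 4.9, 5),
  stringsAsFactors = FALSE
)
y_toy <- data.frame(
  SampleID = c("S1", "S2", "S3", "S4", "S5"),
  GA = c(11, 15, 21, 27, 31),
  stringsAsFactors = FALSE
)

cor_tbl <- x_toy %>%
  # One row per (probeset, sample) pair.
  pivot_longer(cols = -Levels, names_to = "SampleID", values_to = "value") %>%
  # inner_join drops samples that have no phenodata entry.
  inner_join(y_toy, by = "SampleID") %>%
  # One Pearson correlation per probeset, across samples.
  group_by(Levels) %>%
  summarise(cor = cor(value, GA, method = "pearson"), .groups = "drop") %>%
  filter(abs(cor) >= 0.5) %>%
  arrange(desc(abs(cor)))
```

For a matrix with tens of thousands of probesets, reshaping to long format is likely to be slower than calling `cor()` once on the transposed matrix, so this form is mainly convenient for small subsets or for piping into further dplyr steps.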
