简体   繁体   English

R:自动生存分析

[英]R: Automated Survival Analysis

Below is example data where in genomicmatrix , each row corresponds to a gene ( "sample" ), and each cell corresponds to a value for that gene for a patient after which the column is named (in the format "TCGA-__-____-__" ). 以下是示例数据,其中在genomicmatrix ,每一行对应一个基因( "sample" ),每个单元格对应于该患者的该基因的值,并以该列命名(格式为"TCGA-__-____-__" )。 (Question continues below) (问题在下面继续)

genomicmatrix <- data.frame("sample" =  c("BIX","HEF","TUR","ZOP","VAG"), 
                  "TCGA-K4-6303-01" = runif(5, -1, 1),
                  "TCGA-DM-A28E-01" = runif(5, -1, 1),
                  "TCGA-AY-6197-01" = runif(5, -1, 1),
                  "TCGA-F4-6703-01" = runif(5, -1, 1),
                  "TCGA-HB-KH8H-01" = runif(5, -1, 1),
                  "TCGA-Y7-PIK2-01" = runif(5, -1, 1),
                  "TCGA-A6-5657-01" = runif(5, -1, 1))
colnames(genomicmatrix) <- gsub("[.]", "_",colnames(genomicmatrix))

sample = NULL
sample <- genomicmatrix$sample

genomicmatrix$sample = NULL
means = NULL

for(z in 1:nrow(genomicmatrix)) {
  means[z] <- rowMeans(genomicmatrix[z,])
}

genemeans <- data.frame(sample, means)

So, after finding the mean value for each row (gene) as above, I extract the patient names that have a value for that gene which is GREATER than the mean value for that gene. 因此,在如上所述找到每一行(基因)的平均值之后,我提取出该基因的值比该基因的平均值大的患者姓名。 Those "greater than" patients for each gene go into a list for that gene in an element called high (eg for the fourth gene, the "greater than" patients appear in high[[4]] ). 每个基因的那些“大于”患者会在一个称为“ high的元素中进入该基因的列表(例如,对于第四个基因,“大于”的患者会出现在“ high[[4]] )。 The same goes for "lesser than" patients, which go to an element called low . 这同样适用于“比较小”的患者,哪去了称为元素low

high = NULL
low = NULL
high <- list(list())
low <- list(list())
uplist = NULL
downlist = NULL
for (i in 1:nrow(genomicmatrix)) {
  uplist = NULL
  downlist = NULL
  for (y in seq_along(genomicmatrix)) {
    uplist[y] <- ifelse(genomicmatrix[i,y] > genemeans$means[i], names(genomicmatrix[y]), "")
    downlist[y] <- ifelse(genomicmatrix[i,y] < genemeans$means[i], names(genomicmatrix[y]), "")
    high[[i]] <- uplist
    low[[i]] <- downlist
  }
}

So, for each gene, I split the patients in "high expression" and "low expression" categories. 因此,对于每个基因,我将患者分为“高表达”和“低表达”类别。 For example, the patients that show up in low[[3]] are those that have an expression for the third gene ( "TUR" ) that is lower than the average for that gene. 例如,出现low[[3]]的患者是那些第三基因( "TUR" )的表达低于该基因平均值的患者。 Below, I have a conversion table for patientID - to survival time in days. 下面,我有一个PatientID转换表-以天为单位的生存时间。

survival = NULL
survival$sampleID <- c("TCGA-K4-6303-01", "TCGA-DM-A28E-01", "TCGA-AY-6197-01", "TCGA-F4-6703-01", "TCGA-HB-KH8H-01", "TCGA-Y7-PIK2-01", "TCGA-A6-5657-01")
survival$X_OS <- c(256, 26, 88, 491, 553, 177, 732)
survival$sampleID <- chartr("-", "_", survival$sampleID)

I'd like to, from that setup, extract log rank test pvalues for each gene. 我想从该设置中提取每个基因的对数秩检验p值。 That is, for gene 1 ( "BIX" ) for example, given the Kaplan-Meier survival curves for high expression versus low expression (ie high[[1]] vs low[[1]] ), I wish to extract the corresponding pvalue coming from a log rank test of those two vectors (answering the question: is there a significant difference in survival outcome for high expression and low expression patients FOR THAT GENE?). 也就是说,例如对于基因1( "BIX" ),给定高表达与低表达的Kaplan-Meier生存曲线(即high[[1]]low[[1]] ),我希望提取相应的pvalue来自这两个向量的对数秩检验(回答这个问题:对于那个基因,高表达和低表达患者的生存结果是否存在显着差异?)。 Once that pvalue is derived, it should of course move on to the next gene. 一旦得出该p值,它当然应该继续进入下一个基因。

(If you're only asking for a tool to perform an operation or for statistical advice, then StackOverflow might not be the right place for this question.) (如果您只要求执行操作的工具或提供统计建议,则StackOverflow可能不是解决此问题的合适位置。)

Nonetheless, I'd suggest some improvements with the format of your data and your code that should be helpful in achieving your goals in R. If you have not significant memory constraints, you could transform your "genomicmatrix" into a long format "data.frame": 尽管如此,我还是建议您对数据格式和代码进行一些改进,这些改进应该有助于实现R中的目标。如果您没有明显的内存限制,则可以将“ genomicmatrix”转换为长格式的“ data”。帧”:

longDF = reshape(genomicmatrix, direction = "long", idvar = "sample", 
                 varying = list(2:8), times = colnames(genomicmatrix[-1]), 
                 timevar = "ID", v.names = "value")
row.names(longDF) = NULL
head(longDF)
#  sample              ID      value
#1    BIX TCGA_K4_6303_01 -0.4811441
#2    HEF TCGA_K4_6303_01 -0.2665017
#3    TUR TCGA_K4_6303_01  0.8367469
#4    ZOP TCGA_K4_6303_01 -0.5868480
#5    VAG TCGA_K4_6303_01 -0.0319600
#6    BIX TCGA_DM_A28E_01  0.3435170

Then you could find out which patients have higher and lower than mean expression and create a "data.frame": 然后,您可以找出哪些患者的平均表达高于和低于平均表达,并创建一个“ data.frame”:

exprs =  do.call(rbind, 
                 lapply(split(longDF, longDF$sample), 
                        function(x) { 
                           x$expr = ifelse(findInterval(x$value, mean(x$value)) == 1, 
                                           "high", 
                                           "low") 
                           x 
                        }))
row.names(exprs) = NULL
head(exprs)
#  sample              ID      value expr
#1    BIX TCGA_K4_6303_01 -0.4811441  low
#2    BIX TCGA_DM_A28E_01  0.3435170 high
#3    BIX TCGA_AY_6197_01  0.2269158 high
#4    BIX TCGA_F4_6703_01 -0.8283441  low
#5    BIX TCGA_HB_KH8H_01  0.4024671 high
#6    BIX TCGA_Y7_PIK2_01 -0.2979979  low

Then add "survival$X_OS": 然后添加“ survival $ X_OS”:

exprs$X_OS = survival$X_OS[match(exprs$ID, survival$sampleID)]
head(exprs)
#  sample              ID      value expr X_OS
#1    BIX TCGA_K4_6303_01 -0.4811441  low  256
#2    BIX TCGA_DM_A28E_01  0.3435170 high   26
#3    BIX TCGA_AY_6197_01  0.2269158 high   88
#4    BIX TCGA_F4_6703_01 -0.8283441  low  491
#5    BIX TCGA_HB_KH8H_01  0.4024671 high  553
#6    BIX TCGA_Y7_PIK2_01 -0.2979979  low  177

Then, assuming you have a function log_rank_test that takes two vectors and outputs a "p.value" you could use something like: 然后,假设您有一个函数log_rank_test ,它接受两个向量并输出一个“ p.value”,则可以使用类似以下内容的方法:

#lapply(split(exprs[c("expr", "X_OS")], exprs$sample), 
#        function(x) log_rank_test(x$X_OS[x$expr == "high"], x$X_OS[x$expr == "low"]))

I'm attempting a "data.table" approach, although it might not be idiomatic or could be improved since I'm not familiar with it: 我正在尝试“ data.table”方法,尽管由于我不熟悉它,它可能不是惯用的或可以改进的:

library(data.table)
library(reshape2)
DT = as.data.table(genomicmatrix)
longDT = melt(DT, "sample", variable.name = "ID")
longDT[, expr := ifelse(findInterval(value, mean(value)) == 1, "high", "low"), by = sample]
longDT[, X_OS := survival$X_OS[match(ID, survival$sampleID)]]
head(longDT)
#   sample              ID      value expr X_OS
#1:    BIX TCGA_K4_6303_01 -0.4811441  low  256
#2:    HEF TCGA_K4_6303_01 -0.2665017  low  256
#3:    TUR TCGA_K4_6303_01  0.8367469 high  256
#4:    ZOP TCGA_K4_6303_01 -0.5868480  low  256
#5:    VAG TCGA_K4_6303_01 -0.0319600  low  256
#6:    BIX TCGA_DM_A28E_01  0.3435170 high   26

And the, run your log_rank_test function like: 然后,运行您的log_rank_test函数,如下所示:

#longDT[, log_rank_test(X_OS[expr == "high"], X_OS[expr == "low"]), by = sample]

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM