Using multicore in R to analyse GWAS data

I am using R to analyze genome-wide association study (GWAS) data. I have about 500,000 potential predictor variables (single-nucleotide polymorphisms, or SNPs) and want to test the association between each of them and a continuous outcome (in this case, low-density lipoprotein (LDL) concentration in the blood).

I have already written a script that does this without problems. To explain briefly, I have a data object called "Data". Each row corresponds to a particular patient in the study. There are columns for age, gender, body mass index (BMI), and blood LDL concentration. There are also half a million other columns with the SNP data.

I am currently using a for loop to run the linear model half a million times, as shown:

# Repeat loop half a million times
for(i in 1:500000) {

# Select the appropriate SNP as a vector ([[ ]] is used because [ ] would return a one-column data frame)
SNP <- Data[[i]]

# For each iteration, perform linear regression adjusted for age, gender, and BMI and save the result in an object called "GenoMod"
GenoMod  <- lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)

# For each model, save the p value and effect estimate for the SNP in columns 1 and 2 of a matrix called "results". The coefficient row is named after the model term, "SNP"
results[i,1] <- summary(GenoMod)$coefficients["SNP","Pr(>|t|)"]
results[i,2] <- summary(GenoMod)$coefficients["SNP","Estimate"]
}

All of that works fine. However, I would really like to speed up my analysis. I've therefore been experimenting with the multicore, doMC, and foreach packages.

My question is: could someone please help me adapt this code using the foreach scheme?

I am running the script on a Linux server that apparently has 16 cores available. I've tried experimenting with the foreach package, but my results have been worse: the analysis takes longer to run with foreach than without it.

For example, I've tried saving the linear model objects as shown:

library(doMC)
registerDoMC()
results <- foreach(i=1:500000) %dopar% { lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data) }

This takes more than twice as long as a regular for loop. Any advice on how to do this better or more quickly would be appreciated! I understand that the parallel version of lapply might be an option, but I don't know how to use that either.

All the best,

Alex

To get you started: if you use Linux, you can use the multicore approach contained within the parallel package. With e.g. the foreach package you have to set up the whole thing first; with this approach that is no longer necessary. Your code would run on 16 cores by simply doing:

library(parallel)

mylm <- function(i){
  # Data[[i]] extracts column i as a vector, so the model term is named "SNP"
  SNP <- Data[[i]]
  GenoMod <- lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)
  # Return the p value and the estimate for the SNP term as a vector
  coefs <- summary(GenoMod)$coefficients
  c(coefs["SNP","Pr(>|t|)"], coefs["SNP","Estimate"])
}

Out <- mclapply(1:500000, mylm, mc.cores = 16) # returns a list
Result <- do.call(rbind, Out) # bind the list into a matrix

Here you make a function that returns a vector with the wanted quantities, and apply it over the indices. I couldn't check this, as I don't have access to the data, but it should work.
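Since the real data isn't available, the approach can at least be tried on simulated data. The following is only a minimal sketch: the data frame `covars`, the genotype matrix `geno`, the helper `fit_snp`, and the sizes (200 patients, 50 SNPs, 2 cores) are all made up for illustration and are not taken from the question. Keeping the SNPs in a separate matrix from the covariates is also an assumption made here to keep the indexing simple.

```r
library(parallel)

# Simulated stand-in for the real study data (names and sizes are assumptions)
set.seed(1)
n_patients <- 200
n_snps <- 50
covars <- data.frame(
  bloodLDLlevel = rnorm(n_patients, mean = 3, sd = 1),
  Age = round(runif(n_patients, 30, 80)),
  Gender = factor(sample(c("F", "M"), n_patients, replace = TRUE)),
  BMI = rnorm(n_patients, mean = 26, sd = 4)
)

# Genotypes coded 0/1/2, one column per SNP
geno <- matrix(rbinom(n_patients * n_snps, size = 2, prob = 0.3),
               nrow = n_patients,
               dimnames = list(NULL, paste0("snp", seq_len(n_snps))))

# Fit one adjusted model and return only the two numbers of interest,
# so each worker sends back a small vector instead of a whole lm object
fit_snp <- function(j) {
  d <- covars
  d$SNP <- geno[, j]
  fit <- lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = d)
  coefs <- summary(fit)$coefficients
  coefs["SNP", c("Pr(>|t|)", "Estimate")]
}

# Forking is not available on Windows, so fall back to a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

out <- mclapply(seq_len(n_snps), fit_snp, mc.cores = n_cores)
result <- do.call(rbind, out)       # one row per SNP, two columns
rownames(result) <- colnames(geno)
head(result)
```

Comparing `result` against a plain `lapply` run over the same indices is a cheap way to confirm that the parallel version gives the same numbers before scaling up to the full set of SNPs. Note that a SNP with no variation would have its coefficient dropped from the summary table, so the `"SNP"` lookup would fail for it; real data may need a check for that.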
