
How to run the psych package in parallel?

I'm using the psych package to compute tetrachoric correlations for a very large dataset, comprising 1000 variables and 288,059 cases.

The data can be downloaded here:

https://www.dropbox.com/s/iqwgdywqfjvlkku/data.csv.zip?dl=0 (4MB)

My code looks like the following:

library(psych)
library(tidyverse)

temp = read.csv("~/Temp/data.csv", sep=",")

tetravalues = tetrachoric(temp, delete=FALSE)

tetraframe = tetravalues$rho

write.csv(tetraframe, file="~/Temp/output.csv")

Currently, this bit of the code has been running for 8 hours and hasn't finished yet:

tetravalues = tetrachoric(temp, delete=FALSE)

According to the psych package manual (tetrachoric):

This is a computationally intensive function which can be speeded up considerably by using multiple cores and using the parallel package. The number of cores to use when doing polychoric or tetrachoric may be specified using the options command. The greatest step up in speed is going from 1 cores to 2. This is about a 50% savings. Going to 4 cores seems to have about at 66% savings, and 8 a 75% savings. The number of parallel processes defaults to 2 but can be modified by using the options command: options("mc.cores"=4) will set the number of cores to 4.

My laptop has 10 cores.

I'm new to R, and I haven't been able to figure out how to run my code in parallel.

Any ideas are appreciated.

You basically provided the answer to your question yourself.

You can adjust the number of cores in the code below. Note that if you want to keep using your laptop for other things while the computation is running, I would not set the number of cores to the maximum.
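One way to pick a value that leaves some headroom is to ask R how many cores it sees and subtract a few. This is only a minimal sketch, assuming the parallel package that ships with R; the reserve of 2 cores and the names n_available and n_use are arbitrary choices for illustration, not something the psych manual recommends.

```r
library(parallel)

# Number of logical cores reported by the operating system.
# detectCores() can return NA on platforms where this is unknown.
n_available <- detectCores()

# Keep 2 cores free for other work (arbitrary reserve), but never go below 1.
# na.rm = TRUE makes this fall back to 1 if detectCores() returned NA.
n_use <- max(1, n_available - 2, na.rm = TRUE)

# Set the option that psych reads, according to the manual quoted in the question.
options(mc.cores = n_use)
getOption("mc.cores")
```

On the 10-core laptop from the question this would likely give 8 or more, depending on how many logical cores the system reports.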

Here is a quick intro about parallel computing in R.

library(psych)
library(tidyverse)

# Pick the number of cores here; set it before calling tetrachoric().
options("mc.cores"=4)

temp = read.csv("~/Temp/data.csv", sep=",")

tetravalues = tetrachoric(temp, delete=FALSE)

tetraframe = tetravalues$rho

write.csv(tetraframe, file="~/Temp/output.csv")
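Before committing to the full run, it may help to get a rough idea of the workload. A correlation matrix needs one estimate per pair of variables, so 1000 variables means 499,500 tetrachoric correlations. The sketch below counts the pairs and extrapolates a full run time from a timed subset. The helper names n_pairs and estimate_full_seconds are made up for this example, and the extrapolation assumes run time grows in proportion to the number of pairs, which is only an approximation.

```r
# Number of variable pairs (correlations to estimate) for p variables.
n_pairs <- function(p) choose(p, 2)

n_pairs(1000)  # 499500 pairs for the full data set

# Rough extrapolation from a timed subset to the full data set.
# Assumption: elapsed time is proportional to the number of pairs.
estimate_full_seconds <- function(subset_seconds, p_subset, p_full = 1000) {
  subset_seconds * n_pairs(p_full) / n_pairs(p_subset)
}

# Possible usage with the data from the question (not run here):
# subset_seconds <- system.time(
#   tetrachoric(temp[, 1:50], delete = FALSE)
# )[["elapsed"]]
# estimate_full_seconds(subset_seconds, p_subset = 50) / 3600  # hours
```

Timing the same subset under a few different mc.cores settings would also show how much the extra cores actually help on your machine.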



 