How to create a more efficient simulation loop for Monte Carlo in R
The purpose of this exercise is to create a population distribution of nutrient intake values. There were repeated measures in the earlier data; these have been removed so each row is a unique person in the data frame.

I have this code, which works quite well when tested with a small number of my data frame rows. For all 7135 rows, it is very slow. I tried to time it, but I killed the run when the elapsed running time on my machine was 15 hours. The system.time results were Timing stopped at: 55625.08 2985.39 58673.87.

I would appreciate any comments on speeding up the simulation:
Male.MC <- c()
for (j in 1:100) {
  for (i in 1:nrow(Male.Distrib)) {
    u2 <- Male.Distrib$stddev_u2[i] * rnorm(1, mean = 0, sd = 1)
    mc_bca <- Male.Distrib$FixedEff[i] + u2
    temp <- Lambda.Value*mc_bca+1
    ginv_a <- temp^(1/Lambda.Value)
    d2ginv_a <- max(0,(1-Lambda.Value)*temp^(1/Lambda.Value-2))
    mc_amount <- ginv_a + d2ginv_a * Male.Resid.Var / 2
    z <- data.frame(
      RespondentID = Male.Distrib$RespondentID[i],
      Subgroup = Male.Distrib$Subgroup[i],
      mc_amount = mc_amount,
      IndvWeight = Male.Distrib$INDWTS[i]/100
    )
    Male.MC <- as.data.frame(rbind(Male.MC,z))
  }
}
For each of the 7135 observations in my dataset, 100 simulated nutrient values are created, then back-transformed to the original measurement scale (the simulation uses the results from a nonlinear mixed effects model on Box-Cox transformed nutrient values).

I would prefer not to use for loops, as I have read that they are inefficient in R, but I do not understand enough about the apply-based options to use those as an alternative. R is being run on stand-alone machines, normally a standard Dell-type desktop running a Windows 7 variant, if that influences the recommendations for how to change the code.
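A note on where the time probably goes: in the loop above, the likely bottleneck is less the for loop itself than the rbind(Male.MC, z) step, which copies the whole accumulated data frame on each of the 713,500 iterations. The sketch below uses a small made-up data frame (toy, with illustrative values, not the real Male.Distrib) to show that growing a data frame row by row and computing the column in one vectorized step give the same numbers when fed the same random draws.

```r
# Toy stand-in for Male.Distrib (values are made up for illustration).
toy <- data.frame(RespondentID = 1:5,
                  FixedEff     = c(6.1, 6.8, 7.1, 6.1, 6.2),
                  stddev_u2    = 2.645938)

set.seed(1)
draws <- rnorm(nrow(toy))   # one standard normal draw per row

# Slow pattern: grow a data frame one row at a time.
grown <- c()
for (i in seq_len(nrow(toy))) {
  z <- data.frame(RespondentID = toy$RespondentID[i],
                  mc_bca       = toy$FixedEff[i] + toy$stddev_u2[i] * draws[i])
  grown <- rbind(grown, z)  # copies everything accumulated so far
}

# Fast pattern: compute the whole column in one vectorized step.
vectorized <- data.frame(RespondentID = toy$RespondentID,
                         mc_bca       = toy$FixedEff + toy$stddev_u2 * draws)

# Compare the two results.
all.equal(grown$mc_bca, vectorized$mc_bca)
```

The two patterns differ only in cost: the first does work proportional to the square of the number of rows, the second to the number of rows.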
Update: To reproduce this for testing, Lambda.Value = 0.4 and Male.Resid.Var = 12.1029420429778, and Male.Distrib$stddev_u2 is a constant value over all observations.

str(Male.Distrib) is:
'data.frame': 7135 obs. of 14 variables:
$ RndmEff : num 1.34 -5.86 -3.65 2.7 3.53 ...
$ RespondentID: num 9966 9967 9970 9972 9974 ...
$ Subgroup : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 4 3 2 4 1 4 2 5 1 2 ...
$ RespondentID: int 9966 9967 9970 9972 9974 9976 9978 9979 9982 9993 ...
$ Replicates : num 41067 2322 17434 21723 375 ...
$ IntakeAmt : num 33.45 2.53 9.58 43.34 55.66 ...
$ RACE : int 2 3 2 2 3 2 2 2 2 1 ...
$ INDWTS : num 41067 2322 17434 21723 375 ...
$ TOTWTS : num 1.21e+08 1.21e+08 1.21e+08 1.21e+08 1.21e+08 ...
$ GRPWTS : num 41657878 22715139 10520535 41657878 10791729 ...
$ NUMSUBJECTS : int 1466 1100 1424 1466 1061 1466 1424 1252 1061 1424 ...
$ TOTSUBJECTS : int 7135 7135 7135 7135 7135 7135 7135 7135 7135 7135 ...
$ FixedEff : num 6.09 6.76 7.08 6.09 6.18 ...
$ stddev_u2 : num 2.65 2.65 2.65 2.65 2.65 ...
head(Male.Distrib) is:
RndmEff RespondentID Subgroup RespondentID Replicates IntakeAmt RACE INDWTS TOTWTS GRPWTS NUMSUBJECTS TOTSUBJECTS FixedEff stddev_u2
1 1.343753 9966 6 9966 41067 33.449808 2 41067 120622201 41657878 1466 7135 6.089918 2.645938
2 -5.856516 9967 5 9967 2322 2.533528 3 2322 120622201 22715139 1100 7135 6.755664 2.645938
3 -3.648339 9970 4 9970 17434 9.575439 2 17434 120622201 10520535 1424 7135 7.079757 2.645938
4 2.697533 9972 6 9972 21723 43.340180 2 21723 120622201 41657878 1466 7135 6.089918 2.645938
5 3.531878 9974 3 9974 375 55.660607 3 375 120622201 10791729 1061 7135 6.176319 2.645938
6 6.627767 9976 6 9976 48889 91.480049 2 48889 120622201 41657878 1466 7135 6.089918 2.645938
Update 2: the line of the function that is causing the NaN results is
d2ginv_a <- max(0,(1-Lambda.Value)*temp^(1/Lambda.Value-2))
Thanks to everyone for their assistance and comments, and also for the speed of responses.
Update: @Ben Bolker is correct that it is the negative temp values that are causing the NaN issue. I missed this in some of my testing (after commenting out the function so that only the temp values are returned, and calling my result data frame Test). This code reproduces the NaN issue:
> min(Test)
[1] -2.103819
> min(Test)^(1/Lambda.Value)
[1] NaN
But typing the value in as a literal and then running the same(?) calculation gives me a result, so I missed this when doing manual calculations:
> -2.103819^(1/Lambda.Value)
[1] -6.419792
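The difference between the two results above comes from operator precedence: in R, ^ binds more tightly than unary minus, so -2.103819^(1/Lambda.Value) is evaluated as -(2.103819^(1/Lambda.Value)), whereas min(Test)^(1/Lambda.Value) raises a negative number to a fractional power, which is NaN for ordinary (double) values. A short check, using the Lambda.Value given earlier:

```r
Lambda.Value <- 0.4
x <- -2.103819                      # the minimum temp value reported above

a <- -2.103819^(1 / Lambda.Value)   # parsed as -(2.103819^(1/0.4)): finite and negative
b <- x^(1 / Lambda.Value)           # negative base, fractional exponent: NaN
c(a, b)
is.nan(b)
```

Wrapping the literal in parentheses, as in (-2.103819)^(1/Lambda.Value), reproduces the NaN in a manual calculation.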
I now have working code that (I think) uses vectorization, and it is blindingly fast. Just in case anyone else has this issue, I am posting the working code below. I've had to add a minimum to prevent the <0 issue with the calculation. Thank you to everyone who helped, and to coffee.

I did try putting the rnorm results into a data frame, and that really slowed things down; creating them this way and then using cbind is really quick. Male.Distrib is my full data frame of 7135 observations, but this code should work on the cut-down version I posted earlier (not tested).
# Floor on the Box-Cox scale: the transform of half the smallest observed intake
Min_bca <- ((.5*min(Male.AddSugar$IntakeAmt))^Lambda.Value-1)/Lambda.Value
# Repeat every row of Male.Distrib 100 times (one copy per simulated value)
Test <- Male.Distrib[rep(seq.int(1,nrow(Male.Distrib)), 100), 1:ncol(Male.Distrib)]
# Draw all the standard normal values in a single call
RnormOutput <- rnorm(nrow(Test),0,1)
Male.Final <- cbind(Test,RnormOutput)
Male.Final$mc_bca <- Male.Final$FixedEff + (Male.Final$stddev_u2 * Male.Final$RnormOutput)
# Replace any temp value below the floor with the floor itself
Male.Final$temp <- ifelse(Lambda.Value*Male.Final$mc_bca+1 > Lambda.Value*Min_bca+1,
                          Lambda.Value*Male.Final$mc_bca+1, Lambda.Value*Min_bca+1)
Male.Final$ginv_a <- Male.Final$temp^(1/Lambda.Value)
Male.Final$d2ginv_a <- ifelse(0 > (1-Lambda.Value)*Male.Final$temp^(1/Lambda.Value-2),
                              0, (1-Lambda.Value)*Male.Final$temp^(1/Lambda.Value-2))
Male.Final$mc_amount <- Male.Final$ginv_a + Male.Final$d2ginv_a * Male.Resid.Var / 2
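As a possible simplification (not part of the code above), base R's pmax takes the element-wise maximum of two vectors, which is what the two ifelse calls compute. A minimal sketch on made-up temp values, with a hypothetical floor standing in for Lambda.Value*Min_bca+1:

```r
temp_raw   <- c(-2.1, 0.5, 3.4, 4.2)  # made-up values, including a negative one
temp_floor <- 0.8                     # hypothetical stand-in for Lambda.Value*Min_bca+1

via_ifelse <- ifelse(temp_raw > temp_floor, temp_raw, temp_floor)
via_pmax   <- pmax(temp_raw, temp_floor)   # element-wise maximum, floor is recycled
all.equal(via_ifelse, via_pmax)

# max() instead collapses all of its arguments to a single number:
max(temp_raw, temp_floor)
```

With pmax the comparison expression is written once instead of three times, which may make the floor logic easier to read.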
Lessons for the day: max() cannot be used the way I tried, as it returns the maximum value from the column, whereas I wanted the maximum of two values. The ifelse statement is the replacement to use.

Here is an approach that addresses the 2 biggest speed issues:

1. Instead of looping over observations (i), we compute them all at once.
2. Instead of looping over MC replications (j), we use replicate, which is a simplified apply meant for this purpose.

First we load the dataset and define a function for what you were doing.
Male.Distrib <- read.table('MaleDistrib.txt', check.names = FALSE)
getMC <- function(df, Lambda.Value = 0.4, Male.Resid.Var = 12.1029420429778) {
  # One random-effect draw per row, all generated in a single call
  u2 <- df$stddev_u2 * rnorm(nrow(df), mean = 0, sd = 1)
  mc_bca <- df$FixedEff + u2
  # Back-transform from the Box-Cox scale
  temp <- Lambda.Value*mc_bca+1
  ginv_a <- temp^(1/Lambda.Value)
  # pmax() is element-wise; max() would return one value for the whole vector
  d2ginv_a <- pmax(0, (1-Lambda.Value)*temp^(1/Lambda.Value-2))
  mc_amount <- ginv_a + d2ginv_a * Male.Resid.Var / 2
  mc_amount
}
Then we replicate it a bunch of times.
> replicate(10, getMC(Male.Distrib))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 36.72374 44.491777 55.19637 23.53442 23.260609 49.56022 31.90657 25.26383 25.31197 20.58857
[2,] 29.56115 18.593496 57.84550 22.01581 22.906528 22.15470 29.38923 51.38825 13.45865 21.47531
[3,] 61.27075 10.140378 75.64172 28.10286 9.652907 49.25729 23.82104 31.77349 16.24840 78.02267
[4,] 49.42798 22.326136 33.87446 14.00084 25.107143 25.75241 30.20490 33.14770 62.86563 27.33652
[5,] 53.45546 9.673162 22.66676 38.76392 30.786100 23.42267 28.40211 35.95015 43.75506 58.83676
[6,] 34.72440 23.786004 63.57919 8.08238 12.636745 34.11844 14.88339 21.93766 44.53451 51.12331
Then you can reformat, add IDs, etc., but this is the idea for the main computational part. Good luck!
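For the reformatting step, one possible sketch follows. It repeats a compact version of getMC so that it runs on its own, uses a small made-up data frame in place of Male.Distrib, and builds a long data frame with one row per person per replicate. The Replicate column and the toy values are illustrative only; the other column names follow the data frame built in the question.

```r
# Compact copy of the simulation function so this block is self-contained;
# pmax() is used for the element-wise floor at zero.
getMC <- function(df, Lambda.Value = 0.4, Male.Resid.Var = 12.1029420429778) {
  mc_bca   <- df$FixedEff + df$stddev_u2 * rnorm(nrow(df))
  temp     <- Lambda.Value * mc_bca + 1
  ginv_a   <- temp^(1 / Lambda.Value)
  d2ginv_a <- pmax(0, (1 - Lambda.Value) * temp^(1 / Lambda.Value - 2))
  ginv_a + d2ginv_a * Male.Resid.Var / 2
}

# Toy stand-in for Male.Distrib (a few made-up rows).
toy <- data.frame(RespondentID = c(9966, 9967, 9970),
                  Subgroup     = c("6", "5", "4"),
                  INDWTS       = c(41067, 2322, 17434),
                  FixedEff     = c(6.089918, 6.755664, 7.079757),
                  stddev_u2    = 2.645938)

n_rep <- 4                             # 100 in the original question
set.seed(42)
sims <- replicate(n_rep, getMC(toy))   # matrix: rows = people, columns = replicates

# as.vector() unrolls the matrix column by column, so the IDs repeat once per replicate.
Male.MC <- data.frame(
  RespondentID = rep(toy$RespondentID, times = n_rep),
  Subgroup     = rep(toy$Subgroup,     times = n_rep),
  Replicate    = rep(seq_len(n_rep),   each  = nrow(toy)),
  mc_amount    = as.vector(sims),
  IndvWeight   = rep(toy$INDWTS, times = n_rep) / 100
)
dim(Male.MC)
```

Note that this sketch does not floor temp, so a sufficiently negative draw would still produce NaN, as discussed in the question's updates.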