How to create a more efficient simulation loop for Monte Carlo in R

The purpose of this exercise is to create a population distribution of nutrient intake values. The earlier data contained repeated measures; these have been removed, so each row in the data frame is a unique person.

I have this code, which works quite well when tested on a small number of rows from my data frame. For all 7135 rows, it is very slow. I tried to time it, but I killed the run after more than 15 hours of elapsed time on my machine. The system.time result was: Timing stopped at: 55625.08 2985.39 58673.87

I would appreciate any comments on speeding up the simulation:

Male.MC <- c()
for (j in 1:100) {                      # 100 Monte Carlo replications
  for (i in 1:nrow(Male.Distrib)) {     # one pass per respondent
    u2        <- Male.Distrib$stddev_u2[i] * rnorm(1, mean = 0, sd = 1)
    mc_bca    <- Male.Distrib$FixedEff[i] + u2
    temp      <- Lambda.Value*mc_bca+1
    ginv_a    <- temp^(1/Lambda.Value)
    d2ginv_a  <- max(0,(1-Lambda.Value)*temp^(1/Lambda.Value-2))
    mc_amount <- ginv_a + d2ginv_a * Male.Resid.Var / 2
    z <- data.frame(
      RespondentID = Male.Distrib$RespondentID[i],
      Subgroup     = Male.Distrib$Subgroup[i],
      mc_amount    = mc_amount,
      IndvWeight   = Male.Distrib$INDWTS[i]/100
    )
    # the result grows by one row on every iteration
    Male.MC <- as.data.frame(rbind(Male.MC,z))
  }
}

For each of the 7135 observations in my dataset, 100 simulated nutrient values are created and then back-transformed to the original measurement scale (the simulation uses the results of a nonlinear mixed-effects model fitted to Box-Cox-transformed nutrient values).
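Reading the loop above, the back-transformation appears to be a second-order Taylor (delta-method) approximation to the mean of the inverse Box-Cox transform. The following is a sketch of that reading, not something the post states explicitly. It assumes that \lambda is Lambda.Value, a is mc_bca, and \sigma^2 (Male.Resid.Var) is the variance of a zero-mean residual \varepsilon on the transformed scale.

```latex
\begin{aligned}
g(y) &= \frac{y^{\lambda} - 1}{\lambda},
\qquad
g^{-1}(a) = (\lambda a + 1)^{1/\lambda}, \\[4pt]
\frac{d^{2} g^{-1}}{da^{2}}(a) &= (1 - \lambda)\,(\lambda a + 1)^{1/\lambda - 2}, \\[4pt]
\mathrm{E}\!\left[g^{-1}(a + \varepsilon)\right]
&\approx g^{-1}(a) + \frac{\sigma^{2}}{2}\,
\max\!\left(0,\ \frac{d^{2} g^{-1}}{da^{2}}(a)\right).
\end{aligned}
```

In the code, temp is \lambda a + 1, ginv_a is g^{-1}(a), d2ginv_a is the truncated second derivative, and mc_amount is the right-hand side of the last line.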

I would prefer not to use for loops, as I have read that they are inefficient in R, but I do not understand the apply-based options well enough to use them as an alternative. R is being run on stand-alone machines, normally a standard Dell-type desktop running a Windows 7 variant, in case that influences the recommendations for how to change the code.

Update: To reproduce this for testing, Lambda.Value = 0.4 and Male.Resid.Var = 12.1029420429778, and Male.Distrib$stddev_u2 is constant across all observations.

str(Male.Distrib) is:

'data.frame':   7135 obs. of  14 variables:
 $ RndmEff     : num  1.34 -5.86 -3.65 2.7 3.53 ...
 $ RespondentID: num  9966 9967 9970 9972 9974 ...
 $ Subgroup    : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 4 3 2 4 1 4 2 5 1 2 ...
 $ RespondentID: int  9966 9967 9970 9972 9974 9976 9978 9979 9982 9993 ...
 $ Replicates  : num  41067 2322 17434 21723 375 ...
 $ IntakeAmt   : num  33.45 2.53 9.58 43.34 55.66 ...
 $ RACE        : int  2 3 2 2 3 2 2 2 2 1 ...
 $ INDWTS      : num  41067 2322 17434 21723 375 ...
 $ TOTWTS      : num  1.21e+08 1.21e+08 1.21e+08 1.21e+08 1.21e+08 ...
 $ GRPWTS      : num  41657878 22715139 10520535 41657878 10791729 ...
 $ NUMSUBJECTS : int  1466 1100 1424 1466 1061 1466 1424 1252 1061 1424 ...
 $ TOTSUBJECTS : int  7135 7135 7135 7135 7135 7135 7135 7135 7135 7135 ...
 $ FixedEff    : num  6.09 6.76 7.08 6.09 6.18 ...
 $ stddev_u2   : num  2.65 2.65 2.65 2.65 2.65 ...

head(Male.Distrib) is:

    RndmEff RespondentID Subgroup RespondentID Replicates IntakeAmt RACE INDWTS    TOTWTS   GRPWTS NUMSUBJECTS TOTSUBJECTS  FixedEff stddev_u2
1  1.343753         9966        6         9966      41067 33.449808    2  41067 120622201 41657878        1466        7135  6.089918  2.645938
2 -5.856516         9967        5         9967       2322  2.533528    3   2322 120622201 22715139        1100        7135  6.755664  2.645938
3 -3.648339         9970        4         9970      17434  9.575439    2  17434 120622201 10520535        1424        7135  7.079757  2.645938
4  2.697533         9972        6         9972      21723 43.340180    2  21723 120622201 41657878        1466        7135  6.089918  2.645938
5  3.531878         9974        3         9974        375 55.660607    3    375 120622201 10791729        1061        7135  6.176319  2.645938
6  6.627767         9976        6         9976      48889 91.480049    2  48889 120622201 41657878        1466        7135  6.089918  2.645938

Update 2: the line of the function that is causing the NaN results is

d2ginv_a  <- max(0,(1-Lambda.Value)*temp^(1/Lambda.Value-2))

Thanks to everyone for their assistance and comments, and also for the speed of the responses.

Update: @Ben Bolker is correct that the negative temp values are causing the NaN issue. I missed this in some testing (after commenting out the rest of the function so that only the temp values are returned, and calling my result data frame Test). This code reproduces the NaN issue:

> min(Test)
[1] -2.103819
> min(Test)^(1/Lambda.Value)
[1] NaN

But typing the value in as a literal and then running the same(?) calculation gives me a result, so I missed this when doing manual calculations:

> -2.103819^(1/Lambda.Value) 
[1] -6.419792
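The two calculations are not actually the same, because of R's operator precedence: `^` binds more tightly than unary minus, so the typed literal is evaluated as -(2.103819^2.5). A minimal sketch of the difference, using the values quoted above (the variable names here are only for illustration):

```r
Lambda.Value <- 0.4
x <- -2.103819                                # a negative value stored in a variable

from_variable <- x^(1 / Lambda.Value)         # negative base, non-integer power: NaN
from_literal  <- -2.103819^(1 / Lambda.Value) # parsed as -(2.103819^2.5): finite
with_parens   <- (-2.103819)^(1 / Lambda.Value) # sign inside the parentheses: NaN again

print(c(from_variable, from_literal, with_parens))
```

Wrapping the literal in parentheses makes the manual check behave like the calculation on the stored value.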

I now have working code that (I think) uses vectorization, and it is blindingly fast. In case anyone else has this issue, I am posting the working code below. I had to add a minimum to prevent the <0 issue in the calculation. Thank you to everyone who helped, and to coffee. I did try putting the rnorm results into a data frame, and that really slowed things down; creating them this way and then using cbind is really quick. Male.Distrib is my full data frame of 7135 observations, but this code should work on the cut-down version I posted earlier (not tested).

Min_bca <- ((.5*min(Male.AddSugar$IntakeAmt))^Lambda.Value-1)/Lambda.Value
Test <- Male.Distrib[rep(seq.int(1,nrow(Male.Distrib)), 100), 1:ncol(Male.Distrib)]
RnormOutput <- rnorm(nrow(Test),0,1)
Male.Final <- cbind(Test,RnormOutput)
Male.Final$mc_bca    <- Male.Final$FixedEff + (Male.Final$stddev_u2 *     Male.Final$RnormOutput)
Male.Final$temp      <- ifelse(Lambda.Value*Male.Final$mc_bca+1 > Lambda.Value*Min_bca+1,
                           Lambda.Value*Male.Final$mc_bca+1, Lambda.Value*Min_bca+1)
Male.Final$ginv_a    <- Male.Final$temp^(1/Lambda.Value)
Male.Final$d2ginv_a  <- ifelse(0 > (1-Lambda.Value)*Male.Final$temp^(1/Lambda.Value-2),
                           0, (1-Lambda.Value)*Male.Final$temp^(1/Lambda.Value-2))
Male.Final$mc_amount <- Male.Final$ginv_a + Male.Final$d2ginv_a * Male.Resid.Var / 2

Lessons for the day:

  • a distribution function does not appear to be resampled in a loop if you try to do what I was trying earlier
  • you can't use max() the way I tried on a whole column, as it returns the single maximum value of the column, whereas I wanted the larger of two values in each row. The ifelse statement is the replacement to use.
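As a side note on the second lesson, base R also provides pmax(), which takes the elementwise ("parallel") maximum and can stand in for the ifelse() construction. A small sketch with made-up temp values (the numbers are hypothetical, chosen only to show the shapes of the results):

```r
Lambda.Value <- 0.4
temp <- c(0.5, 2, 4)                       # hypothetical values, one per row

curv <- (1 - Lambda.Value) * temp^(1 / Lambda.Value - 2)

collapsed   <- max(0, curv)                # one number for the whole vector
elementwise <- pmax(0, curv)               # one number per row
via_ifelse  <- ifelse(0 > curv, 0, curv)   # same result as pmax() here

# pmax() can also enforce a lower bound, as done for temp in the code above
floored <- pmax(c(-1, 2), 0)
```

Inside the original loop, max(0, x) was fine because x was a single value; the problem only appears once the calculation is applied to a whole column.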

Here is an approach that addresses the 2 biggest speed issues:

  1. Instead of looping over observations (i), we compute them all at once.
  2. Instead of looping over MC replications (j), we use replicate, which is a simplified apply meant for this purpose.

First we load the dataset and define a function for what you were doing.

Male.Distrib <- read.table('MaleDistrib.txt', check.names = FALSE)

getMC <- function(df, Lambda.Value=0.4, Male.Resid.Var=12.1029420429778) {
  u2        <- df$stddev_u2 * rnorm(nrow(df), mean = 0, sd = 1)
  mc_bca    <- df$FixedEff + u2
  temp      <- Lambda.Value*mc_bca+1
  ginv_a    <- temp^(1/Lambda.Value)
  # pmax() is elementwise; max() would collapse the whole vector to one value
  d2ginv_a  <- pmax(0,(1-Lambda.Value)*temp^(1/Lambda.Value-2))
  mc_amount <- ginv_a + d2ginv_a * Male.Resid.Var / 2
  mc_amount
}

Then we replicate it a bunch of times.

> replicate(10, getMC(Male.Distrib))
         [,1]      [,2]     [,3]     [,4]      [,5]     [,6]     [,7]     [,8]     [,9]    [,10]
[1,] 36.72374 44.491777 55.19637 23.53442 23.260609 49.56022 31.90657 25.26383 25.31197 20.58857
[2,] 29.56115 18.593496 57.84550 22.01581 22.906528 22.15470 29.38923 51.38825 13.45865 21.47531
[3,] 61.27075 10.140378 75.64172 28.10286  9.652907 49.25729 23.82104 31.77349 16.24840 78.02267
[4,] 49.42798 22.326136 33.87446 14.00084 25.107143 25.75241 30.20490 33.14770 62.86563 27.33652
[5,] 53.45546  9.673162 22.66676 38.76392 30.786100 23.42267 28.40211 35.95015 43.75506 58.83676
[6,] 34.72440 23.786004 63.57919  8.08238 12.636745 34.11844 14.88339 21.93766 44.53451 51.12331

Then you can reformat, add IDs, etc., but this is the idea for the main computational part. Good luck!
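One possible way to do that reformatting is sketched below. It assumes the result should have the same columns as Male.MC in the question. The toy data frame copies three rows from the head() output, the `sims` matrix is a random placeholder for the output of replicate(n_rep, getMC(...)), and the names `toy`, `n_rep`, `sims` and `rep_id` are illustrative only.

```r
# Toy stand-in for Male.Distrib (first three rows of the head() output)
toy <- data.frame(
  RespondentID = c(9966, 9967, 9970),
  Subgroup     = c(6, 5, 4),
  INDWTS       = c(41067, 2322, 17434)
)

n_rep <- 4
set.seed(1)
# Placeholder for replicate(n_rep, getMC(toy)): one row per person,
# one column per Monte Carlo replication
sims <- matrix(runif(nrow(toy) * n_rep), nrow = nrow(toy), ncol = n_rep)

# as.vector() unrolls the matrix column by column, so the IDs are repeated
# once per replication, in the same order
Male.MC <- data.frame(
  RespondentID = rep(toy$RespondentID, times = n_rep),
  Subgroup     = rep(toy$Subgroup, times = n_rep),
  rep_id       = rep(seq_len(n_rep), each = nrow(toy)),
  mc_amount    = as.vector(sims),
  IndvWeight   = rep(toy$INDWTS, times = n_rep) / 100
)
```

Building the data frame once from whole vectors avoids the repeated rbind() calls of the original loop.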
