
What are the Implications for my Output of a "Stack Imbalance" Warning when Using a For Loop?

I made a function to calculate Population Weighted Average Densities (PWADs) of randomly generated (x, y) points on a graph in R. The function isn't perfect: I assign points to grid squares based not on the grid square they're in but on which grid centre they're closest to, and I then assume the area a centre pulls from is the size of the grid square (either 1 or 0.25). The purpose of the function is to test sensitivity to grid location and size under varying "total populations". This is the function (the comments were written for a future version of me):
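As a rough illustration of the idea described above (nearest-centre assignment followed by a population-weighted mean of cell densities), here is a minimal sketch on a made-up 2 x 2 grid. The names `pts`, `centres`, `cell_area`, and `dists` are invented for this example and do not appear in the function. It uses `ties.method = "first"` so the result is deterministic, whereas `max.col()` in the function uses its default random tie-breaking.

```r
# Toy data: five points and the four centres of a 2 x 2 grid of unit squares
pts <- data.frame(x = c(0.2, 0.4, 0.6, 1.7, 1.5),
                  y = c(0.3, 0.1, 0.8, 1.6, 0.4))
centres <- data.frame(x = c(0.5, 1.5, 0.5, 1.5),
                      y = c(0.5, 0.5, 1.5, 1.5))
cell_area <- 1

# Euclidean distance from every point (rows) to every centre (columns)
dists <- sqrt(outer(pts$x, centres$x, "-")^2 +
              outer(pts$y, centres$y, "-")^2)

# Index of the nearest centre for each point
groups <- max.col(-dists, ties.method = "first")

# Population per occupied cell, its density, then the population-weighted mean
counts <- table(groups)
dens   <- counts / cell_area
pwad   <- sum(dens * counts / nrow(pts))
pwad
# 2.2
```

Three points fall in the first cell and one each in two others, so the PWAD is (3 * 3 + 1 * 1 + 1 * 1) / 5 = 2.2.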

pwad.grid <- function(xval, yval, totalpop) {
  values <- data.frame(xval, yval)
  
  # I'll use the four grids that I've already got
  # the original size grids
  
  # I'm not cleaning up this code, i.e. it's as it was originally developed for a 
  # non-function use when I only wanted to do this once
  # again, I'm using minimum Euclidean distance of (x, y) point to grid centre to 
  # fudge assignation to grids
  
  lowbounds3 <- 0.25
  highbounds3 <- 10.25
  
  centres1 <- data.frame(x=seq(0.05, 1, .1) * 10, y=10 *  
                          as.vector(matrix(rep(seq(0.05, 1, .1), each=10), 
                                           nrow=10)))
  centres2 <- data.frame(x=seq(0, 10, 1), y=10 *  
                          as.vector(matrix(rep(seq(0, 1, .1), each=11),
                                           nrow=11)))
  centres3 <- data.frame(x=seq(lowbounds3, highbounds3, 1), y=
                          as.vector(matrix(rep(seq(lowbounds3, highbounds3, 1),
                                               each=11), nrow=11)))
  
  #the quarter size integer aligned grid
  
  lowbounds4 <- 0.25
  highbounds4 <- 9.75
  centres4 <- data.frame(x=seq(lowbounds4, highbounds4, .5), y=
                          as.vector(matrix(rep(seq(lowbounds4, highbounds4, .5), 
                                               each=20), nrow=20)))
  
  # and now the stores, which I alter somewhat to allow for varying populations
  # reminder: these are the calculations for the Euclidean distances
  # the code inside the matrix calculates the distances on a repeated entry basis
  # i.e. when there are 100 centres, each (x, y) is repeated 100 times, once for 
  # each centre #the matrix then arranges the results so that each (x, y) occupies 
  # only one row once again
  
  stores1 <- matrix(sqrt(rowSums((values[rep(1:totalpop, each=100), ] - 
                                    centres1[rep(1:100, totalpop), ])^2)), 
                    ncol=100, byrow=TRUE)
  
  stores2 <- matrix(sqrt(rowSums((values[rep(1:totalpop, each=121), ] - 
                                    centres2[rep(1:121, totalpop), ])^2)), 
                    ncol=121, byrow=TRUE)
  
  stores3 <- matrix(sqrt(rowSums((values[rep(1:totalpop, each=121), ] - 
                                    centres3[rep(1:121, totalpop), ])^2)), 
                    ncol=121, byrow=TRUE)
  
  stores4 <- matrix(sqrt(rowSums((values[rep(1:totalpop, each=400), ] - 
                                    centres4[rep(1:400, totalpop), ])^2)), 
                    ncol=400, byrow=TRUE)
  
  # assigning points to groups based on the minimum Euclidean Distance
  groups1 <- max.col(-stores1)
  groups2 <- max.col(-stores2)
  groups3 <- max.col(-stores3)
  groups4 <- max.col(-stores4)
  
  # calculating the PWADs
  pwad1 <- sum(table(groups1) * table(groups1)/totalpop)
  pwad2 <- sum(table(groups2) * table(groups2)/totalpop)
  pwad3 <- sum(table(groups3) * table(groups3)/totalpop)
  mill <- table(groups4) / 0.25
  pwad4 <- sum(mill * table(groups4)/totalpop)
  
  # outputs grouped together
  data.frame(pwad1, pwad2, pwad3, pwad4)
}

In order to look at the effects of varying population size, I have been using for loops in R. Each loop is 1000 iterations and generates four groups of 1000 PWADs (one for each grid type). For a population bigger than 100, the loop takes more than a minute to complete on my machine. For a population of 1000 it takes about 12-13 minutes. Based on the various populations I've already done, I expected a population of 5000 to take about 66 minutes. That's an age, but I was going out so why not run it?

This is the loop and the antecedent code I ran for the population of 5000:

# I created sims earlier when I ran my very first population.
sims <- data.frame(baseline=1:1000, ptfive=1:1000, pt75=1:1000, qtrsize=1:1000)
# I did not run it again when I ran the below:

xvalues <- matrix(runif(5000 * 1000) * 10, ncol=1000)
yvalues <- matrix(runif(5000 * 1000) * 10, ncol=1000)

dim(xvalues)

start_time <- Sys.time()
for (i in 1:1000) {
  xval <- xvalues[, i]
  yval <- yvalues[, i]
  
  sims[i, ] <- pwad.grid(xval, yval, 5000)
  #commented out just in case I forget and run all chunks
}
end_time <- Sys.time()
#started at 5:51, expect to finish approx 6:51-

#write.csv(sims, "5000sim.csv")

end_time - start_time

And this is the console output that running it generated (sims aside):

xvalues <- matrix(runif(5000 * 1000) * 10, ncol=1000)
yvalues <- matrix(runif(5000 * 1000) * 10, ncol=1000)
dim(xvalues)
# [1] 5000 1000
start_time <- Sys.time()
for (i in 1:1000) {
  xval <- xvalues[, i]
  yval <- yvalues[, i]
  
  sims[i, ] <- pwad.grid(xval, yval, 5000)
  #commented out just in case I forget and run all chunks
}
# Warning: stack imbalance in 'for', 2 then -1
end_time <- Sys.time()
end_time - start_time
# Time difference of 1.251416 hours

As you can see, I got a warning (not an error). Unfortunately, because I've been saving the outputs as .csv files, I haven't used set.seed(), so if it was the specific set of numbers I used that caused the warning...

My questions are these:

  • What is "Warning: stack imbalance in 'for', 2 then -1"?
  • Are my results for the population of 5000 compromised?
  • Why did the warning happen?
  • How might I avoid it if I ran the code for 10000 while watching a movie/overnight?

In searching Google, I see mostly descriptions of "stack imbalance" in the context of Rcpp or different languages. As you can see, I have used only base R functions to build my function, and a for loop, which is also from base R.

In case it's a memory thing:

[Image: RStudio's memory usage report]

But that's post-loop. I don't know what it was before or while running it.

Not sure about tags, let me know if more details are needed. Many thanks!

Completion predictions from simple linear regression (SLR):

cloudnumber <- c(10, 100, 250, 500, 500, 750, 900, 1000, 1250)
yseconds <- c(9.173949, 55.87789, 2.186122 * 60, 4.707054 * 60, 4.606928 * 60, 
              7.831578 * 60, 9.376838 * 60,  12.30255 * 60, 15.15093 * 60)
runtime <- lm(yseconds ~ cloudnumber)
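# newdata holds the population sizes to predict for; it is not defined in this post
# type is an argument of predict(), not data.frame(); the result is converted to minutes
predict(runtime, data.frame(cloudnumber=newdata), type="response") / 60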

Adding in the run time for the 5000 population adjusts the predicted time for a population of 10000 from 120.62128 minutes to 150.58184 minutes.

The "Warning: stack imbalance in 'for', 2 then -1" message says that something in C/C++ level code is not programmed correctly. When C level code creates an R variable, it needs to call PROTECT on it, so that the garbage collector doesn't release it. At the end of the call it is supposed to make a matching UNPROTECT call so that the object can be freed.

R checks on these at the beginning and end of external calls, and warns if the results don't balance. The "2 then -1" in your message are the before and after values of that check.

Now if the code you showed us is all that was running, this is a sign of an internal bug in R. It's unfortunate that your example is not reproducible because it doesn't call set.seed, and that it takes so long to run: it would be very difficult for anyone else to reproduce it.

You ask whether this compromises your results. I would say that it could, but of course I don't know that it did.
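One possible sanity check, assuming the `xvalues`, `yvalues` and `sims` objects from the run are still in the workspace, is to recompute each row from the stored inputs and compare it with what the loop saved. Agreement would not prove the run was unaffected, but a mismatch would be a clear sign of trouble. The sketch below uses a hypothetical stand-in, `stat_fun()`, and small made-up matrices so that it runs on its own; in practice `pwad.grid()` and the real objects would take their place.

```r
# Hypothetical stand-in for pwad.grid() so the sketch is self-contained
stat_fun <- function(xval, yval, totalpop) {
  data.frame(a = sum(xval) / totalpop, b = sum(yval) / totalpop)
}

# Small stand-ins for the xvalues, yvalues and sims objects from the run
n_pop <- 50
n_sims <- 20
xvalues <- matrix(runif(n_pop * n_sims) * 10, ncol = n_sims)
yvalues <- matrix(runif(n_pop * n_sims) * 10, ncol = n_sims)
sims <- data.frame(a = numeric(n_sims), b = numeric(n_sims))
for (i in seq_len(n_sims)) {
  sims[i, ] <- stat_fun(xvalues[, i], yvalues[, i], n_pop)
}

# Recompute every row from the stored inputs
recheck <- do.call(rbind, lapply(seq_len(n_sims), function(i) {
  stat_fun(xvalues[, i], yvalues[, i], n_pop)
}))

# Row numbers whose stored values differ from the recomputed ones
rows_match <- vapply(seq_len(n_sims), function(i) {
  isTRUE(all.equal(unlist(sims[i, ]), unlist(recheck[i, ])))
}, logical(1))
mismatched <- which(!rows_match)
mismatched  # an empty result means every stored row was reproduced
```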

For your next run, you should definitely use set.seed(n) to fix the RNG state to a known n at the start. Then, if the warning happens again, you can at least try the identical run and see if it is reproducible. Hopefully it will be, and then you can try to debug it: does it happen with a shorter for loop? If you run options(warn=2) to turn the warning into an error, you might be able to narrow it down to exactly which step caused the problem. Let us (or the R developers) know if you get something reproducible, and maybe the bug can be fixed.
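The suggestions above could be combined roughly as follows. This is only a sketch of the mechanics: `stat_fun()` is a hypothetical stand-in for `pwad.grid()`, the seed value is arbitrary, and a negative column is planted so that `sqrt()` raises an ordinary warning for the demo to find. It assumes the message of interest is signalled through R's regular warning mechanism. If the stack imbalance message is instead printed straight to the console, `options(warn=2)` would not intercept it, and shortening the loop would be the main tool for narrowing it down.

```r
set.seed(42)  # arbitrary seed; any fixed integer makes the draws repeatable

# Hypothetical stand-in for pwad.grid(); sqrt() warns on negative input
stat_fun <- function(xval, yval, totalpop) {
  data.frame(a = sqrt(sum(xval) / totalpop), b = sqrt(sum(yval) / totalpop))
}

n_pop <- 50
n_sims <- 20                       # a much shorter loop than the original
xvalues <- matrix(runif(n_pop * n_sims) * 10, ncol = n_sims)
yvalues <- matrix(runif(n_pop * n_sims) * 10, ncol = n_sims)
xvalues[, 7] <- -1                 # planted problem so the demo has something to find
sims <- data.frame(a = numeric(n_sims), b = numeric(n_sims))

old_opts <- options(warn = 2)      # regular warnings now raise errors
failed_at <- NA_integer_
for (i in seq_len(n_sims)) {
  result <- tryCatch(
    stat_fun(xvalues[, i], yvalues[, i], n_pop),
    error = function(e) NULL       # swallow the error; the iteration is recorded below
  )
  if (is.null(result)) {
    failed_at <- i
    break
  }
  sims[i, ] <- result
}
options(old_opts)                  # restore the previous warning behaviour
failed_at
# 7
```

Because the seed is fixed, rerunning the whole block regenerates the same `xvalues` and `yvalues`, so a failure at a given iteration can be revisited with the identical inputs.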
