
Generate N uniform random numbers with sum of one

I am trying to generate 100 uniform random numbers in the range [0.005, 0.008] that sum to one. I looked at several related questions but did not find an answer. Could anyone give me a suggestion?

To start, I'm going to slightly modify your example, assuming the 100 variables are bounded by [0.008, 0.012] and that they sum to 1. This ensures there are feasible points in the set you're sampling: 100 values of at most 0.008 each can sum to at most 0.8, so the original bounds admit no solution.
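As a quick sanity check of the feasibility argument, here is a minimal sketch. The helper name `is_feasible` is hypothetical and only encodes the necessary condition n * lower <= s <= n * upper.

```r
# Hypothetical helper: a sum s is reachable only if it lies between
# n * lower (all values at the lower bound) and n * upper (all at the upper bound).
is_feasible <- function(n, lower, upper, s) {
  n * lower <= s && s <= n * upper
}

original_ok <- is_feasible(100, 0.005, 0.008, 1)  # bounds from the question
modified_ok <- is_feasible(100, 0.008, 0.012, 1)  # bounds used in this answer
original_ok
modified_ok
```

With the question's bounds the check fails, and with the modified bounds it passes.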

The "hit and run" algorithm uniformly samples over a bounded subset of an n-dimensional space. For your case, we have n=100 dimensions; let's define corresponding variables x_1, x_2, ..., x_100. Then we have three types of constraints to bound the region of the space we want to sample from.

Variables are lower bounded by 0.008. This can be captured by the following linear inequalities:

x_1 >= 0.008
x_2 >= 0.008
...
x_100 >= 0.008

Variables are upper bounded by 0.012. This can be captured by the following linear inequalities:

x_1 <= 0.012
x_2 <= 0.012
...
x_100 <= 0.012

Variables sum to 1. This can be captured by:

x_1 + x_2 + ... + x_100 = 1

Let's say we wanted to get 10 sets of variables that are uniformly distributed within our space. Then we can use the hitandrun package in R in the following way:

library(hitandrun)
n <- 100
lower <- 0.008
upper <- 0.012
s <- 1
constr <- list(constr = rbind(-diag(n), diag(n), rep(1, n), rep(-1, n)),
               dir = rep("<=", 2*n+2),
               rhs = c(rep(-lower, n), rep(upper, n), s, -s))
samples <- hitandrun(constr, n.samples=10)
dim(samples)
# [1]  10 100
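To see what the `constr` list above encodes, it can help to build the same system for a tiny case in base R. This sketch uses n = 3 and illustrative bounds (0.2 and 0.5, which are assumptions chosen for readability, not values from the question) and checks one candidate point against all rows.

```r
# Small-scale version of the constraint system (n = 3 instead of 100).
n <- 3
lower <- 0.2   # illustrative bounds, not from the original question
upper <- 0.5
s <- 1

A <- rbind(-diag(n),      # -x_i    <= -lower  (i.e. x_i >= lower)
           diag(n),       #  x_i    <=  upper
           rep(1, n),     #  sum(x) <=  s
           rep(-1, n))    # -sum(x) <= -s      (together: sum(x) == s)
rhs <- c(rep(-lower, n), rep(upper, n), s, -s)

x <- rep(s / n, n)                         # a candidate point
feasible <- all(A %*% x <= rhs + 1e-12)    # small tolerance for rounding
dim(A)
feasible
```

The equality constraint appears as a pair of opposing inequalities, which is why the full system has 2*n + 2 rows.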

Note that this takes quite a long while to run (slightly less than 2 hours in my case) because we are sampling in a high-dimensional space (dimension n=100), and to ensure uniform samples the hit and run algorithm actually performs O(n^3) iterations for each sample it draws. You may be able to decrease the runtime by adjusting the thin parameter to the function, though this could affect the independence of your draws.
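For intuition about what a single hit-and-run iteration does, here is a base R sketch written for this specific box-plus-sum constraint. It is not the hitandrun package's implementation; the function name `hit_and_run_sketch` and the default `n_iter` are assumptions, and no claim is made about how many iterations are needed for good mixing.

```r
# Illustrative hit-and-run on { x : lower <= x_i <= upper, sum(x) == s }.
hit_and_run_sketch <- function(n = 100, lower = 0.008, upper = 0.012, s = 1,
                               n_iter = 5000) {
  stopifnot(n * lower <= s, s <= n * upper)
  x <- rep(s / n, n)                 # feasible starting point
  for (k in seq_len(n_iter)) {
    d <- rnorm(n)
    d <- d - mean(d)                 # direction with sum(d) == 0 keeps sum(x) fixed
    d <- d / sqrt(sum(d^2))
    # Range of step sizes t that keeps lower <= x + t * d <= upper
    t_a <- (lower - x) / d
    t_b <- (upper - x) / d
    t_min <- max(pmin(t_a, t_b))
    t_max <- min(pmax(t_a, t_b))
    x <- x + runif(1, t_min, t_max) * d   # uniform point on the feasible chord
  }
  x
}

set.seed(1)
x <- hit_and_run_sketch(n_iter = 2000)
range(x)
sum(x)
```

Each iteration picks a random direction inside the hyperplane, finds the chord of the feasible region along it, and jumps to a uniformly chosen point on that chord. Consecutive states are correlated, which is why thinning is used in practice.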

My idea is to generate the random numbers step by step. In each step, take care that the remaining sum gets neither too large nor too small. In the final step these random numbers are permuted randomly:

N <- 100

lowerBound <- 0.008
upperBound <- 0.012
Sum        <- 1

X <- rep(NA,N)
remainingSum <- Sum

for (i in 1:(N-1))
{
  a <- max( lowerBound, remainingSum-(N-i)*upperBound )
  b <- min( upperBound, remainingSum-(N-i)*lowerBound )

  A <- ceiling(1e+8*a)
  B <- floor(1e+8*b)

  X[i] <- ifelse( A==B, A, sample(A:B,1)) / 1e+8

  remainingSum <- remainingSum - X[i]
}

X[N] <- remainingSum

X <- sample(X,N)

I'm sorry for the for-loop, but it is a base R solution and it seems to work.

> sum(X)
[1] 1
> min(X)
[1] 0.00801727
> max(X)
[1] 0.01199241
> plot(X)


The distribution is not exactly uniform, but almost. I repeated the calculation 5000 times and stored the n-th sample in X[,n]:

[three figures not preserved]
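The repetition step can be sketched by wrapping the loop in a function and calling `replicate`. This is an assumption-laden variant rather than the exact code above: the function name `draw_sequential` is hypothetical, it draws continuously with `runif` instead of on the 1e-8 grid, and it uses 500 repetitions instead of 5000 to keep the example quick.

```r
# Sequential draw: each value is sampled from the interval that still
# allows the remaining values to reach the target sum within the bounds.
draw_sequential <- function(N = 100, lowerBound = 0.008, upperBound = 0.012,
                            Sum = 1) {
  X <- numeric(N)
  remainingSum <- Sum
  for (i in seq_len(N - 1)) {
    a <- max(lowerBound, remainingSum - (N - i) * upperBound)
    b <- min(upperBound, remainingSum - (N - i) * lowerBound)
    X[i] <- runif(1, a, b)   # assumption: continuous draw, not the 1e-8 grid
    remainingSum <- remainingSum - X[i]
  }
  X[N] <- remainingSum
  sample(X, N)               # random permutation, as in the original code
}

set.seed(42)
X <- replicate(500, draw_sequential())   # column n holds the n-th sample
dim(X)
hist(as.vector(X), breaks = 50, main = "All positions together")
```

Plotting `hist(X[1, ])` for a single row gives the distribution at one position.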

All positions together:

[figure not preserved]

Near the lower bound and the upper bound the frequency is increased, but in the rest of the interval between the bounds it is nearly constant.

Here is an idea for making the distribution even more uniform: combine some numbers near the lower and upper boundaries and "throw them into the middle":

  • Pick x1 near the lower boundary and x2 near the upper boundary. Their mean will be approximately the center of the interval.
  • Draw a random number y such that y and x1+x2-y are contained in the interval.
  • Replace x1 and x2 by y and x1+x2-y.
  • Repeat until the peaks at the boundaries vanish.
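One possible reading of the steps above, as a sketch: the function name `rebalance_pair` is hypothetical, and taking the current minimum and maximum as "near the boundary" is an assumption, since the text does not say how x1 and x2 are chosen. The toy vector is also made up for illustration.

```r
# One rebalancing step: replace the smallest and largest value by
# y and (x1 + x2 - y), which preserves the total sum.
rebalance_pair <- function(X, lowerBound, upperBound) {
  i <- which.min(X)
  j <- which.max(X)
  if (i == j) return(X)                      # all values equal, nothing to do
  total <- X[i] + X[j]
  # y must satisfy lowerBound <= y <= upperBound and the same for total - y
  y_min <- max(lowerBound, total - upperBound)
  y_max <- min(upperBound, total - lowerBound)
  y <- runif(1, y_min, y_max)
  X[i] <- y
  X[j] <- total - y
  X
}

set.seed(7)
X0 <- c(rep(0.008, 5), rep(0.010, 90), rep(0.012, 5))  # toy data with boundary peaks
X1 <- X0
for (k in 1:5) X1 <- rebalance_pair(X1, 0.008, 0.012)
sum(X0)
sum(X1)
```

How many repetitions are enough ("until the peaks vanish") is left open here, as it is in the text.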

Without more information about what these numbers will be used for, the problem is ambiguous. By probing some lower-dimensional examples, we can see that what "uniform" means here is unfortunately vague. If the plan is to use this for some sort of Monte Carlo based simulation, the results you get will most likely not be useful.

Let's look at the problem with n=4, each value constrained to [210, 300], and a total of 1000.

We generate (inefficiently) an exhaustive list of all discrete values that match the criteria:

values <- 210:300
df <- subset(expand.grid(a=values, b=values, c=values, d=values), a+b+c+d==1000)

The distribution of a, b, c, and d will be identical because of symmetry. The distribution looks like this:

> plot(prop.table(table(df$a)), type='l')

[Figure: univariate distribution]
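The exhaustive `expand.grid` call builds 91^4 (about 68.6 million) rows before filtering. If that is too heavy, the same marginal can be obtained by counting instead of enumerating. The sketch below is an alternative route, not the approach used above; `poly_mult` and `ways` are hypothetical helper names.

```r
values <- 210:300

# Multiply two count vectors as polynomials: entry k of the result counts
# the ways to reach the k-th possible sum.
poly_mult <- function(p, q) {
  out <- numeric(length(p) + length(q) - 1)
  for (i in seq_along(p)) {
    idx <- i:(i + length(q) - 1)
    out[idx] <- out[idx] + p[i] * q
  }
  out
}

base <- rep(1, length(values))                    # one way to hit each single value
triple <- poly_mult(poly_mult(base, base), base)  # counts for b + c + d
# triple[1] corresponds to the smallest possible sum, 3 * min(values)
ways <- function(total) {
  idx <- total - 3 * min(values) + 1
  if (idx < 1 || idx > length(triple)) 0 else triple[idx]
}

# P(a = v) is proportional to the number of (b, c, d) with b + c + d = 1000 - v
marginal <- sapply(values, function(v) ways(1000 - v))
marginal <- marginal / sum(marginal)
plot(values, marginal, type = "l")
peak <- values[which.max(marginal)]
```

The resulting curve is clearly not flat, which is the point of the example: each coordinate's marginal is non-uniform even though every feasible combination is equally likely.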

This problem will only get worse with higher dimensions. The "summing to 1" requirement has the effect of restricting the sampling to an (N-1)-dimensional hyperplane, and the individual component constraints serve to carve the feasible subset into a polyhedron (the intersection of the N-dimensional hypercube with the plane embedded in N-space).

In 3 dimensions, the feasible set looks like the intersection of a plane and a cube: a hexagon in the middle, and triangles on the ends. This is easily verified by looking at the plot of the first two principal components:

> values <- 100:150; df <- subset(expand.grid(a=values, b=values, c=values), a + b + c==370); df2 <- as.data.frame(predict(princomp(df)))
> plot(df2$Comp.1, df2$Comp.2)

[Figure: principal component analysis of the result]

In summary, without some knowledge of the intended use, this problem is much more difficult to solve reasonably than it looks.

Here's a modified Metropolis-Hastings based solution. Note that I'm not hitting convergence yet with your constraint, but it's quite close:

simple_MH <- function(n= 100, low= 0.005, up= 0.02, max_iter= 1000000) {
  x <- runif(n, low, up)
  sum_x <- sum(x)
  iter <- 0

  if (sum_x == 1) return(x)
  else {
    while (sum_x != 1 & iter < max_iter) {
      iter <- iter + 1
      if (sum_x > 1) {
        xt <- sample(which(x > mean(x)), 1)  
      } else {
        xt <- sample(which(x < mean(x)), 1)
      }

      propose <- runif(1, low, up)
      d_prop <- dnorm(propose, 1 / n, sqrt(1/12 *(up - low)^2))
      d_xt   <- dnorm(x[xt], 1 / n, sqrt(1/12 *(up - low)^2))
      alpha <- d_prop / d_xt

      if (alpha >= 1) {
        x[xt] <- propose
        sum_x <- sum(x)
      } else {
        acc <- sample(c(TRUE, FALSE), 1, prob= c(alpha, 1-alpha))
        if (acc) {
          x[xt] <- propose
          sum_x <- sum(x)
        }
      }
    }
  }
  return(list(x=x, iter= iter))
}

# try it out:
test <- simple_MH() # using defaults (note not [0.005, 0.008])
test2 <- simple_MH(max_iter= 5e6)
R> sum(test[[1]]) # = 1.003529
R> test[[2]] # hit max of 1M iterations
R> sum(test2[[1]]) # = 0.9988
R> test2[[2]] # hit max of 5M iterations
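One detail worth noting about the loop condition `sum_x != 1`: exact equality on doubles is fragile, so a tolerance-based stopping rule is a common alternative. The helper name `near_target` and the default tolerance below are assumptions, and this change alone would not make the sampler above converge, since the reported sums (1.003529 and 0.9988) are still far from 1.

```r
# Exact comparison of doubles can fail even for "obvious" sums.
exact_equal <- (0.1 + 0.2) == 0.3

# Hypothetical helper: stop when the sum is within tol of the target.
near_target <- function(sum_x, target = 1, tol = 1e-8) {
  abs(sum_x - target) < tol
}

exact_equal
near_target(0.1 + 0.2, target = 0.3)
near_target(1.003529)
```

In the function above, the condition could then be written as `while (!near_target(sum_x) && iter < max_iter)`, with `&&` being the usual operator for scalar conditions.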
