
Generate N uniform random numbers with sum of one

I am trying to generate 100 uniform random numbers in the range [0.005, 0.008] that sum to one. I looked at several related questions but did not find an answer. Could anyone give me a suggestion?

To start, I'm going to slightly modify your example, assuming the 100 variables are bounded by [0.008, 0.012] and that they sum to 1 (this ensures there are feasible points in the set you're sampling).
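A quick way to see why the bounds had to change: n values in [lower, upper] can only sum to a total between n*lower and n*upper. The helper below is a minimal sketch of that check; the name is_feasible is made up for illustration and is not part of any package.

```r
# Hypothetical helper: can n values in [lower, upper] sum to `total`?
is_feasible <- function(n, lower, upper, total = 1) {
  n * lower <= total && total <= n * upper
}

is_feasible(100, 0.005, 0.008)  # FALSE: the sum can be at most 0.8
is_feasible(100, 0.008, 0.012)  # TRUE:  the sum can range over [0.8, 1.2]
```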

The "hit and run" algorithm uniformly samples over a bounded subset of an n-dimensional space. For your case, we have n=100 dimensions; let's define corresponding variables x_1, x_2, ..., x_100 . Then we have three types of constraints to bound our region of the space we want to sample from.

Variables are lower bounded by 0.008 -- this can be captured by the following linear inequalities:

x_1 >= 0.008
x_2 >= 0.008
...
x_100 >= 0.008

Variables are upper bounded by 0.012 -- this can be captured by the following linear inequalities:

x_1 <= 0.012
x_2 <= 0.012
...
x_100 <= 0.012

Variables sum to 1 -- this can be captured by:

x_1 + x_2 + ... + x_100 = 1

Let's say we wanted to get 10 sets of variables that are uniformly distributed within our space. Then we can use the hitandrun package in R in the following way:

library(hitandrun)
n <- 100
lower <- 0.008
upper <- 0.012
s <- 1
constr <- list(constr = rbind(-diag(n), diag(n), rep(1, n), rep(-1, n)),
               dir = rep("<=", 2*n+2),
               rhs = c(rep(-lower, n), rep(upper, n), s, -s))
samples <- hitandrun(constr, n.samples=10)
dim(samples)
# [1]  10 100
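To sanity-check the constraint encoding without waiting for the sampler, one can test a few hand-picked points against the matrix form A x <= rhs. This is a sketch in base R only; the function name satisfies and the tolerance are assumptions made for illustration.

```r
n <- 100
lower <- 0.008
upper <- 0.012
s <- 1
# Same constraint list as above: lower bounds, upper bounds, and the
# equality sum(x) = s written as two inequalities.
constr <- list(constr = rbind(-diag(n), diag(n), rep(1, n), rep(-1, n)),
               dir = rep("<=", 2 * n + 2),
               rhs = c(rep(-lower, n), rep(upper, n), s, -s))

# Hypothetical helper: TRUE if x satisfies every row of A x <= rhs,
# allowing a small tolerance for floating-point rounding.
satisfies <- function(x, constr, tol = 1e-9) {
  all(constr$constr %*% x <= constr$rhs + tol)
}

satisfies(rep(1 / n, n), constr)  # centre of the region: feasible
satisfies(rep(lower, n), constr)  # sums to 0.8, so the sum constraint fails
```

The same helper could be applied row by row to the samples matrix returned by hitandrun, for example with apply(samples, 1, satisfies, constr = constr).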

Note that this takes quite a long while to run (slightly less than 2 hours in my case) because we are sampling in a high-dimensional space (dimension n=100), and to ensure uniform samples the hit and run algorithm actually performs O(n^3) iterations for each sample it draws. You may be able to decrease the runtime by adjusting the thin parameter of the function, though this could affect the independence of your draws.

My idea is to generate the random numbers step by step. In each step, take care that the remaining sum gets neither too large nor too small. In the final step, these random numbers are permuted randomly:

N <- 100

lowerBound <- 0.008
upperBound <- 0.012
Sum        <- 1

X <- rep(NA,N)
remainingSum <- Sum

for (i in 1:(N-1))
{
  a <- max( lowerBound, remainingSum-(N-i)*upperBound )
  b <- min( upperBound, remainingSum-(N-i)*lowerBound )

  A <- ceiling(1e+8*a)
  B <- floor(1e+8*b)

  X[i] <- ifelse( A==B, A, sample(A:B,1)) / 1e+8

  remainingSum <- remainingSum - X[i]
}

X[N] <- remainingSum

X <- sample(X,N)

I'm sorry for the for loop, but it is a base R solution and it seems to work.

> sum(X)
[1] 1
> min(X)
[1] 0.00801727
> max(X)
[1] 0.01199241
> plot(X)

[image]

The distribution is not exactly, but almost uniform. I repeated the calculation 5000 times and stored the n-th sample in X[,n]:

[image] [image] [image]

All positions together:

[image]

Near the lower bound and the upper bound the frequency is increased, but in the rest of the interval between the bounds it is nearly constant.
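The repeated experiment can be sketched roughly as follows. This is not the exact code used for the plots: the wrapper name draw_once is made up, it draws with runif instead of the 1e-8 integer grid (a simplification), and it uses 1000 repetitions instead of 5000 to keep the run short.

```r
# Hypothetical wrapper around the sequential sampler above.
draw_once <- function(N = 100, lowerBound = 0.008, upperBound = 0.012, Sum = 1) {
  X <- numeric(N)
  remainingSum <- Sum
  for (i in seq_len(N - 1)) {
    # Keep the remaining sum reachable by the N - i values still to be drawn
    a <- max(lowerBound, remainingSum - (N - i) * upperBound)
    b <- min(upperBound, remainingSum - (N - i) * lowerBound)
    X[i] <- if (a >= b) a else runif(1, a, b)
    remainingSum <- remainingSum - X[i]
  }
  X[N] <- remainingSum
  sample(X, N)  # random permutation, as in the final step above
}

set.seed(1)
reps <- 1000                         # the answer used 5000
X <- replicate(reps, draw_once())    # 100 x reps matrix, X[, n] is the n-th sample

# Histogram over all positions together (only drawn in an interactive session)
if (interactive()) hist(as.vector(X), breaks = 50)
```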

Here is an idea for making the distribution even more uniform: combine some numbers near the lower and upper boundaries and "throw them into the middle":

  • Pick x1 near the lower boundary and x2 near the upper boundary. Their mean will be approximately the center of the interval.
  • Draw a random number y such that y and x1+x2-y are contained in the interval.
  • Replace x1 and x2 by y and x1+x2-y .
  • Repeat until the peaks at the boundaries vanish.
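The steps above could look roughly like this in base R. The sketch makes two assumptions that the description leaves open: "near the boundary" is taken to mean the current minimum and maximum, and the loop runs for a fixed number of pairs (n_pairs) instead of testing whether the peaks have vanished. The function name rebalance_pairs is made up.

```r
# Hypothetical helper: redistribute extreme pairs while keeping the sum fixed.
rebalance_pairs <- function(X, lowerBound, upperBound, n_pairs = 20) {
  for (k in seq_len(n_pairs)) {
    i <- which.min(X)                 # x1: value nearest the lower boundary
    j <- which.max(X)                 # x2: value nearest the upper boundary
    total <- X[i] + X[j]
    # y and total - y must both lie in [lowerBound, upperBound]
    y_min <- max(lowerBound, total - upperBound)
    y_max <- min(upperBound, total - lowerBound)
    y <- runif(1, y_min, y_max)
    X[i] <- y
    X[j] <- total - y                 # pair sum, and hence sum(X), is unchanged
  }
  X
}

set.seed(42)
# Toy input with exaggerated peaks at both boundaries; it sums to 1
X_before <- c(rep(0.008, 10), rep(0.012, 10), rep(0.010, 80))
X_after  <- rebalance_pairs(X_before, 0.008, 0.012, n_pairs = 10)
```

How many pairs to process is a judgment call; too many passes may distort the distribution in the middle of the interval.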

Without more information about what these numbers will be used for, the problem is ambiguous. By probing some lower-dimensional examples, we can see that what "uniform" means here is unfortunately vague. If the plan is to use this for some sort of Monte Carlo based simulation, the results you get will most likely not be useful.

Let's look at the problem with n=4, each value constrained to [210, 300], and a total of 1000.

We generate (inefficiently) an exhaustive list of all discrete values that match the criteria:

values <- 210:300
df <- subset(expand.grid(a=values, b=values, c=values, d=values), a+b+c+d==1000)

The distribution of a, b, c, and d will be identical because of symmetry. The distribution looks like

> plot(prop.table(table(df$a)), type='l')

[Figure: univariate distribution]
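The four-variable grid above has 91^4 (about 68.6 million) rows before filtering, which may not fit comfortably in memory. As a lighter sketch, the same marginal distribution of a can be obtained by enumerating only b, c and d (91^3 rows) and counting, for each a, how many triples complete the total. The names bcd, rest, counts and p_a are made up here.

```r
values <- 210:300
total  <- 1000

# Enumerate the other three variables only
bcd  <- expand.grid(b = values, c = values, d = values)
rest <- rowSums(bcd)

# For each a, count the (b, c, d) triples with a + b + c + d == total
counts <- sapply(values, function(a) sum(rest == total - a))
p_a    <- counts / sum(counts)        # marginal distribution of a

if (interactive()) plot(values, p_a, type = "l")
```

The resulting curve is clearly not flat, which is the point made above: the marginal distribution of a single coordinate is not uniform.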

This problem will only get worse with higher dimensions. The "summing to 1" requirement has the effect of restricting the sampling to an (N-1)-dimensional hyperplane, and the individual component constraints serve to carve the feasible subset into a polyhedron (based on the intersection of the N-dimensional hypercube with the plane embedded in N-space).

In 3 dimensions, the subspace looks like the intersection of a plane and a cube: a hexagon in the middle, and triangles on the ends. This is easily verified by looking at the plot of the first two principal components:

> values <- 100:150
> df <- subset(expand.grid(a=values, b=values, c=values), a + b + c == 370)
> df2 <- as.data.frame(predict(princomp(df)))
> plot(df2$Comp.1, df2$Comp.2)

[Figure: principal component analysis of the result]

In summary, this problem is much more difficult to reasonably solve than it looks without some knowledge of what the usage intent looks like.

Here's a modified Metropolis-Hastings based solution. Note that I'm not hitting convergence yet with your constraint, but it's quite close:

simple_MH <- function(n= 100, low= 0.005, up= 0.02, max_iter= 1000000) {
  x <- runif(n, low, up)
  sum_x <- sum(x)
  iter <- 0

  if (sum_x == 1) return(x)
  else {
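    # Note: exact equality of a floating-point sum with 1 is rarely reached,
    # so this loop usually runs until max_iter.
    while (sum_x != 1 && iter < max_iter) {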
      iter <- iter + 1
      if (sum_x > 1) {
        xt <- sample(which(x > mean(x)), 1)  
      } else {
        xt <- sample(which(x < mean(x)), 1)
      }

      propose <- runif(1, low, up)
      d_prop <- dnorm(propose, 1 / n, sqrt(1/12 *(up - low)^2))
      d_xt   <- dnorm(x[xt], 1 / n, sqrt(1/12 *(up - low)^2))
      alpha <- d_prop / d_xt

      if (alpha >= 1) {
        x[xt] <- propose
        sum_x <- sum(x)
      } else {
        acc <- sample(c(TRUE, FALSE), 1, prob= c(alpha, 1-alpha))
        if (acc) {
          x[xt] <- propose
          sum_x <- sum(x)
        }
      }
    }
  }
  return(list(x=x, iter= iter))
}

# try it out:
test <- simple_MH() # using defaults (note not [0.005, 0.008])
test2 <- simple_MH(max_iter= 5e6)
R> sum(test[[1]]) # = 1.003529
R> test[[2]] # hit max of 1M iterations
R> sum(test2[[1]]) # = 0.9988
R> test2[[2]] # hit max of 5M iterations
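One likely reason both runs hit the iteration cap is the stopping test sum_x != 1: a sum of continuous random draws almost never equals 1 exactly in floating point. A tolerance-based test is a common alternative. The sketch below is only an illustration; the helper name sums_to and the tolerance 1e-6 are assumptions, not part of the code above.

```r
# Hypothetical helper: does x sum to `target` within a tolerance?
sums_to <- function(x, target = 1, tol = 1e-6) {
  abs(sum(x) - target) < tol
}

0.1 + 0.2 == 0.3                    # FALSE: exact comparison fails on rounding
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: comparison with a tolerance
sums_to(c(0.1, 0.2, 0.7))           # TRUE
sums_to(c(0.1, 0.2, 0.6))           # FALSE: off by 0.1
```

Replacing the exact test with something like sums_to would let the loop stop once the sum is close enough, at the cost of a result that is only approximately normalized.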
