
random numpy array whose values are between -1 and 1 and sum to 1

What is the best way to create a NumPy array x of a given size with values randomly (and uniformly?) spread between -1 and 1, and that also sum to 1?

I tried 2*np.random.rand(size)-1 and np.random.uniform(-1,1,size) based on the discussion here, but if I take a transformation approach and re-scale either result by its sum afterwards, x /= np.sum(x), the elements do sum to 1, but some of them suddenly become much greater than 1 or less than -1, which is not wanted.
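To see the issue concretely, here is a minimal sketch (my own illustration of the problem, using hand-picked values rather than random draws): whenever the raw sum is small compared with the largest entry, dividing by it pushes values outside [-1, 1].

import numpy as np

x = np.array([-0.9, 0.8, -0.7, 0.6, 0.5])  # hand-picked values in [-1, 1]; sum is 0.3
print(x.sum())            # 0.3

x_scaled = x / x.sum()    # sums to 1, but each entry is now x_i / 0.3
print(x_scaled)           # e.g. -0.9 / 0.3 = -3.0, well outside [-1, 1]
print(x_scaled.sum())     # 1.0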

In this case, let's let a uniform distribution start the process, but adjust the values to give a sum of 1. For the sake of illustration, I'll use an initial sample of [-1, -0.75, 0, 0.25, 1]. This gives us a sum of -0.5, but we require 1.0.

STEP 1: Compute the amount of total change needed: 1.0 - (-0.5) = 1.5.

Now, we will apportion that change among the elements of the distribution in some appropriate fashion. One simple method I've used is to move middle elements the most, while keeping the endpoints stable.

STEP 2: Compute the difference of each element from the nearer endpoint. For your nice range, this is 1 - abs(x).

STEP 3: Sum these differences and divide the required change by that total. Each element's adjustment is its own difference times this factor.

Putting this much into a chart:

  x    diff  adjust
-1.0   0.00  0.0
-0.75  0.25  0.1875
 0.0   1.0   0.75
 0.25  0.75  0.5625
 1.0   0.0   0.0

Now, simply add the x and adjust columns to get the new values:

 x    adjust  new
-1.0  0.0     -1.0
-0.75 0.1875  -0.5625
 0    0.75     0.75
 0.25 0.5625   0.8125
 1.0  0.0      1.0

There is your adjusted data set: a sum of 1.0, the endpoints intact.


Simple Python code:

x = [-1, -0.75, 0, 0.25, 1.0]
total = sum(x)
diff = [1 - abs(q) for q in x]        # distance of each element from the nearer endpoint
total_diff = sum(diff)
needed = 1.0 - total                  # total change required to reach a sum of 1

adjust = [q * needed / total_diff for q in diff]
new = [x[i] + adjust[i] for i in range(len(x))]
for i in range(len(x)):
    print(f'{x[i]:8} {diff[i]:8} {adjust[i]:8} {new[i]:8}')
print(new, sum(new))

Output:

      -1        0      0.0     -1.0
   -0.75     0.25   0.1875  -0.5625
       0        1     0.75     0.75
    0.25     0.75   0.5625   0.8125
     1.0      0.0      0.0      1.0
[-1.0, -0.5625, 0.75, 0.8125, 1.0] 1.0

I'll let you vectorize this in NumPy.
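As one possible NumPy vectorization of the above (a sketch of my own, not part of the original answer):

import numpy as np

def adjust_to_sum_one(x):
    x = np.asarray(x, dtype=float)
    diff = 1 - np.abs(x)                  # distance of each element from the nearer endpoint
    needed = 1.0 - x.sum()                # total change required to reach a sum of 1
    return x + diff * needed / diff.sum()

print(adjust_to_sum_one([-1, -0.75, 0, 0.25, 1.0]))
# reproduces [-1.0, -0.5625, 0.75, 0.8125, 1.0] from the worked example above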

You can create two different arrays for positive and negative values. Make sure the positive side adds up to 1 and the negative side adds up to 0.

import numpy as np
size = 10
x_pos = np.random.uniform(0, 1, int(np.floor(size/2)))
x_pos = x_pos/x_pos.sum()            # this half sums to exactly 1
x_neg = np.random.uniform(0, 1, int(np.ceil(size/2)))
x_neg = x_neg - x_neg.mean()         # this half is centred so it sums to 0

x = np.concatenate([x_pos, x_neg])
np.random.shuffle(x)

print(x.sum(), x.max(), x.min())
>>> 0.9999999999999998 0.4928358768227867 -0.3265210342316333

print(x)
>>>[ 0.49283588  0.33974127 -0.26079784  0.28127281  0.23749531 -0.32652103
  0.12651658  0.01497403 -0.03823131  0.13271431]

Rejection sampling

You can use rejection sampling. The method below does this by sampling in a space of one dimension less than the original space.

  • Step 1: sample x(1), x(2), ..., x(n-1) randomly by drawing each x(i) from a uniform distribution on [-1, 1].
  • Step 2: if the sum S = x(1) + x(2) + ... + x(n-1) is below 0 or above 2, reject and start again at Step 1.
  • Step 3: compute the n-th variable as x(n) = 1 - S.

Intuition

You can view the vector x(1), x(2), ..., x(n-1), x(n) as a point in the interior of an n-dimensional cube with Cartesian coordinates ±1, ±1, ..., ±1, such that the constraints -1 <= x(i) <= 1 are satisfied.

The additional constraint that the sum of the coordinates must equal 1 restricts the coordinates to a smaller space than the hypercube: a hyperplane of dimension n-1.

If you do regular rejection sampling, drawing all n coordinates from a uniform distribution, then you will never hit the constraint: the sampled point will never lie exactly in the hyperplane. Therefore you work in a subspace of n-1 coordinates, where rejection sampling does work.

Visually

Say you have dimension 4; then you can plot 3 of the 4 coordinates. This plot (homogeneously) fills a polyhedron. Below, this is illustrated by plotting the polyhedron in slices. Each slice corresponds to a different sum S = x(1) + x(2) + ... + x(n-1) and a different value for x(n).

Image: domain for 3 coordinates. Each colored surface relates to a different value for the 4th coordinate.

Marginal distributions

For large dimensions, rejection sampling will become less efficient because the fraction of rejections grows with the number of dimensions.
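A small empirical sketch of that effect (my own illustration, not from the original answer): estimate the acceptance rate, i.e. the fraction of draws with 0 <= S <= 2, for a few dimensions.

import numpy as np

rng = np.random.default_rng(1)   # seed chosen only for reproducibility

for n in (3, 5, 8, 12):
    # S is the sum of n-1 uniform(-1, 1) draws; a sample is accepted when 0 <= S <= 2
    S = rng.uniform(-1, 1, size=(100_000, n - 1)).sum(axis=1)
    accept = np.mean((S >= 0) & (S <= 2))
    print(f"n = {n:2d}  acceptance rate ~ {accept:.3f}")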

One way to 'solve' this would be by sampling from the marginal distributions. However, it is a bit tedious to compute these marginal distributions. Comparison: a similar algorithm exists for generating samples from a Dirichlet distribution, but in that case the marginal distributions are relatively easy. (However, it is not impossible to derive these distributions; see 'Relationship with the Irwin Hall distribution' below.)

In the example above, the marginal distribution of the x(4) coordinate corresponds to the surface area of the cuts. So for 4 dimensions you might be able to figure out the computation based on that figure (you'd need to compute the area of those irregular polygons), but it starts to get more complicated for larger dimensions.

Relationship with the Irwin Hall distribution

To get the marginal distributions you can use truncated Irwin Hall distributions. The Irwin Hall distribution is the distribution of a sum of uniformly distributed variables, and it follows a piecewise polynomial shape. This is demonstrated below for one example.

Code

Since my Python is rusty I will mostly add R code. The algorithm is very basic, so I imagine that any Python coder can easily adapt it into Python code. The hard part of the question seems to me to be more about the algorithm than about how to code it in Python (although I am not a Python coder, so I leave that up to others).

Image: output from sampling. The 4 black curves are marginal distributions for the four coordinates. The red curve is a computation based on an Irwin Hall distribution. This can be extended to a sampling method by computing directly instead of rejection sampling.

The rejection sampling in Python

import numpy as np

def sampler(size):
    reject = True
    while reject:
        x = np.random.uniform(-1, 1, size - 1)  # step 1: sample n-1 coordinates uniformly in [-1, 1]
        S = np.sum(x)
        reject = (S < 0) or (S > 2)             # step 2: reject unless 0 <= S <= 2
    x = np.append(x, 1 - S)                     # step 3: the last coordinate makes the sum exactly 1
    return x

y = sampler(5)
print(y, np.sum(y))
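A quick sanity check (my own addition, assuming the sampler above): every coordinate stays in [-1, 1] and each sample sums to 1.

samples = np.array([sampler(5) for _ in range(10_000)])
assert np.all(samples >= -1) and np.all(samples <= 1)
assert np.allclose(samples.sum(axis=1), 1.0)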

Some more code in R, including the comparison with the Irwin Hall distribution. This distribution can be used to compute the marginal distributions and to devise an algorithm that is more efficient than rejection sampling.

### function to do rejection sample
samp <- function(n) {
  S <- -1
  ## a while loop that performs step 1 (sample) and 2 (compare sum)
  while((S<0) || (S>2) ) { 
    x <- runif(n-1,-1,1)
    S <- sum(x)
  }
  x <- c(x,1-S) ## step 3 (generate n-th coordinate)
  x
}

### compute 10^5 samples
y <- replicate(10^5,samp(4))

### plot histograms
h1 <- hist(y[1,], breaks = seq(-1,1,0.05))
h2 <- hist(y[2,], breaks = seq(-1,1,0.05))
h3 <- hist(y[3,], breaks = seq(-1,1,0.05))
h4 <- hist(y[4,], breaks = seq(-1,1,0.05))

### histograms together in a line plot
plot(h1$mids,h1$density, type = 'l', ylim = c(0,1),
     xlab = "x[i]", ylab = "frequency", main = "marginal distributions")
lines(h2$mids,h2$density)
lines(h3$mids,h3$density)
lines(h4$mids,h4$density)

### add distribution based on Irwin Hall distribution

### Irwin Hall PDF
dih <- function(x,n=3) {
  k <- 0:(floor(x))   
  terms <- (-1)^k * choose(n,k) *(x-k)^(n-1)
  sum(terms)/prod(1:(n-1))
}
dih <- Vectorize(dih)

### Irwin Hall CDF
pih <- function(x,n=3) {
  k <- 0:(floor(x))   
  terms <- (-1)^k * choose(n,k) *(x-k)^n
  sum(terms)/prod(1:(n))
}
pih <- Vectorize(pih)


### adding the line 
### (note we need to scale the variable for the Irwin Hall distribution)
xn <- seq(-1,1,0.001)

range <- c(-1,1)
cum <- pih(1.5+(1-range)/2,3)
scale <- 0.5/(cum[1]-cum[2]) ### renormalize
                           ### (the factor 0.5 is due to the scale difference)
lines(xn,scale*dih(1.5+(1-xn)/2,3),col = 2)
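A rough Python adaptation of that R comparison, as a sketch only (the original answer deliberately leaves the translation to Python coders); it assumes matplotlib is available and reuses the sampler function defined earlier, plotting one coordinate's histogram against the truncated-Irwin-Hall curve (by the symmetry of the constraint region, all four marginals are identical).

import numpy as np
import matplotlib.pyplot as plt
from math import comb, factorial, floor

def irwin_hall_pdf(x, n=3):
    # PDF of the sum of n independent Uniform(0, 1) variables
    if x < 0 or x > n:
        return 0.0
    return sum((-1)**k * comb(n, k) * (x - k)**(n - 1)
               for k in range(floor(x) + 1)) / factorial(n - 1)

def irwin_hall_cdf(x, n=3):
    # CDF of the sum of n independent Uniform(0, 1) variables
    if x < 0:
        return 0.0
    if x > n:
        return 1.0
    return sum((-1)**k * comb(n, k) * (x - k)**n
               for k in range(floor(x) + 1)) / factorial(n)

# draw 10^5 samples in 4 dimensions with the rejection sampler defined above
samples = np.array([sampler(4) for _ in range(100_000)])

# empirical marginal of one coordinate
plt.hist(samples[:, 0], bins=np.arange(-1, 1.05, 0.05), density=True, alpha=0.5)

# theoretical marginal: with u(i) = (x(i)+1)/2, the sum T = u(1)+u(2)+u(3) follows an
# Irwin Hall(3) distribution truncated to [1.5, 2.5], and x(4) = 4 - 2*T
xn = np.linspace(-1, 1, 400)
norm = irwin_hall_cdf(2.5, 3) - irwin_hall_cdf(1.5, 3)
density = [0.5 * irwin_hall_pdf(2 - v / 2, 3) / norm for v in xn]
plt.plot(xn, density, color="red")
plt.xlabel("x[i]"); plt.ylabel("density"); plt.title("marginal distribution")
plt.show()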

You have coded an algebraic contradiction. The assumption of the question you cite is that the random sample will approximately fill the range [-1, 1]. If you re-scale linearly, it is algebraically impossible to maintain that range unless the sum is already 1 before scaling, in which case the scaling makes no change.

You have two immediate choices here:

  1. Surrender the range idea. Make a simple change to ensure that the sum will be at least 1, and accept a smaller range after scaling. You can do this in any way you like that skews the choices toward the positive side (a minimal sketch of this appears after the interval-algebra note below).
  2. Change your original "random" selection algorithm such that it tends to maintain a sum near 1, and then add a final element that returns it to exactly 1.0. Then you don't have to re-scale.

Consider basic interval algebra. If you begin with the interval (range) [-1, 1] and multiply by a (which would be 1/sum(x) for you), then the resulting interval is [-a, a]. If a > 1, as in your case, the resulting interval is larger. If a < 0, the ends of the interval are swapped.
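A minimal sketch of choice 1 (my own illustration, not the answerer's code): if you only accept draws whose raw sum is at least 1, the scale factor 1/sum(x) is at most 1, so after rescaling every element stays inside [-1, 1]; the price is that the sample is skewed toward positive sums.

import numpy as np

rng = np.random.default_rng(2)   # seed only for reproducibility

def skewed_sample(size):
    # keep drawing until the raw sum is >= 1, then rescale;
    # since 1/sum <= 1, no element can leave [-1, 1]
    while True:
        x = rng.uniform(-1, 1, size)
        if x.sum() >= 1:
            return x / x.sum()

x = skewed_sample(10)
print(x.sum(), x.min(), x.max())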


From your comments, I infer that your conceptual problem is a bit more subtle. You are trying to force a distribution with an expected value of 0 to yield a sum of 1. This is unrealistic unless you agree to somehow skew that distribution within certain bounds. So far, you have declined my suggestions, but have not offered anything you will accept. Until you identify that, I cannot reasonably suggest a solution for you.
