
Different random choices on numpy.random.choice

I am using the function numpy.random.choice to generate many random samples at once, but I'd like all the samples to be different. Is anybody aware of a function that does this? Explicitly, I'd like the following to hold:

import numpy as np
a = np.random.choice(62, size=(1000000, 8))
assert( len(set([tuple(a[i]) for i in range(a.shape[0])])) == a.shape[0])

The integer values may be drawn with replacement. The only requirement is that all rows are different.

First things first: if you have a NumPy version >= 1.17, avoid np.random.choice in favor of the recommended Generator method:

rng = np.random.default_rng()
a = rng.choice(62, size=(1000000, 8))  # same arguments as np.random.choice

Ironically enough, doing what you did is the best way to go about it. Just generate all of the numbers and check that they satisfy your restriction.

samples = 1000000
while True:
    a = np.random.choice(62, size=(samples, 8))
    if len(set(tuple(row) for row in a)) == samples:
        break

The reason is that each sample has 8 values, each of which can take 62 different values, so there are effectively 62**8 (about 2.2e14) possible samples. That is a huge number compared to the 1 million samples you want to draw; by the birthday problem, they will all be unique about 99.8% of the time. And if they are not, a second draw virtually guarantees it. You won't find yourself caught in an infinite loop.
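The 99.8% figure can be reproduced with the usual birthday-problem approximation, P(all unique) ≈ exp(-n(n-1)/(2N)), where n is the number of samples and N = 62**8 is the number of possible rows. A minimal sketch, assuming uniform independent draws; the function name prob_all_unique is made up for this illustration:

```python
import math

def prob_all_unique(n_samples, n_values, length):
    """Approximate chance that n_samples uniform random rows are all distinct."""
    n_possible = n_values ** length  # number of distinct rows, e.g. 62**8
    # Expected number of colliding pairs among the n_samples rows.
    expected_pairs = n_samples * (n_samples - 1) / (2 * n_possible)
    # Birthday-problem approximation for "no collision at all".
    return math.exp(-expected_pairs)

p = prob_all_unique(1_000_000, 62, 8)
print(f"{p:.4f}")  # roughly 0.998
```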

Normally, the way you'd go about this is to draw each sample in a loop and check whether it has been encountered before.

seen = set()
draws = []
while len(draws) < samples:
    draw = tuple(np.random.choice(62, size=8))
    if draw not in seen:
        seen.add(draw)
        draws.append(draw)
a = np.array(draws)

This turns out to be much slower because of the Python loop and the numerous calls to np.random.choice. On my machine it takes 15 seconds, compared to only 2 seconds for the method above. If the first method produced duplicate samples so frequently that you stayed in its loop for more than 7-8 iterations, the second method would become more efficient. But that isn't your case, for the reason explained above.
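Part of the cost of either method is building a Python set of tuples. One possible way to keep the uniqueness check inside NumPy, which is not part of the original answer, is to pack each row into a single 64-bit integer (possible here because 62**8 < 2**63) and count the distinct keys with np.unique. The helper name rows_all_unique is hypothetical, and the sketch assumes every entry lies in range(base):

```python
import numpy as np

def rows_all_unique(a, base):
    """Return True if all rows of the 2-D integer array `a` are distinct.

    Assumes every entry of `a` is in range(base).
    """
    ncols = a.shape[1]
    if base ** ncols >= 2 ** 63:
        raise ValueError("rows do not fit into a single int64 key")
    # Treat each row as the digits of a base-`base` number.
    weights = base ** np.arange(ncols, dtype=np.int64)
    keys = a.astype(np.int64) @ weights
    return np.unique(keys).size == keys.size

rng = np.random.default_rng()
a = rng.integers(0, 62, size=(1000, 8))
print(rows_all_unique(a, 62))
```

Whether this is actually faster than the set-based check depends on the array size and the machine, so it is worth timing before relying on it.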

Edit

A hybrid approach would be to generate all the numbers as in the first method but then, instead of creating a set of the samples, use a dict to track the rows in which each sample has been encountered. Then, if there are any duplicates, you won't have to generate a whole new array; you just replace a few individual samples.

from collections import defaultdict
import numpy as np

value = 20
samples = 1000000
length = 8

a = np.random.choice(value, size=(samples, length))
d = defaultdict(list)
for i, row in enumerate(a):
    d[tuple(row)].append(i)
if len(d) < samples:
    print(f'Found {samples - len(d)} duplicates')
    idx = []
    for rows in d.values():
        if len(rows) > 1:
            idx.extend(rows[1:])
            del rows[1:]
    while idx:
        draw = np.random.choice(value, size=length)
        if (t := tuple(draw)) not in d:
            d[t].append(idx[-1])
            a[idx.pop()] = draw
print('Done')

Again, for value = 62 you will very likely be fine with a single draw. But for value = 20 it produces about 20 duplicates on average, so at least one duplicate is a near certainty. It is thus faster to replace those few samples with new unique ones than to use the second method from above. By the time you increase it to value = 30, it's almost 50-50 whether you'll get a duplicate or not. While this approach has a lot more code in it, it retains most of the speed advantage by generating the whole array in one go.
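The figures quoted for the different settings of value follow from the expected number of colliding pairs, n(n-1)/(2N) with N = value**length, which is a good stand-in for the number of duplicates when collisions are rare. A rough sketch, assuming uniform independent draws; expected_duplicates is an illustrative name:

```python
import math

def expected_duplicates(n_samples, n_values, length):
    """Expected number of colliding pairs among n_samples uniform random rows."""
    return n_samples * (n_samples - 1) / (2 * n_values ** length)

for value in (20, 30, 62):
    mean = expected_duplicates(1_000_000, value, 8)
    # exp(-mean) approximates the chance that a single draw has no duplicate.
    print(f"value={value}: ~{mean:.3g} expected duplicates, "
          f"P(no duplicate) ~ {math.exp(-mean):.3f}")
```

This gives roughly 19.5 expected duplicates for value = 20 and a no-duplicate probability of about 0.47 for value = 30, in line with the numbers above.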

In your case I would still use the first suggested method, because duplicates are so unlikely that the only reason to spend even a line on a sanity check is to guard against the unthinkable. No need to complicate matters further.
