
Generate a 2D array of numpy.random.choice without replacement

I'm trying to make my code faster by removing some for loops and using arrays. The slowest step right now is the generation of the random lists.

Context: I have a number of mutations in a chromosome, and I want to generate 1000 random "chromosomes" with the same length and the same number of mutations, but with randomized positions.

Here is what I'm currently running to generate these randomized mutation positions:

import numpy as np

iterations=1000
Chr_size=1000000
num_mut=500
randbps=[]
for k in range(iterations):
    listed=np.random.choice(range(Chr_size),num_mut,replace=False)
    randbps.append(listed)

I want to do something similar to what they cover in this question:

np.random.choice(range(Chr_size),size=(num_mut,iterations),replace=False)

However, replace=False then applies to the array as a whole, not to each randomized chromosome independently.
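A small sketch of that behaviour, using toy sizes that are assumptions for illustration only (the names toy_chr_size, toy_num_mut and toy_iterations are not from the original script). With the real sizes the call succeeds, since 500 × 1000 draws fit into 1,000,000 positions, but no position is ever shared between two chromosomes, so the columns are not independent samples.

```python
import numpy as np

# Toy sizes (assumed for illustration): 20 positions, 5 mutations, 4 chromosomes.
toy_chr_size, toy_num_mut, toy_iterations = 20, 5, 4

# All 20 draws are unique across the whole 2D array, so the four columns
# together use up every position exactly once.
arr = np.random.choice(toy_chr_size, size=(toy_num_mut, toy_iterations), replace=False)
all_unique = np.unique(arr).size == arr.size

# Asking for more draws than there are positions fails outright.
try:
    np.random.choice(toy_chr_size, size=(toy_num_mut, toy_iterations + 1), replace=False)
    too_many_ok = True
except ValueError:
    too_many_ok = False
```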

Further context: later in the script I go through each randomized chromosome and count the number of mutations in a given window:

for l in range(len(randbps)):
    arr=np.asarray(randbps[l])
    for i in range(chr_last_window[f])[::step]:
        counter=((i < arr) & (arr < i+window)).sum()
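Once randbps is a 2D array, the inner window loop above could also be vectorized. The following is only a sketch under assumptions: the function name count_in_windows and the starts argument (standing in for the values of i in the loop) are hypothetical, and window is assumed to be positive. It sorts each row once and uses np.searchsorted to count positions strictly between i and i+window, matching the strict comparisons in the loop.

```python
import numpy as np

def count_in_windows(positions, starts, window):
    """For each row of `positions`, count entries p with start < p < start + window."""
    positions = np.sort(np.asarray(positions), axis=1)
    starts = np.asarray(starts)
    counts = np.empty((positions.shape[0], starts.size), dtype=np.intp)
    for r, row in enumerate(positions):
        # Number of entries <= start (these are excluded by the strict "<").
        lo = np.searchsorted(row, starts, side="right")
        # Number of entries < start + window.
        hi = np.searchsorted(row, starts + window, side="left")
        counts[r] = hi - lo
    return counts

# Tiny hand-checkable example (values are made up for illustration).
example_positions = np.array([[1, 5, 9, 12], [0, 3, 4, 20]])
example_starts = np.arange(0, 20, 5)
example_counts = count_in_windows(example_positions, example_starts, window=5)
```

This still loops over chromosomes, but replaces the loop over windows with two binary searches per row.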

Based on the trick used in this solution, here's an approach that uses argsort/argpartition on an array of random elements to simulate numpy.random.choice without replacement, giving us randbps as a 2D array:

np.random.rand(iterations,Chr_size).argpartition(num_mut)[:,:num_mut]

Runtime test:

In [2]: def original_app(iterations,Chr_size,num_mut):
   ...:     randbps=[]
   ...:     for k in range(iterations):
   ...:         listed=np.random.choice(range(Chr_size),num_mut,replace=False)
   ...:         randbps.append(listed)
   ...:     return randbps
   ...: 

In [3]: # Input params (scaled down version of params listed in question)
   ...: iterations=100
   ...: Chr_size=100000
   ...: num=50
   ...: 

In [4]: %timeit original_app(iterations,Chr_size,num)
1 loops, best of 3: 1.53 s per loop

In [5]: %timeit np.random.rand(iterations,Chr_size).argpartition(num)[:,:num]
1 loops, best of 3: 424 ms per loop
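For reuse, the argpartition trick could be wrapped in a function along these lines. This is a sketch, not the answer's exact code: the function name and the seed parameter are assumptions, and it draws the random keys from np.random.default_rng rather than np.random.rand. Each row holds the column indices of the num_mut smallest random keys in that row, which are distinct by construction.

```python
import numpy as np

def sample_rows_argpartition(iterations, chr_size, num_mut, seed=None):
    """Return an (iterations, num_mut) array; each row has distinct positions in [0, chr_size)."""
    rng = np.random.default_rng(seed)
    # One random key per (chromosome, position) pair.
    keys = rng.random((iterations, chr_size))
    # The first num_mut columns after partitioning index the num_mut smallest keys per row.
    return np.argpartition(keys, num_mut, axis=1)[:, :num_mut]

sampled = sample_rows_argpartition(20, 1000, 50, seed=0)
```

Note that the key array has iterations × chr_size float64 entries, so the full-size case from the question (1000 × 1,000,000) would need about 8 GB for the keys alone. Generating the rows in smaller batches may be necessary.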

I don't know how np.random.choice is implemented, but I am guessing it is optimized for the general case. Your numbers, on the other hand, are not likely to produce the same sequences. Sets may be more efficient for this case, building from scratch:

import random
import numpy as np

def gen_2d(iterations, Chr_size, num_mut):
    randbps = set()
    while len(randbps) < iterations:
        listed = set()
        while len(listed) < num_mut:
            listed.add(random.choice(range(Chr_size)))
        randbps.add(tuple(sorted(listed)))
    return np.array(list(randbps))

This function starts with an empty set, generates a single number in range(Chr_size), and adds that number to the set. Because of how sets work, the same number cannot be added twice. It does the same thing for randbps, so each element of randbps is also unique.
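The same idea of drawing distinct positions with the standard library can also be sketched with random.sample, which samples without replacement directly from a range. The function name gen_2d_sample and the seed parameter are assumptions for illustration. Unlike gen_2d, this variant does not sort each row and does not force the rows to differ from one another.

```python
import random
import numpy as np

def gen_2d_sample(iterations, chr_size, num_mut, seed=None):
    """Return an (iterations, num_mut) array of positions, distinct within each row."""
    rng = random.Random(seed)
    # random.sample draws num_mut distinct values from range(chr_size).
    rows = [rng.sample(range(chr_size), num_mut) for _ in range(iterations)]
    return np.array(rows)

sample_based = gen_2d_sample(10, 1000, 50, seed=1)
```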

For a single iteration of np.random.choice vs. gen_2d:

iterations=1000
Chr_size=1000000
num_mut=500

%timeit np.random.choice(range(Chr_size),num_mut,replace=False)
10 loops, best of 3: 141 ms per loop

%timeit gen_2d(1, Chr_size, num_mut)
1000 loops, best of 3: 647 µs per loop
