Generate random numpy array from a given list of elements with at least one repetition of each element

I want to create an array (say output_list) from a given numpy array (say input_list) by resampling, such that each element of input_list appears in output_list at least once. The length of output_list will always be greater than the length of input_list.

I tried a few approaches, and I am looking for a faster method. Unfortunately, numpy's random.choice doesn't guarantee that every element appears at least once.

Step 1: Generate Data

import string
import random
import numpy as np

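# Length of the resampled array (referred to as output_size below).
output_size = 150000
chars = string.digits + string.ascii_lowercase
# dict_data is not defined in this post; dict_data[1]['unique_len'] is the
# number of input strings to generate.
input_list = [
    "".join(random.choice(chars) for _ in range(5))
    for _ in range(dict_data[1]['unique_len'])
]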

Option 1: Let's try numpy's random.choice with a uniform probability distribution.

output_list = np.random.choice(
    input_list,
    size=output_size,
    replace=True,
    p=[1 / len(input_list)] * len(input_list),
)
assert len(set(input_list)) == len(set(output_list)), \
    "Output list has fewer elements than input list"

This raises an AssertionError:

Output list has fewer elements than input list
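
A rough way to see why this happens: with n unique inputs and m uniform draws with replacement, a given element is missed with probability (1 - 1/n)^m, so about n(1 - 1/n)^m elements are expected to be missing. The sketch below just evaluates that expression; the function name and the sizes passed to it are illustrative assumptions, since the post does not state the input length.

```python
def expected_missing(n: int, m: int) -> float:
    """Expected number of the n inputs that never appear in m uniform draws."""
    return n * (1 - 1 / n) ** m


# Illustrative sizes only (assumed, not taken from the post).
print(expected_missing(100_000, 150_000))
```

When n is comparable to m, this count is far from zero, which is consistent with the assertion failing.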

Option 2: Let's pad input_list with random draws from itself and then shuffle the result.

output_list = np.concatenate((
    np.array(input_list),
    np.random.choice(
        input_list,
        size=output_size - len(input_list),
        replace=True,
        p=[1 / len(input_list)] * len(input_list),
    ),
), axis=None)

np.random.shuffle(output_list)
assert len(set(input_list)) == len(set(output_list)), \
    "Output list has fewer elements than input list"

While this doesn't raise the assertion, I am looking for a faster solution, either algorithmically or using one of numpy's built-in functions.

Thanks for any help.

Let lenI be the input list length and lenO the output list length.

1) Make lenO - lenI uniform random choices from the source list.

2) Then append the whole input list to the end of the output list.

3) Then make lenI iterations of a Fisher–Yates shuffle to distribute the last elements uniformly.

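import random

src = [1, 2, 3, 4]
lD = 10        # output (destination) length, lenO above
lS = len(src)  # input (source) length, lenI above
dst = []
# 1) lD - lS uniform random picks from the source list
for _ in range(lD - lS):
    dst.append(src[random.randint(0, lS - 1)])
# 2) append the whole source list so every element is present
dst.extend(src)
print(dst)
# 3) swap each of the last lS elements with a randomly chosen position
for i in range(lD - 1, lD - lS - 1, -1):
    r = random.randint(0, lD - 1)
    dst[r], dst[i] = dst[i], dst[r]
print(dst)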

>>[4, 3, 1, 3, 4, 3, 1, 2, 3, 4]
>>[4, 3, 1, 3, 4, 3, 1, 3, 4, 2]

This approach has linear complexity, O(lenO).
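
The same idea can be sketched with NumPy array operations instead of a Python loop. This is a variant under stated assumptions, not the code above: the function name resample_with_coverage and its parameters are hypothetical, src is assumed to be one-dimensional and non-empty, and instead of swapping elements it overwrites randomly chosen distinct slots with the source elements. No timing is claimed here; it would need to be measured against the shuffle-based version.

```python
import numpy as np


def resample_with_coverage(src, out_len, rng=None):
    """Return out_len draws from src in which every element of src appears at least once."""
    rng = np.random.default_rng() if rng is None else rng
    src = np.asarray(src)
    if out_len < src.size:
        raise ValueError("out_len must be at least len(src)")
    # Fill the whole output with uniform draws (with replacement) from src.
    out = src[rng.integers(0, src.size, size=out_len)]
    # Pick len(src) distinct slots, in random order, and overwrite them with src,
    # so each source element is guaranteed one position in the output.
    slots = rng.choice(out_len, size=src.size, replace=False)
    out[slots] = src
    return out


result = resample_with_coverage(["a", "b", "c", "d"], 10)
print(result)
```

Because the guaranteed copies land in distinct random slots and every other slot holds an independent uniform draw, no separate shuffle of the full array is needed.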
