
Random shuffle with weight in Python

I am currently trying to shuffle an array and am running into some problems.

What I have:

my_array=array([nan, 1, 1, nan, nan, 2, nan, ..., nan, nan, nan])

What I want to do:
I want to shuffle the dataset while keeping the numbers (e.g. the 1, 1 in the array) together. What I did first was convert every nan into a unique negative number.

my_array=array([-1, 1, 1, -2, -3, 2, -4, ..., -2158, -2159, -2160])
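A minimal sketch of that conversion step, assuming the input is a float array in which the gaps are NaN. The toy data and the names `mask` and `converted` are illustrative, not taken from the original code.

```python
import numpy as np

# toy data standing in for my_array (assumption: float array with NaN gaps)
my_array = np.array([np.nan, 1, 1, np.nan, np.nan, 2, np.nan])

# boolean mask marking the NaN positions
mask = np.isnan(my_array)

# number the NaNs -1, -2, -3, ... in order of appearance
converted = my_array.copy()
converted[mask] = -np.arange(1, mask.sum() + 1)

print(converted)
```

Working on a copy keeps the original array available, which helps when the NaNs have to be restored later.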

Afterward I split everything up with pandas:

import pandas as pd

df = pd.DataFrame(my_array)
df.rename(columns={0: 'sampleID'}, inplace=True)
# one Series per distinct sampleID (groupby sorts the groups by key)
groups = [group.iloc[:, 0] for _, group in df.groupby('sampleID')]

If I now shuffle my dataset, every group has an equal probability of appearing at a given place, but this neglects the number of elements in each group. If I have a group of several elements like [9,9,9,9,9,9], it should have a higher chance of appearing earlier than some random nan. Correct me on this one if I'm wrong.
One way to get around this problem is NumPy's choice method. For this I have to create a probability array:

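probability_array = np.zeros(len(groups))
total = sum(len(item) for item in groups)

for index, item in enumerate(groups):
    # weight each group by its share of all elements so the weights sum to 1
    probability_array[index] = len(item) / total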

All of this to finally call:

groups = np.array(groups, dtype=object)
rng = np.random.default_rng()
shuffled_indices = rng.choice(len(groups), len(groups), replace=False, p=probability_array)
shuffled_array = np.concatenate(groups[shuffled_indices]).ravel()
shuffled_array[shuffled_array < 1] = np.nan

All of this is quite cumbersome and not very fast. Besides the fact that the code could certainly be written better, I feel like I am missing some very simple solution to my problem. Can somebody point me in the right direction?

One approach:

import numpy as np
from itertools import groupby

# toy data
my_array = np.array([np.nan, 1, 1, np.nan, np.nan, 2, 2, 2, np.nan, 3, 3, 3, np.nan, 4, 4, np.nan, np.nan])

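# find groups as [key, run length]; each NaN is its own group because NaN != NaN
groups = np.array([[key, sum(1 for _ in group)] for key, group in groupby(my_array)])

# permute the groups (rows)
keys, repetitions = zip(*np.random.permutation(groups))

# recreate new array; the counts come back as floats, so cast them to int for np.repeat
res = np.repeat(keys, np.asarray(repetitions, dtype=int))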
print(res)

Output (single run):

[ 3.  3.  3. nan nan nan nan  2.  2.  2.  1.  1. nan nan nan  4.  4.]
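The permutation above treats every group as equally likely to land in any position. If longer runs should tend to appear earlier, as the question asks, one possible sketch is to draw the run order with weights proportional to run length. The function name `weighted_group_shuffle` and the vectorized run detection are illustrative assumptions, not part of the answer above.

```python
import numpy as np

def weighted_group_shuffle(arr, rng=None):
    """Shuffle runs of equal values, drawing runs with weights proportional to their length.

    Hypothetical helper: every NaN counts as its own run of length 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    arr = np.asarray(arr, dtype=float)
    if arr.size == 0:
        return arr.copy()

    # a run starts wherever the value differs from its predecessor;
    # NaN != NaN, so every NaN starts a new run
    change = np.ones(arr.size, dtype=bool)
    change[1:] = arr[1:] != arr[:-1]
    starts = np.flatnonzero(change)
    lengths = np.diff(np.append(starts, arr.size))
    keys = arr[starts]

    # weights must sum to 1 for rng.choice
    p = lengths / lengths.sum()
    order = rng.choice(len(keys), size=len(keys), replace=False, p=p)

    # rebuild the array from the reordered runs
    return np.repeat(keys[order], lengths[order])

data = np.array([np.nan, 1, 1, np.nan, np.nan, 2, 2, 2, np.nan,
                 3, 3, 3, np.nan, 4, 4, np.nan, np.nan])
print(weighted_group_shuffle(data))
```

Keeping keys and run lengths in separate arrays means the lengths stay integers, so they can be passed to `np.repeat` directly.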

I have solved your problem under some restrictions:

  1. Instead of NaN, I have used zeros as separators.
  2. I assumed that an array of yours ALWAYS starts with a sequence of non-zero integers and ends with another sequence of non-zero integers.

With these provisions, I have essentially shuffled a representation of the sequences of integers, and later I have stitched everything in place again.

In [102]: import numpy as np
     ...: from itertools import groupby
     ...: a = np.array([int(_) for _ in '1110022220003044440005500000600777'])
     ...: print(a)
     ...: n, z = [], []
     ...: for i,g in groupby(a):
     ...:     if i:
     ...:         n.append((i, sum(1 for _ in g)))
     ...:     else:
     ...:         z.append(sum(1 for _ in g))
     ...: np.random.shuffle(n)
     ...: nn = n[0]
     ...: b = [*[nn[0]]*nn[1]]
     ...: for zz, nn in zip(z, n[1:]):
     ...:     b += [*[0]*zz, *[nn[0]]*nn[1]]
     ...: print(np.array(b))
[1 1 1 0 0 2 2 2 2 0 0 0 3 0 4 4 4 4 0 0 0 5 5 0 0 0 0 0 6 0 0 7 7 7]
[7 7 7 0 0 1 1 1 0 0 0 4 4 4 4 0 6 0 0 0 5 5 0 0 0 0 0 2 2 2 2 0 0 3]

Note

The lengths of the runs of separators in the shuffled array are exactly the same as in the original array, but shuffling the separators as well is easy. A more difficult problem would be to change the lengths arbitrarily while keeping the array length unchanged.
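As a rough sketch of the "shuffling the separators as well" variant, the separator run lengths can be permuted independently of the integer runs before stitching. This assumes, as above, that the array starts and ends with a run of non-zero integers; the use of `rng.permutation` here is one choice for illustration, not necessarily what the author had in mind.

```python
import numpy as np
from itertools import groupby

rng = np.random.default_rng()
a = np.array([int(c) for c in '1110022220003044440005500000600777'])

# n: non-zero runs as (value, length); z: lengths of the zero (separator) runs
n, z = [], []
for value, run in groupby(a.tolist()):
    length = sum(1 for _ in run)
    if value:
        n.append((value, length))
    else:
        z.append(length)

# shuffle the integer runs and, independently, the separator lengths
n = [n[i] for i in rng.permutation(len(n))]
z = [z[i] for i in rng.permutation(len(z))]

# stitch: first integer run, then alternating separator run / integer run
b = [n[0][0]] * n[0][1]
for zz, (value, length) in zip(z, n[1:]):
    b += [0] * zz + [value] * length
b = np.array(b)
print(b)
```

Because only the order of the separator lengths changes, the total array length stays the same.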
