Random shuffle with weight in Python
I am currently trying to shuffle an array and am running into some problems.
What I have:
my_array=array([nan, 1, 1, nan, nan, 2, nan, ..., nan, nan, nan])
What I want to do:
I want to shuffle the dataset while keeping the numbers (e.g. the 1, 1 in the array) together. What I did first was convert every nan into a unique negative number.
my_array=array([-1, 1, 1, -2, -3, 2, -4, ..., -2158, -2159, -2160])
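The relabeling step described above might look like the following minimal sketch. The toy array and the name `labeled` are assumptions made for illustration, since the real data is elided in the question.

```python
import numpy as np

# Toy data (an assumption; the question's real array is elided)
my_array = np.array([np.nan, 1, 1, np.nan, np.nan, 2, np.nan])

labeled = my_array.copy()
nan_mask = np.isnan(labeled)

# Give every NaN its own negative label: -1, -2, -3, ...
labeled[nan_mask] = -np.arange(1, nan_mask.sum() + 1)

print(labeled)
```

Because every NaN gets a distinct label, a later group-by on the values treats each NaN as its own single-element group.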
Afterward I split everything up with pandas:
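import pandas as pd
df = pd.DataFrame(my_array)
df.rename(columns={0: 'sampleID'}, inplace=True)
# one Series per distinct label (groupby returns the groups sorted by label)
groups = [group.iloc[:, 0] for _, group in df.groupby('sampleID')]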
If I now shuffle my dataset, every group has an equal probability of appearing at a given place, but this neglects the number of elements in each group. If I have a group of several elements like [9,9,9,9,9,9], it should have a higher chance of appearing earlier than some random nan. Correct me on this one if I'm wrong.
One way to get around this problem is NumPy's choice method. For this I have to create a probability array:
probability_array = np.zeros(len(groups))
for index, item in enumerate(groups):
    # divide by the total number of elements so the probabilities sum to 1
    probability_array[index] = len(item) / len(my_array)
All of this to finally call:
groups=np.array(groups,dtype=object)
rng = np.random.default_rng()
shuffled_indices = rng.choice(len(groups), len(groups), replace=False, p=probability_array)
shuffled_array = np.concatenate(groups[shuffled_indices]).ravel()
shuffled_array[shuffled_array < 1] = np.nan
All of this is quite cumbersome and not very fast. Besides the fact that it could certainly be coded better, I feel like I am missing some very simple solution to my problem. Can somebody point me in the right direction?
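For reference, the pipeline described in the question can probably be condensed into NumPy alone. This is only a sketch under assumptions: the toy array `labeled` stands in for the relabeled data, the seed is fixed purely for reproducibility, and each label is assumed to identify exactly one group, as in the question's group-by on values.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the sketch reproducible

# Toy stand-in for the relabeled array (NaNs already replaced by unique negatives)
labeled = np.array([-1, 1, 1, -2, -3, 2, 2, 2, -4], dtype=float)

# Distinct labels and how many elements each one has
labels, counts = np.unique(labeled, return_counts=True)

# Weights proportional to group size; they must sum to 1 for rng.choice
p = counts / counts.sum()

# Weighted draw of all groups without replacement
order = rng.choice(len(labels), size=len(labels), replace=False, p=p)

# Rebuild the flat array and turn the negative labels back into NaN
shuffled = np.repeat(labels[order], counts[order])
shuffled[shuffled < 0] = np.nan

print(shuffled)
```

Here `np.unique(..., return_counts=True)` replaces the pandas group-by, and `np.repeat` replaces the concatenation of per-group Series.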
One approach:
import numpy as np
from itertools import groupby
# toy data
my_array = np.array([np.nan, 1, 1, np.nan, np.nan, 2, 2, 2, np.nan, 3, 3, 3, np.nan, 4, 4, np.nan, np.nan])
# find groups
groups = np.array([[key, sum(1 for _ in group)] for key, group in groupby(my_array)])
# permute
keys, repetitions = zip(*np.random.permutation(groups))
# recreate new array
# repeat counts must be integers; the permuted array holds them as floats
res = np.repeat(keys, np.asarray(repetitions, dtype=int))
print(res)
Output (single run):
[ 3. 3. 3. nan nan nan nan 2. 2. 2. 1. 1. nan nan nan 4. 4.]
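One detail of this approach is worth spelling out: because nan compares unequal to itself, `itertools.groupby` over a NumPy array puts every nan into its own run of length 1, so the nans are shuffled individually rather than as blocks. The sketch below illustrates this on a smaller toy array; the names `runs`, `nan_runs` and `number_runs` are illustrative assumptions, not part of the answer.

```python
import numpy as np
from itertools import groupby

my_array = np.array([np.nan, 1, 1, np.nan, np.nan, 2, 2, 2])

# Run-length encode the array as (value, run length) pairs
runs = [(key, sum(1 for _ in group)) for key, group in groupby(my_array)]

# Each nan forms its own run, even when two nans are adjacent
nan_runs = [length for key, length in runs if np.isnan(key)]
number_runs = [(float(key), length) for key, length in runs if not np.isnan(key)]

# Permute the runs and rebuild the array, keeping the lengths as integers
rng = np.random.default_rng()
order = rng.permutation(len(runs))
keys = np.array([runs[i][0] for i in order])
lengths = np.array([runs[i][1] for i in order])
res = np.repeat(keys, lengths)

print(res)
```

Keeping the values and the run lengths in separate arrays avoids storing the lengths as floats.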
I have solved your problem under some restrictions.
With these provisions, I have essentially shuffled a representation of the sequences of integers, and afterwards stitched everything back into place.
In [102]: import numpy as np
     ...: from itertools import groupby
     ...: a = np.array([int(_) for _ in '1110022220003044440005500000600777'])
     ...: print(a)
     ...: n, z = [], []
     ...: for i, g in groupby(a):
     ...:     if i:
     ...:         n.append((i, sum(1 for _ in g)))
     ...:     else:
     ...:         z.append(sum(1 for _ in g))
     ...: np.random.shuffle(n)
     ...: nn = n[0]
     ...: b = [*[nn[0]]*nn[1]]
     ...: for zz, nn in zip(z, n[1:]):
     ...:     b += [*[0]*zz, *[nn[0]]*nn[1]]
     ...: print(np.array(b))
[1 1 1 0 0 2 2 2 2 0 0 0 3 0 4 4 4 4 0 0 0 5 5 0 0 0 0 0 6 0 0 7 7 7]
[7 7 7 0 0 1 1 1 0 0 0 4 4 4 4 0 6 0 0 0 5 5 0 0 0 0 0 2 2 2 2 0 0 3]
Note
The lengths of the runs of separators in the shuffled array are exactly the same as in the original array, but shuffling the separators as well is easy. A more difficult problem would be to change the lengths arbitrarily while keeping the array length unchanged.
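Shuffling the separators as well, as the note suggests, could be sketched as follows. This is one possible reading, not the answerer's code: the names `runs`, `gaps`, `run_order` and `gap_order` are assumptions, and the sketch relies on the array starting and ending with a non-zero run, as the example string does.

```python
import numpy as np
from itertools import groupby

rng = np.random.default_rng()
a = np.array([int(c) for c in '1110022220003044440005500000600777'])

# Split the array into non-zero runs (value, length) and separator lengths
runs, gaps = [], []
for value, group in groupby(a):
    length = sum(1 for _ in group)
    if value:
        runs.append((int(value), length))
    else:
        gaps.append(length)

# Assumption: the array starts and ends with a non-zero run,
# so there is exactly one separator fewer than there are runs
run_order = rng.permutation(len(runs))
gap_order = rng.permutation(len(gaps))

# Interleave the shuffled runs with the independently shuffled separators
pieces = []
for k, r in enumerate(run_order):
    value, length = runs[r]
    pieces.extend([value] * length)
    if k < len(gaps):
        pieces.extend([0] * gaps[gap_order[k]])

b = np.array(pieces)
print(b)
```

The total length is unchanged because the same set of run lengths and separator lengths is reused, only in a different order.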