
Stratified sampling in numpy

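In numpy I have a dataset like the one below. The first two columns are indices. I can divide the dataset into blocks by these indices: the first block is 0 0, the second block is 0 1, the third block is 0 2, then 1 0, 1 1, 1 2, and so on. Each block has at least two elements. The numbers in the index columns can vary.

I need to split the dataset randomly 80%-20% along these blocks, such that after the split every block has at least 1 element in both datasets. How can I do this?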

indices | real data
        |
0   0   | 43.25 665.32 ...  } 1st block
0   0   | 11.234            }
0   1     ...               } 2nd block
0   1                       } 
0   2                       } 3rd block
0   2                       }
1   0                       } 4th block
1   0                       }
1   0                       }
1   1                       ...
1   1                       
1   2
1   2
2   0
2   0 
2   1
2   1
2   1
...

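See how you feel about this. To introduce randomness, I shuffle the entire dataset. It is the only way I have figured out to do the split in a vectorized way. You could probably just shuffle an indexing array instead, but that was one indirection too many for my brain. I have also used a structured array, to make extracting the blocks easier. First, let's create a sample dataset: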

from __future__ import division
import numpy as np

# Create a sample data set
c1, c2 = 10, 5
idx1, idx2 = np.arange(c1), np.arange(c2)
idx1, idx2 = np.repeat(idx1, c2), np.tile(idx2, c1)

items = 1000
i = np.random.randint(c1*c2, size=(items - 2*c1*c2,))
d = np.random.rand(items+5)

# np.int and np.float were removed from NumPy; the builtin types are equivalent
dataset = np.empty((items+5,), [('idx1', int), ('idx2', int),
                             ('data', float)])
dataset['idx1'][:2*c1*c2] =  np.tile(idx1, 2)
dataset['idx1'][2*c1*c2:-5] = idx1[i]
dataset['idx2'][:2*c1*c2] = np.tile(idx2, 2)
dataset['idx2'][2*c1*c2:-5] = idx2[i]
dataset['data'] = d
# Add blocks with only 2 and only 3 elements to test corner case
dataset['idx1'][-5:] = -1
dataset['idx2'][-5:] = [0] * 2 + [1]*3

Now the stratified sampling:

# For randomness, shuffle the entire array
np.random.shuffle(dataset)

# inverse[i] is the block number of row i
blocks, inverse = np.unique(dataset[['idx1', 'idx2']], return_inverse=True)
block_count = np.bincount(inverse)
where = np.argsort(inverse)
block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))

# If we have n elements in a block, and we assign 1 to each array, we
# are left with only n-2. If we randomly assign a fraction x of these
# to the first array, the expected ratio of items will be
# (x*(n-2) + 1) : ((1-x)*(n-2) + 1)
# Setting the ratio equal to 4 (80/20) and solving for x, we get
# x = 4/5 + 3/5/(n-2)

# n == 2 divides by zero (giving inf), which the clip below turns into 1
with np.errstate(divide='ignore'):
    x = 4/5 + 3/5/(block_count - 2)
x = np.clip(x, 0, 1) # for n in (2, 3, 4) the fraction is larger than 1
threshold = np.repeat(x, block_count)
threshold[block_start] = 1 # first item goes to A
threshold[block_start + 1] = 0 # second item goes to B

a_idx = threshold > np.random.rand(len(dataset))

A = dataset[where[a_idx]]
B = dataset[where[~a_idx]]
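
The formula for x in the comments above can be checked numerically. This is only a sketch: `expected_sizes` is a hypothetical helper, not part of the answer, and it assumes blocks of at least 3 rows so that n - 2 is nonzero.

```python
import numpy as np

def expected_sizes(n):
    """Expected number of rows in A and B for a block of n rows (n >= 3)."""
    # Fraction of the n - 2 free rows sent to A, clipped as in the answer
    x = np.clip(4/5 + 3/5/(n - 2), 0, 1)
    a = x * (n - 2) + 1        # one guaranteed row plus a share of the rest
    b = (1 - x) * (n - 2) + 1  # one guaranteed row plus the remainder
    return a, b

for n in (3, 4, 5, 10, 100):
    a, b = expected_sizes(n)
    print(n, a, b, a / b)
```

For n of 5 or more the expected ratio a / b comes out at 4, that is 80/20. For n of 3 or 4 the fraction is clipped to 1, so A receives every row except the one reserved for B.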

After running it, the split is roughly 80/20, and all blocks are represented in both arrays:

>>> len(A)
815
>>> len(B)
190
>>> np.all(np.unique(A[['idx1', 'idx2']]) == np.unique(B[['idx1', 'idx2']]))
True
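
The check above only confirms that the same blocks occur in both arrays. To see how the rows of each individual block were divided, a per-block summary along these lines may help. It is a sketch on hypothetical toy keys, and `train_fraction_per_block` is an assumed helper name rather than anything from the answer.

```python
from collections import Counter

def train_fraction_per_block(keys_a, keys_b):
    """Map each block key to the fraction of its rows that landed in A."""
    count_a = Counter(map(tuple, keys_a))
    count_b = Counter(map(tuple, keys_b))
    blocks = set(count_a) | set(count_b)
    return {k: count_a[k] / (count_a[k] + count_b[k]) for k in blocks}

# Hypothetical toy keys: block (0, 0) has 4 rows, block (0, 1) has 2 rows
keys_a = [(0, 0), (0, 0), (0, 0), (0, 1)]
keys_b = [(0, 0), (0, 1)]

fractions = train_fraction_per_block(keys_a, keys_b)
# Blocks that ended up entirely in one array would show a fraction of 0 or 1
missing = [k for k, f in fractions.items() if f in (0.0, 1.0)]
print(fractions, missing)
```

With the structured arrays from the answer, the keys could presumably be obtained with something like `A[['idx1', 'idx2']].tolist()`, which yields a list of tuples.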

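Here is an alternative solution. I'm open to code review if there is a more numpy-like way (without for loops) to implement it. @Jamie's answer is really good, except that it sometimes produces skewed ratios within individual blocks.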

import numpy as np

# `data` is assumed to be a 2-D array whose first two columns are the indices
ratio = 0.8
IDX1 = 0
IDX2 = 1
idx1s = np.unique(data[:, IDX1])
idx2s = np.unique(data[:, IDX2])
train_parts = []
valid_parts = []
for i1 in idx1s:
    for i2 in idx2s:
        # row numbers of the current block
        rows = np.nonzero((data[:, IDX1] == i1) & (data[:, IDX2] == i2))[0]
        if len(rows) == 0:
            continue  # this index combination does not occur
        np.random.shuffle(rows)  # randomize which rows go where
        # number of training rows, keeping at least one row on each side
        n_train = int(np.around(len(rows) * ratio))
        n_train = min(max(n_train, 1), len(rows) - 1)
        train_parts.append(data[rows[:n_train]])
        valid_parts.append(data[rows[n_train:]])
train = np.vstack(train_parts)
valid = np.vstack(valid_parts)
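
One possible loop-free variant of the same per-block split is sketched below. It is not from either answer: `stratified_split` and its parameters are hypothetical names, and it assumes the two index columns are passed as an integer array of shape (n_rows, 2) with at least two rows per block.

```python
import numpy as np

def stratified_split(idx, ratio=0.8, rng=None):
    """Return a boolean mask over the rows of idx (True = train)."""
    rng = np.random.default_rng() if rng is None else rng
    # Number the blocks and count the rows in each of them
    _, inverse, counts = np.unique(idx, axis=0, return_inverse=True,
                                   return_counts=True)
    inverse = inverse.ravel()
    # Sort rows by block, in random order within each block
    order = np.lexsort((rng.random(len(inverse)), inverse))
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
    # Position of every sorted row inside its own block: 0, 1, 2, ...
    rank = np.arange(len(inverse)) - np.repeat(starts, counts)
    # Training rows per block, leaving at least one row on each side
    n_train = np.clip(np.rint(counts * ratio).astype(int), 1, counts - 1)
    mask = np.empty(len(inverse), dtype=bool)
    mask[order] = rank < np.repeat(n_train, counts)
    return mask

# Hypothetical toy index columns: blocks of 2, 5 and 10 rows
idx = np.repeat(np.array([[0, 0], [0, 1], [1, 0]]), [2, 5, 10], axis=0)
mask = stratified_split(idx, rng=np.random.default_rng(0))
print(mask.sum(), (~mask).sum())
```

With the `data` array from the loop version, the split would then be something like `train, valid = data[mask], data[~mask]`, using `data[:, :2].astype(int)` as `idx`.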

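I'm assuming that each block has at least two entries and that, if it has more than two, you want them split as close to 80/20 as possible. The easiest way to do this seems to be to assign a random number to every row and then select based on the percentile within each stratum. Say this is the data in the file strat_sample.csv: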

Index_1,Index_2,Data_1,Data_2
0,0,0.614583182,0.677644482
0,0,0.321384981,0.598450854
0,0,0.303029607,0.300593782
0,0,0.646010758,0.612006715
0,0,0.484572883,0.30052535
0,1,0.010625416,0.118671475
0,1,0.428967984,0.23795173
0,1,0.523440618,0.457275922
0,1,0.379612652,0.337640868
0,1,0.338180659,0.206399031
1,0,0.079386,0.890939911
1,0,0.572864624,0.725615079
1,0,0.045891404,0.300128917
1,0,0.578792198,0.100698871
1,0,0.776485138,0.475135948
1,0,0.401850419,0.784835723
1,1,0.087660923,0.497299605
1,1,0.8460978,0.825774802
1,1,0.526015021,0.581905971
1,1,0.23324672,0.299475291

Then this code (using Pandas data structures) works as required:

import numpy as np
import pandas as pd
# sample data: strat_sample.csv (contents shown above)

def TreatmentOneCount(n , *args):
    #assign a minimum one to each group but as close as possible to fraction OptimalRatio in group 1. 
    OptimalRatio = args[0]
    if n < 2:
        print("N too small, assignment not defined.")
        a = np.nan
    elif n == 2:
        a = 1
    else:
        """
        There are one of two numbers that are close to the target ratio, one above, the other below
        If the number above is N and it is closest to optimal, then you need to set things to N-1 to ensure both groups have at least one member (recall n>2)
        If the number below is 0 and it is closest to optimal, then you need to set things to 1 to ensure both groups have at least one member (recall n>2)
        """
        targetassigment = OptimalRatio * n
        if  targetassigment - np.floor(targetassigment) > 0.5:
            a = min(np.ceil(targetassigment),n-1)
        else:
            a = max(np.floor(targetassigment),1)
    return a


df = pd.read_csv('strat_sample.csv', sep=','  , header=0)

#assign a random number to each entry
df['RandScore'] =  np.random.uniform(0,1,df.shape[0])
# DataFrame.sort was removed from pandas; sort_values is its replacement
df.sort_values(by=['Index_1' ,'Index_2','RandScore'], inplace = True)

#Within each block assign a rank based on random number. 
df['RandRank'] = df.groupby(['Index_1','Index_2'])['RandScore'].rank()

#make a group index
df['MasterIdx'] = df['Index_1'].apply(str) + df['Index_2'].apply(str)

#Store the counts for members of each block
seriestest = df.groupby('MasterIdx')['RandRank'].count()
seriestest.name = "Counts"
dftest = pd.DataFrame(seriestest)

#Add the block counts to the data
df = df.merge(dftest, how='left',  left_on = 'MasterIdx', right_index= True)

#Make the actual assignments to the two groups
df['Assignment'] = (df['RandRank'] <=  df['Counts'].apply(TreatmentOneCount, args = (0.8,))) * -1 + 2
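# drop returns a new DataFrame, so keep the result
df = df.drop(['MasterIdx', 'Counts', 'RandRank', 'RandScore'], axis=1)

# sklearn.cross_validation was removed; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split

# X (features) and y (targets) are assumed to be defined already
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=0)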
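
The rank-within-block idea from the pandas answer above can probably be condensed with groupby transforms. The following is a sketch under assumptions, not the author's code: `assign_groups` and its parameters are hypothetical names, and every block is assumed to have at least two rows.

```python
import numpy as np
import pandas as pd

def assign_groups(df, keys, ratio=0.8, seed=None):
    """Return a Series of 1 (first group) or 2 (second group) per row."""
    rng = np.random.default_rng(seed)
    # One random score per row, grouped by the block columns
    score = pd.Series(rng.random(len(df)), index=df.index)
    grouped = score.groupby([df[k] for k in keys])
    rank = grouped.rank(method="first")  # 1..n within each block
    size = grouped.transform("size")     # n, repeated for each row of the block
    # Rows for the first group, leaving at least one row for each group
    n_first = np.minimum(np.maximum((size * ratio).round(), 1), size - 1)
    return (rank > n_first).astype(int) + 1

# Hypothetical toy frame: blocks of 2, 5 and 10 rows
toy = pd.DataFrame({"Index_1": [0] * 7 + [1] * 10,
                    "Index_2": [0] * 2 + [1] * 5 + [0] * 10})
groups = assign_groups(toy, ["Index_1", "Index_2"], seed=0)
print(groups.value_counts())
```

Rows labelled 1 form the roughly 80% part and rows labelled 2 the rest, matching the `Assignment` column of the answer.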
