Stratified samples from Pandas

Question

I have a pandas DataFrame which looks approximately as follows:

cli_id | X1 | X2 | X3 | ... | Xn |  Y  |
----------------------------------------
123    | 1  | A  | XX | ... | 4  | 0.1 |
456    | 2  | B  | XY | ... | 5  | 0.2 |
789    | 1  | B  | XY | ... | 5  | 0.3 |
101    | 2  | A  | XX | ... | 4  | 0.1 |
...

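I have a client ID, a few categorical attributes, and Y, which is the probability of an event and takes values from 0 to 1 in steps of 0.1.

I need to take a stratified sample of size 200 from every group (10 folds) of Y.

I often use this to take a stratified sample when splitting into train/test: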

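from sklearn.model_selection import StratifiedShuffleSplit

def stratifiedSplit(X, y, size):
    # current scikit-learn API: the splitter is configured first and
    # the data is passed to split(); older releases took y in the constructor
    sss = StratifiedShuffleSplit(n_splits=1, test_size=size, random_state=0)

    for train_index, test_index in sss.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    return X_train, X_test, y_train, y_test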

But I don't know how to modify it for this case.

Answer 1

If the number of samples is the same for every group, or if the proportion is constant for every group, you could try something like

df.groupby('Y').apply(lambda x: x.sample(n=200))

or

df.groupby('Y').apply(lambda x: x.sample(frac=.1))
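For reference, here is a minimal runnable sketch of the same idea on synthetic data. The frame, its sizes, and the seed are made up for illustration. It uses `DataFrameGroupBy.sample` (available since pandas 1.1), which is assumed here to be an acceptable substitute for the `apply` form above: it draws per group without going through `apply` and keeps the original index.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 11 Y levels (0.0 ... 1.0), 500 rows each
df = pd.DataFrame({
    "cli_id": np.arange(5500),
    "X1": rng.integers(1, 3, size=5500),
    "Y": np.repeat(np.round(np.arange(11) * 0.1, 1), 500),
})

# Fixed-size strata: 200 rows from every Y group
fixed = df.groupby("Y").sample(n=200, random_state=0)

# Proportional strata: 10% of every Y group
prop = df.groupby("Y").sample(frac=0.1, random_state=0)

print(fixed.groupby("Y").size())
print(prop.groupby("Y").size())
```

Note that `n=200` raises a `ValueError` for any group with fewer than 200 rows unless `replace=True` is passed.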

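To perform stratified sampling with respect to more than one variable, just group by more variables. It may be necessary to construct new binned variables for this purpose.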
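As a rough sketch of what a binned variable could look like: the example below cuts a continuous column into quartiles with `pd.qcut` and then stratifies on Y and the quartile jointly. The data, the column name `X1_bin`, and the choice of quartiles are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic data: Y in steps of 0.1 and a continuous feature X1 (both made up)
df = pd.DataFrame({
    "Y": np.round(rng.integers(0, 11, size=20000) * 0.1, 1),
    "X1": rng.normal(size=20000),
})

# New binned variable: quartiles of X1, labelled 0..3
df["X1_bin"] = pd.qcut(df["X1"], q=4, labels=False)

# Stratify on Y and the X1 quartile jointly: 10% of every (Y, X1_bin) cell
sample = df.groupby(["Y", "X1_bin"]).sample(frac=0.1, random_state=0)

print(sample.groupby(["Y", "X1_bin"]).size().unstack())
```

The more variables you group by, the smaller each cell gets, so it is worth checking the cell sizes before sampling.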

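However, if the group size is too small relative to the proportion, e.g. a group size of 1 and a proportion of .25, then no item will be returned. This is because the number of rows to draw is converted to an integer, and int(0.25) = 0.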
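The effect is easy to reproduce on a tiny made-up frame. The helper `sample_at_least_one` below is a hypothetical workaround, not part of the answer: it simply forces at least one row per non-empty group, which slightly over-represents very small groups.

```python
import pandas as pd

# Tiny made-up frame: group "a" has a single row, group "b" has eight
df = pd.DataFrame({"Y": ["a"] + ["b"] * 8, "v": range(9)})

# With frac=0.25 the single-row group asks for zero rows and drops out
plain = df.groupby("Y").sample(frac=0.25, random_state=0)
print(plain["Y"].unique())

def sample_at_least_one(g, frac, seed=0):
    """Hypothetical helper: sample a fraction of g, but never fewer than one row."""
    n = max(1, int(round(len(g) * frac)))
    return g.sample(n=n, random_state=seed)

parts = [sample_at_least_one(g, 0.25) for _, g in df.groupby("Y")]
guarded = pd.concat(parts)
print(guarded["Y"].unique())
```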

Answer 2

I'm not totally sure whether you mean this:

import numpy as np

strats = []
for k in range(11):
    y_val = k*0.1
    # np.isclose avoids float mismatches such as 3*0.1 != 0.3
    dummy_df = your_df[np.isclose(your_df['Y'], y_val)]
    strats.append( dummy_df.sample(200) )

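This makes a dummy dataframe consisting only of the rows with the Y value you want, and then takes a sample of 200.

OK, so you need the different chunks to have the same structure. I guess that's a bit harder; here's how I would do it:

First of all, I would get a histogram of what X1 looks like: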

hist, edges = np.histogram(your_df['X1'], bins=np.linspace(min_x, max_x, nbins + 1))

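We now have a histogram with nbins bins.

Now the strategy is to draw a certain number of rows depending on their value of X1. We will draw more from the bins with more observations and fewer from the bins with fewer, so that the structure of X is preserved.

In particular, the relative contribution of every bin should be: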

rel = [float(i) / sum(hist) for i in hist]

This will be something like [0.1, 0.2, 0.1, 0.3, 0.3]

If we want 200 samples, we need to draw:

draws_in_bin = [int(i*200) for i in rel]
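One caveat with `int(i*200)`: truncation can make the per-bin draws add up to fewer than 200. If an exact total matters, a largest-remainder allocation is one option. The function `allocate` and the bin counts below are hypothetical and only meant as a sketch of that technique.

```python
import numpy as np

def allocate(hist, total):
    """Split `total` draws across bins in proportion to `hist` (largest remainder)."""
    hist = np.asarray(hist, dtype=float)
    exact = hist / hist.sum() * total        # ideal, possibly fractional, draws
    draws = np.floor(exact).astype(int)      # rounds down, like int(i*200)
    shortfall = int(total - draws.sum())     # draws lost to rounding down
    # hand the missing draws to the bins with the largest fractional parts
    order = np.argsort(exact - draws)[::-1]
    draws[order[:shortfall]] += 1
    return draws

hist = [7, 11, 13]                           # hypothetical bin counts
truncated = [int(h / sum(hist) * 200) for h in hist]
draws_in_bin = allocate(hist, 200)

print(truncated, sum(truncated))
print(draws_in_bin, draws_in_bin.sum())
```

With these counts the truncated version hands out 198 draws, while the largest-remainder version hands out exactly 200.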

Now we know how many observations to draw from every bin:

import numpy as np
import pandas as pd

strats = []
for k in range(11):
    y_val = k*0.1

    # get a dataframe for every value of Y
    # (np.isclose avoids float mismatches such as 3*0.1 != 0.3)
    dummy_df = your_df[np.isclose(your_df['Y'], y_val)]

    bin_strat = []
    for left_edge, right_edge, n_draws in zip(edges[:-1], edges[1:], draws_in_bin):

        # half-open bin [left_edge, right_edge), so rows sitting exactly
        # on a left edge are kept; rows equal to max_x fall outside
        bin_df = dummy_df[ (dummy_df['X1'] >= left_edge)
                         & (dummy_df['X1'] < right_edge) ]

        # this takes the right number of draws out
        # of the X1 bin where we currently are;
        # sample() raises a ValueError if the bin holds fewer than n_draws rows
        bin_strat.append(bin_df.sample(n_draws))
        # Note that every element of bin_strat is a dataframe
        # with a number of entries that corresponds to the
        # structure of draws_in_bin

    # concatenate the dataframes for every bin and append to the list
    strats.append( pd.concat(bin_strat) )
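The same procedure can also be written with `pd.cut` and `groupby`, which avoids the manual edge comparisons. This is only a sketch under assumptions: the data is synthetic, the bins are equal-width, the draws are rounded rather than truncated, and each draw is capped at the number of rows actually available in that bin.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for your_df (all values made up): 11 Y levels, 3000 rows each
your_df = pd.DataFrame({
    "Y": np.repeat(np.round(np.arange(11) * 0.1, 1), 3000),
    "X1": rng.normal(size=33000),
})

nbins = 5
# Equal-width bins over the range of X1, labelled 0 .. nbins-1
your_df["X1_bin"] = pd.cut(your_df["X1"], bins=nbins, labels=False)

# Share of each X1 bin in the whole data set, turned into draws out of 200
rel = your_df["X1_bin"].value_counts(normalize=True).sort_index()
draws_in_bin = (rel * 200).round().astype(int)

strats = []
for y_val, y_df in your_df.groupby("Y"):
    for b, bin_df in y_df.groupby("X1_bin"):
        # never ask for more rows than this bin holds for this Y value
        n = min(int(draws_in_bin.loc[b]), len(bin_df))
        strats.append(bin_df.sample(n=n, random_state=0))

sample = pd.concat(strats)
print(sample.groupby("Y").size())
```

Because of the rounding and the cap, each Y group ends up with approximately, not exactly, 200 rows.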

Solution 1
14 2016-12-08 09:38:25

Solution 2
4 Accepted 2016-12-08 08:56:29
