How to recursively split a population?

Question

I'm trying to split a population of Xs (continuous) and Ys (binary) into equal halves (by count) until a "breakpoint" is found. For example, the code below should generate 5,000 observations, with each half having a different proportion of 0s and 1s. I then want to split the half with the larger proportion of 1s, and so on, until there is no way to split any further.

EDIT: My data is not normally distributed, but I had to generate fake data for this example.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random

random.seed(191)
np.random.seed(191)  # NumPy has its own generator, so seed it as well
df = pd.DataFrame( np.random.randint( 0,2,size = ( 5000,1 ) ), columns = list( 'Y' ) )
df['X'] = pd.Series( random.choices( range( 5000 ), k = 5000) )

# Creating equal-sized bins
df['bins'] = pd.qcut( df['X'], 2 )
print( df.groupby('bins')['Y'].value_counts() )
print( df.groupby('bins')['Y'].mean() )

# Next I want to take the bins with the larger proportion of 1s and repeat the qcut until a minimum/maximum(?) is reached

Answer 1

You can do what you want with the following code:

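import numpy as np
import pandas as pd
import random

SIZE = 5000

df = pd.DataFrame(np.random.randint(0, 2, size=(SIZE, 1)), columns=list('Y'))
df['X'] = pd.Series(random.choices(range(5000), k=SIZE))


def splitting(df):

    # base case - no way to split anymore - only 0s or only 1s are in 'Y',
    # or every remaining 'X' value is the same
    if df['Y'].nunique() == 1 or df['X'].nunique() == 1:
        return df

    # work on a copy so the caller's frame is not modified
    df = df.copy()
    # duplicates='drop' avoids a ValueError when tied 'X' values make bin edges coincide
    df['bins'] = pd.qcut(df['X'], 2, duplicates='drop')
    means = df.groupby('bins', observed=True)['Y'].mean()

    # base case - ties collapsed everything into a single bin
    if len(means) < 2:
        return df

    # recursion - keep the other half as is, split the half with the larger share of 1s
    label = means.idxmax()
    df_1 = df[df['bins'] != label]
    df_2 = df[df['bins'] == label]
    return pd.concat([df_1, splitting(df_2)])


result = splitting(df)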
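
If the goal is to see where the "breakpoint" ends up, it may help to log each split rather than only returning the concatenated frame. The following is a minimal sketch under a few assumptions: it is an iterative variant, not the code above; it cuts at the median of X, which should give the same two halves as pd.qcut(df['X'], 2); and the names split_history, toy and history are hypothetical, chosen only for illustration.

```python
import pandas as pd


def split_history(df, x="X", y="Y"):
    """Follow the half with the larger share of 1s and log every split."""
    columns = ["depth", "cut", "n_lower", "n_upper",
               "mean_lower", "mean_upper", "kept"]
    rows = []
    current = df
    # Stop when y is pure (only 0s or only 1s) or x can no longer be divided.
    while current[y].nunique() > 1 and current[x].nunique() > 1:
        cut = current[x].median()
        lower = current[current[x] <= cut]
        upper = current[current[x] > cut]
        if upper.empty:  # ties at the top pushed every row into the lower half
            break
        # On a tie in the means, the lower half is kept.
        keep_upper = upper[y].mean() > lower[y].mean()
        rows.append([len(rows), cut, len(lower), len(upper),
                     lower[y].mean(), upper[y].mean(),
                     "upper" if keep_upper else "lower"])
        current = upper if keep_upper else lower
    return pd.DataFrame(rows, columns=columns)


# Hypothetical toy data: the share of 1s rises with X.
toy = pd.DataFrame({"X": range(8), "Y": [0, 0, 0, 0, 0, 1, 1, 1]})
history = split_history(toy)
print(history)
```

On the toy frame the loop stops after two splits, at X = 3.5 and X = 5.5, because the last kept half contains only 1s. The final value in the cut column is one candidate for the breakpoint; whether that matches what you mean by "breakpoint" depends on your data.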

Answer 1
0 2018-10-08 14:15:20