Using Python pandas, how to do stratified random sampling with an assigned sampling percentage per group
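I have a data set with farmer groups and IDs. I need to select 6 of the 18 farmers using stratified random sampling, where the sampling percentage for each group is given.
The group percentages are as follows: M,SC 50%, F,SC 25%, M,ST 20%, F,ST 5%.
The data set is the 18-row df shown below.
Now, using sampling, I have to select 6 farmers: 6 x 0.50 = 3 farmers from group "M,SC", 6 x 0.25 = 1.5, rounded to 2, farmers from group "F,SC", and 6 x 0.20 = 1.2, rounded to 1, farmer from group "M,ST".
Here is what I have so far: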
df
Out[41]:
Group ID
0 M,SC 1
1 M,SC 2
2 M,SC 3
3 M,SC 4
4 M,SC 5
5 F,SC 6
6 F,SC 7
7 F,SC 8
8 F,SC 9
9 M,ST 10
10 M,ST 11
11 M,ST 12
12 M,ST 13
13 M,ST 14
14 F,ST 15
15 F,ST 16
16 F,ST 17
17 F,ST 18
N=6
df.groupby('Group', group_keys=False).apply(lambda x: x.sample(int(np.rint(N*len(x)/len(df))))).sample(frac=1).reset_index(drop=True)
Out[43]:
Group ID
0 M,ST 14
1 M,SC 3
2 M,ST 10
3 M,SC 2
4 F,ST 15
5 F,SC 7
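Now I am stuck on how to apply the given percentages in the sampling, i.e. group M,SC: 50%, group F,SC: 25%, group M,ST: 20% and group F,ST: 5%. The code above instead selects the N=6 sample in proportion to the size of each group.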
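To make the difference concrete, the two allocations can be compared side by side. The sketch below is only an illustration: the names group_sizes, required and allocation are hypothetical, the group sizes are read off the 18-row data set above, and rounding to the nearest integer with np.rint is an assumption carried over from the question's own code.

```python
import numpy as np
import pandas as pd

N = 6

# Group sizes read off the 18-row data set in the question.
group_sizes = pd.Series({'M,SC': 5, 'F,SC': 4, 'M,ST': 5, 'F,ST': 4})

# Required sampling fractions stated in the question.
required = pd.Series({'M,SC': 0.50, 'F,SC': 0.25, 'M,ST': 0.20, 'F,ST': 0.05})

# Allocation used by the groupby/sample code above: proportional to group size.
proportional = np.rint(N * group_sizes / group_sizes.sum()).astype(int)

# Allocation the question asks for: N times the given fraction, rounded.
by_percentage = np.rint(N * required).astype(int)

allocation = pd.DataFrame({'proportional': proportional,
                           'by_percentage': by_percentage})
print(allocation)
```

Under these assumptions the proportional allocation draws 2, 1, 2 and 1 farmers from M,SC, F,SC, M,ST and F,ST, which matches the sample in Out[43], whereas the required percentages call for 3, 2, 1 and 0.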
The following code solves the problem:
import pandas as pd
import numpy as np
df['Proportion'] = df['Group'].map({'M,SC': 0.5, 'F,SC': 0.25, 'M,ST': 0.2, 'F,ST': 0.05})
df['Sample']=round(df['Proportion']*6,0)
df['Selected Farmers_ID'] = df['Sample'].apply(np.ceil).astype(int)
df['Selected Farmers_ID'] = df.groupby('Group').apply(lambda df: df['ID'].sample(df['Selected Farmers_ID'].iat[0])).reset_index(level=0)['ID']
df['Selected Farmers_ID'] = df['Selected Farmers_ID'].fillna('')
df['Selected Farmers_ID'] = df['Selected Farmers_ID'].replace('', np.nan)
df.dropna(subset=['Selected Farmers_ID'], inplace=True)
df
Out[11]:
Group ID Proportion Sample Selected Farmers_ID
1 M,SC 2 0.50 3.0 2.0
3 M,SC 4 0.50 3.0 4.0
4 M,SC 5 0.50 3.0 5.0
5 F,SC 6 0.25 2.0 6.0
8 F,SC 9 0.25 2.0 9.0
12 M,ST 13 0.20 1.0 13.0
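The same result can probably be reached more directly by computing a sample size per group and sampling each group separately. The following is a minimal sketch rather than the answer's own method: the names proportions, sizes, parts and sample are hypothetical, random_state=0 is added only to make the draw repeatable, and the data frame is rebuilt from the listing in the question.

```python
import numpy as np
import pandas as pd

# Rebuild the 18-farmer data set from the question.
df = pd.DataFrame({
    'Group': ['M,SC'] * 5 + ['F,SC'] * 4 + ['M,ST'] * 5 + ['F,ST'] * 4,
    'ID': range(1, 19),
})

N = 6
proportions = {'M,SC': 0.50, 'F,SC': 0.25, 'M,ST': 0.20, 'F,ST': 0.05}

# Farmers to draw per group: N times the given fraction, rounded to the nearest integer.
sizes = {group: int(np.rint(N * p)) for group, p in proportions.items()}

# Sample each group on its own, skipping groups whose rounded size is zero
# and never asking for more rows than the group contains.
parts = [
    rows.sample(n=min(sizes[group], len(rows)), random_state=0)
    for group, rows in df.groupby('Group', sort=False)
    if sizes[group] > 0
]
sample = pd.concat(parts).reset_index(drop=True)
print(sample)
```

With the percentages from the question this selects 3 farmers from M,SC, 2 from F,SC, 1 from M,ST and none from F,ST. Note that rounding each group independently does not guarantee that the sizes add up to N for other percentages, so the total is worth checking.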