Pandas sample from df keeping balance of groups

Let's generate a dataframe:

import pandas as pd
categs = ['cat1'] * 600 + ['cat2'] * 300 + ['cat3'] * 100
subcats = ['sub1', 'sub2', 'sub2', 'sub3', 'sub3', 'sub4', 'sub4', 'sub4', 'sub4', 'sub4'] * 100
subcats[0] = 'subX'
vals = range(1000)
df = pd.DataFrame({
   'category': categs,
   'subcategory': subcats,
   'values': vals
})

Now let's look at the number of rows by category and subcategory:

print(df.groupby(['category', 'subcategory']).size())

We get:

>>>
category  subcategory
cat1      sub1            59
          sub2           120
          sub3           120
          sub4           300
          subX             1
cat2      sub1            30
          sub2            60
          sub3            60
          sub4           150
cat3      sub1            10
          sub2            20
          sub3            20
          sub4            50
dtype: int64

This is a dataframe of 1000 rows: 600 in cat1, 300 in cat2 and 100 in cat3. I want to reduce it from 1000 to, say, 60 rows so that:
1) each category has the same number of rows (20 in our case, which equals 60 / (number of categories));
2) the proportion of each subcategory within a category is kept;
3) a subcategory with only a few items still stays in its category (there is only one 'subX' in cat1, and we need to keep it even though its proportion is 1/600 of cat1).

So when we create the new df, I would like to get something like this:

print(newdf.groupby(['category', 'subcategory']).size())


category  subcategory
cat1      sub1            2
          sub2            4
          sub3            4
          sub4           10
          subX            1
cat2      sub1            2
          sub2            4
          sub3            4
          sub4           10
cat3      sub1            2
          sub2            4
          sub3            4
          sub4           10
dtype: int64

In this case there are 21 rows for cat1, but that is not a big deal; the main idea is that the proportions of the subcategories are preserved and the number of rows per category is close to the target of 20.

You can find the number of rows to keep per subcategory, and then keep only the rows whose cumcount is below that number:

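import numpy as np

# total (approximate) number of rows to keep
n = 60

# number of rows per category
n_per_cat = n / df['category'].nunique()

# number of rows per subcategory: each subcategory's share of its
# category, scaled to n_per_cat and rounded up so that no subcategory
# drops to zero
g_subcat = df.groupby(['category', 'subcategory'])
z = g_subcat['category'].size()
# Series.sum(level=0) was removed in pandas 2.0; a groupby on the
# index level gives the same per-category totals
n_per_subcat = np.ceil(z / z.groupby(level=0).transform('sum') * n_per_cat)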

# output
df_out = (df
          .assign(i=g_subcat.cumcount())
          .merge(n_per_subcat.rename('n').reset_index())
          .query('i < n')
          .drop(columns=['i', 'n']))

# test
df_out.groupby(['category', 'subcategory']).size()

Output:

category  subcategory
cat1      sub1            2
          sub2            4
          sub3            4
          sub4           10
          subX            1
cat2      sub1            2
          sub2            4
          sub3            4
          sub4           10
cat3      sub1            2
          sub2            4
          sub3            4
          sub4           10
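
To see how closely the reduced frame keeps each subcategory's share, one option is to compare the per-category proportions before and after. The following is only a sketch: it rebuilds the example data and repeats the reduction so that it runs on its own, and the helper name `shares` is made up for illustration.

```python
import numpy as np
import pandas as pd

# rebuild the example data from the question
categs = ['cat1'] * 600 + ['cat2'] * 300 + ['cat3'] * 100
subcats = ['sub1', 'sub2', 'sub2', 'sub3', 'sub3',
           'sub4', 'sub4', 'sub4', 'sub4', 'sub4'] * 100
subcats[0] = 'subX'
df = pd.DataFrame({'category': categs,
                   'subcategory': subcats,
                   'values': range(1000)})

# same reduction as in the answer, targeting about 60 rows in total
n_per_cat = 60 / df['category'].nunique()
g_subcat = df.groupby(['category', 'subcategory'])
z = g_subcat.size()
n_per_subcat = np.ceil(
    z / z.groupby(level='category').transform('sum') * n_per_cat)
df_out = (df
          .assign(i=g_subcat.cumcount())
          .merge(n_per_subcat.rename('n').reset_index())
          .query('i < n')
          .drop(columns=['i', 'n']))


def shares(frame):
    # hypothetical helper: share of each subcategory within its category
    counts = frame.groupby(['category', 'subcategory']).size()
    return counts / counts.groupby(level='category').transform('sum')


# side-by-side proportions in the full and the reduced dataframe
comparison = pd.DataFrame({'before': shares(df), 'after': shares(df_out)})
print(comparison.round(3))
```

The largest gap is expected for 'subX': rounding up keeps its single row, so its share within cat1 grows from 1/600 to 1/21.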

P.S. To make the selection random, you can shuffle the dataframe before all this with:

df = df.sample(frac=1)
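
If the random selection needs to be repeatable, the shuffle and the filtering can be wrapped together and given a seed. This is a rough sketch rather than a tested recipe: the function name `balanced_sample` and its `seed` argument are invented here, it assumes the columns are called 'category' and 'subcategory' as in the question, and it assumes the dataframe has no existing columns named 'i' or 'n'.

```python
import numpy as np
import pandas as pd


def balanced_sample(df, n, seed=None):
    """Hypothetical helper: keep roughly n rows, the same number per
    category, preserving subcategory proportions within each category."""
    keys = ['category', 'subcategory']
    # shuffle first so that the rows kept per subcategory are random
    shuffled = df.sample(frac=1, random_state=seed)
    g = shuffled.groupby(keys)
    z = g.size()
    n_per_cat = n / shuffled['category'].nunique()
    # rows to keep per subcategory, rounded up so none disappears
    quota = np.ceil(
        z / z.groupby(level='category').transform('sum') * n_per_cat)
    return (shuffled
            .assign(i=g.cumcount())
            .merge(quota.rename('n').reset_index(), on=keys)
            .query('i < n')
            .drop(columns=['i', 'n']))


# example data from the question
categs = ['cat1'] * 600 + ['cat2'] * 300 + ['cat3'] * 100
subcats = ['sub1', 'sub2', 'sub2', 'sub3', 'sub3',
           'sub4', 'sub4', 'sub4', 'sub4', 'sub4'] * 100
subcats[0] = 'subX'
df = pd.DataFrame({'category': categs,
                   'subcategory': subcats,
                   'values': range(1000)})

sample = balanced_sample(df, 60, seed=0)
print(sample.groupby(['category', 'subcategory']).size())
```

Calling the function again with the same `seed` should return the same rows, while leaving `seed=None` gives a different selection on each call. Note that the merge resets the index, so the original row labels are not preserved.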
