Selecting a random sample of groups after groupby in pandas?

Question

I have a very large DataFrame that looks like this example df:

df = 

col1    col2     col3 
apple   red      2.99 
apple   red      2.99 
apple   red      1.99 
apple   pink     1.99 
apple   pink     1.99 
apple   pink     2.99 
...     ....      ...
pear    green     .99 
pear    green     .99 
pear    green    1.29

I group by 2 columns like this:

g = df.groupby(['col1', 'col2'])

Now I want to select 3 random groups, so my expected output looks like this:

col1    col2     col3 
apple   red      2.99 
apple   red      2.99 
apple   red      1.99 
pear    green     .99 
pear    green     .99 
pear    green    1.29
lemon   yellow    .99 
lemon   yellow    .99 
lemon   yellow   1.99

(Assume the three groups above are random groups from df.) How can I do this? I tried this, but it did not help in my case.

Answer 1

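You can use shuffle and ngroup:

import numpy as np

g = df.groupby(['col1', 'col2'])

# Shuffle the group numbers 0 .. ngroups-1
a = np.arange(g.ngroups)
np.random.shuffle(a)

# Keep the rows whose group number is among the first 2 shuffled numbers
df[g.ngroup().isin(a[:2])]  # change 2 to what you need :-)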
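
A self-contained sketch of the same idea, for anyone who wants to run it end to end. The toy data, the helper name sample_groups, and the seed argument are assumptions made for illustration and are not part of the answer above; it also draws group numbers with a NumPy Generator instead of shuffling in place.

```python
import numpy as np
import pandas as pd

# Toy data with four (col1, col2) groups of three rows each (made-up values).
df = pd.DataFrame({
    "col1": ["apple"] * 6 + ["pear"] * 3 + ["lemon"] * 3,
    "col2": ["red"] * 3 + ["pink"] * 3 + ["green"] * 3 + ["yellow"] * 3,
    "col3": [2.99, 2.99, 1.99, 1.99, 1.99, 2.99,
             0.99, 0.99, 1.29, 0.99, 0.99, 1.99],
})


def sample_groups(frame, keys, n_groups, seed=None):
    """Return every row of n_groups randomly chosen groups (hypothetical helper)."""
    g = frame.groupby(keys)
    rng = np.random.default_rng(seed)
    # Draw n_groups distinct group numbers from 0 .. ngroups-1.
    chosen = rng.choice(g.ngroups, size=n_groups, replace=False)
    # ngroup() labels each row with the number of the group it belongs to.
    return frame[g.ngroup().isin(chosen)]


picked = sample_groups(df, ["col1", "col2"], n_groups=3, seed=0)
print(picked)
```

Passing a seed makes the selection reproducible; leaving it as None gives a different set of groups on each run.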

Answer 2

Shuffle your DataFrame with sample, then do a non-sorted groupby:

df = df.sample(frac=1)
df2 = pd.concat(
    [g for _, g in df.groupby(['col1', 'col2'], sort=False, as_index=False)][:3],
    ignore_index=True 
)

If you need the first 3 rows of each group, use groupby.head(3):

df2 = pd.concat(
    [g.head(3) for _, g in df.groupby(['col1', 'col2'], sort=False, as_index=False)][:3],
    ignore_index=True 
)
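
One side effect of shuffling first is that the rows inside each selected group also come back in shuffled order. The sketch below is a variation, not the answer's own code: it keeps the original index and sorts on it at the end so the selected rows appear in their original order. The toy data and the random_state value are assumptions for illustration.

```python
import pandas as pd

# Toy data with four (col1, col2) groups of three rows each (made-up values).
df = pd.DataFrame({
    "col1": ["apple"] * 6 + ["pear"] * 3 + ["lemon"] * 3,
    "col2": ["red"] * 3 + ["pink"] * 3 + ["green"] * 3 + ["yellow"] * 3,
    "col3": [2.99, 2.99, 1.99, 1.99, 1.99, 2.99,
             0.99, 0.99, 1.29, 0.99, 0.99, 1.99],
})

# Shuffle the rows; with sort=False the groups are yielded in order of
# first appearance, which is now random.
shuffled = df.sample(frac=1, random_state=42)
groups = [g for _, g in shuffled.groupby(["col1", "col2"], sort=False)]

# Take the first 3 groups and restore the original row order via the index.
df2 = pd.concat(groups[:3]).sort_index()
print(df2)
```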

Answer 3

If you only need this kind of sampling on a single column, this is another alternative:

df.loc[df['col1'].isin(pd.Series(df['col1'].unique()).sample(2))]

In more detail:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'col1':['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                      'col2': np.random.randint(5, size=9),
                      'col3': np.random.randint(5, size=9)
                     })
>>> df
  col1  col2  col3
0    a     4     3
1    a     3     0
2    a     4     0
3    b     4     4
4    b     4     1
5    b     1     3
6    c     4     4
7    c     3     2
8    c     3     1
>>> sample = pd.Series(df['col1'].unique()).sample(2)
>>> sample
1    b
2    c
dtype: object
>>> df.loc[df['col1'].isin(sample)]
  col1  col2  col3
3    b     4     4
4    b     4     1
5    b     1     3
6    c     4     4
7    c     3     2
8    c     3     1

Answer 4

Here is one way:

from io import StringIO
import pandas as pd
import numpy as np

np.random.seed(100)

data = """
col1    col2     col3
apple   red      2.99
apple   red      2.99
apple   red      1.99
apple   pink     1.99
apple   pink     1.99
apple   pink     2.99
pear    green     .99
pear    green     .99
pear    green    1.29
"""
# Number of groups
K = 2

df = pd.read_table(StringIO(data), sep=' ', skip_blank_lines=True, skipinitialspace=True)
# Use columns as indices
df2 = df.set_index(['col1', 'col2'])
# Choose random sample of indices
idx = np.random.choice(df2.index.unique(), K, replace=False)
# Select
selection = df2.loc[idx].reset_index(drop=False)
print(selection)

Output:

    col1   col2  col3
0  apple   pink  1.99
1  apple   pink  1.99
2  apple   pink  2.99
3   pear  green  0.99
4   pear  green  0.99
5   pear  green  1.29
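
A hedged variant of the same approach, in case sampling directly from a MultiIndex is awkward: it samples integer positions into the unique index and then looks the groups up as a list of tuples. The inline toy data and the use of np.random.default_rng are assumptions for illustration, so the groups it picks need not match the output shown above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["apple"] * 6 + ["pear"] * 3,
    "col2": ["red"] * 3 + ["pink"] * 3 + ["green"] * 3,
    "col3": [2.99, 2.99, 1.99, 1.99, 1.99, 2.99, 0.99, 0.99, 1.29],
})

K = 2  # number of groups to pick
rng = np.random.default_rng(100)

# Use the grouping columns as the index.
df2 = df.set_index(["col1", "col2"])
# One entry per distinct (col1, col2) pair.
groups = df2.index.unique()
# Sample K distinct positions, then turn them into a list of key tuples.
positions = rng.choice(len(groups), size=K, replace=False)
idx = [groups[i] for i in positions]
# Select all rows of the chosen groups and move the keys back to columns.
selection = df2.loc[idx].reset_index()
print(selection)
```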

Answer 5

I turned @Arvid Baarnhielm's answer into a simple function:

def sampleCluster(df:pd.DataFrame, columnCluster:str, fraction) -> pd.DataFrame:
    return df.loc[df[columnCluster].isin(pd.Series(df[columnCluster].unique()).sample(frac=fraction))]
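
A short usage sketch for the function, assuming a made-up DataFrame with four distinct values in col1. The function body is the same logic as above, split over two lines for readability.

```python
import pandas as pd


def sampleCluster(df: pd.DataFrame, columnCluster: str, fraction: float) -> pd.DataFrame:
    # Sample a fraction of the distinct values, then keep every row carrying one of them.
    sampled = pd.Series(df[columnCluster].unique()).sample(frac=fraction)
    return df.loc[df[columnCluster].isin(sampled)]


# Made-up data: four clusters (a, b, c, d) with three rows each.
df = pd.DataFrame({"col1": list("aaabbbcccddd"), "col3": range(12)})

# fraction=0.5 keeps half of the four clusters, i.e. two of them, with all their rows.
half = sampleCluster(df, "col1", 0.5)
print(half)
```

Note that fraction applies to the number of distinct values in the column, not to the number of rows.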

Answer 6

In the spirit of this answer, a simple solution:

n_groups = 2    
df.merge(df[['col1','col2']].drop_duplicates().sample(n=n_groups))
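
To make the merge step concrete, here is a minimal runnable sketch on made-up data. The random_state value is an assumption added for reproducibility and is not part of the answer.

```python
import pandas as pd

# Toy data with four (col1, col2) groups of three rows each (made-up values).
df = pd.DataFrame({
    "col1": ["apple"] * 6 + ["pear"] * 3 + ["lemon"] * 3,
    "col2": ["red"] * 3 + ["pink"] * 3 + ["green"] * 3 + ["yellow"] * 3,
    "col3": [2.99, 2.99, 1.99, 1.99, 1.99, 2.99,
             0.99, 0.99, 1.29, 0.99, 0.99, 1.99],
})

n_groups = 2
# One row per distinct (col1, col2) pair, then sample n_groups of those pairs.
keys = df[["col1", "col2"]].drop_duplicates().sample(n=n_groups, random_state=1)
# With no `on` argument, merge joins on the shared columns (col1, col2);
# the default inner join keeps only rows whose pair was sampled.
result = df.merge(keys)
print(result)
```

The merged result gets a fresh integer index, so the original row labels of df are not preserved.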

6 answers

Answer 1
8 Accepted 2018-04-24 15:13:51

Answer 2
4 2018-04-24 15:01:41

Answer 3
2 2018-09-20 14:47:14

Answer 4
1 2018-04-24 15:01:58

Answer 5
0 2021-02-17 10:06:37

Answer 6
0 2021-07-15 05:02:29