Pandas dataframe, select n random rows based on number of unique values

Question

I'm working on a text classification problem that trains well, but my categories are quite imbalanced, which hurts the results. The two largest categories are each more than 80x the size of the smallest one, so a disproportionate share of the classifications go to those two categories. I need to select n rows (where n may be arbitrarily large) from each category. My dataset is quite large (10M rows, 1k unique categories).

Let's say the dataframe is:

import pandas as pd

data = {
    'category':['2','2','2','2','4','4','4','4','4','4','6','6','6'],
    'text':['t1','t2','t3','t4','t5','t6','t7','t8','t9','t10','t11','t12','t13']
}

df = pd.DataFrame(data)

How could I select n random rows per category?

I have tried to find a way to use np.random.choice to select n random rows, but I can't work out how to get the index of the chosen rows so that I can drop by index.

The ideal output for n = 3 would be something like:

>>> df.head(9)
    category    text
0   2           t3
1   6           t11
2   6           t13
3   4           t6
4   2           t1
5   4           t9
6   4           t8
7   2           t4
8   6           t12

Answer 1

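You can use sample and groupby().head(). Calling sample(frac=1) shuffles every row, and groupby('category').head(3) then keeps the first 3 rows of each category in that shuffled order: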

df.sample(frac=1).groupby('category').head(3)

Output (the selected rows and their order vary between runs):

   category text
4         4   t5
12        6  t13
1         2   t2
8         4   t9
9         4  t10
3         2   t4
10        6  t11
0         2   t1
11        6  t12
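
For a reproducible run, the same shuffle-then-head idea can be wrapped in a small helper. This is only a sketch: the function name sample_per_category and the seed parameter are illustrative and not part of the answer above, and the final call assumes pandas 1.1 or later, where groupby objects have their own sample method.

```python
import pandas as pd


def sample_per_category(df, column, n, seed=None):
    """Return up to n randomly chosen rows for each distinct value in `column`.

    Hypothetical helper: shuffles the whole frame, then keeps the first n
    rows of every group, so groups with fewer than n rows are kept whole.
    """
    shuffled = df.sample(frac=1, random_state=seed)  # shuffle all rows
    return shuffled.groupby(column).head(n)          # first n rows per group


data = {
    'category': ['2', '2', '2', '2', '4', '4', '4', '4', '4', '4', '6', '6', '6'],
    'text': ['t1', 't2', 't3', 't4', 't5', 't6', 't7', 't8', 't9', 't10', 't11', 't12', 't13'],
}
df = pd.DataFrame(data)

# Three random rows per category; fixing the seed makes the draw repeatable.
balanced = sample_per_category(df, 'category', n=3, seed=0)
print(balanced['category'].value_counts())

# Asking for more rows than a category has simply returns the whole category.
capped = sample_per_category(df, 'category', n=5, seed=0)
print(capped['category'].value_counts())

# Alternative (assumes pandas >= 1.1): sample directly within each group.
# Every group must contain at least n rows unless replace=True is passed.
direct = df.groupby('category').sample(n=3, random_state=0)
print(direct['category'].value_counts())
```

The head(n) variant is the more forgiving of the two when some categories are smaller than n, which seems likely with 1k categories of very different sizes. The groupby().sample() variant should raise an error in that case unless sampling with replacement is acceptable.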
