Pandas dataframe, select n random rows based on number of unique values
I'm working on a text classification problem that trains well, but my categories are quite imbalanced, which hurts results. The two largest categories are each over 80x larger than the smallest category, so a disproportionate share of the classifications go to those two categories. I need to select n rows (arbitrarily large) from each category. My dataset is quite large (10M rows, 1k unique categories).
Let's say the dataframe is:
import pandas as pd

data = {
    'category': ['2','2','2','2','4','4','4','4','4','4','6','6','6'],
    'text': ['t1','t2','t3','t4','t5','t6','t7','t8','t9','t10','t11','t12','t13']
}
df = pd.DataFrame(data)
How could I select n random rows per category?
I have tried to find some way to use np.random.choice to select n random rows, but I can't find a way to grab that index for a drop by index.
The ideal output for n = 3 would be something like:
>>> df.head(9)
  category text
0        2   t3
1        6  t11
2        6  t13
3        4   t6
4        2   t1
5        4   t9
6        4   t8
7        2   t4
8        6  t12
You can use sample and groupby().head(): shuffle all rows with sample(frac=1), then keep the first 3 rows of each category.
df.sample(frac=1).groupby('category').head(3)
Output:
   category text
4         4   t5
12        6  t13
1         2   t2
8         4   t9
9         4  t10
3         2   t4
10        6  t11
0         2   t1
11        6  t12
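For a reproducible result with a clean 0..8 index like the one in the question, the same idea can be sketched as follows. The random_state seed and the reset_index(drop=True) call are additions for illustration, not part of the original answer. The last part shows DataFrameGroupBy.sample as a possible alternative, assuming pandas 1.1 or newer; unlike head(n), it raises an error when a category has fewer than n rows (unless replace=True is passed).

```python
import pandas as pd

data = {
    'category': ['2','2','2','2','4','4','4','4','4','4','6','6','6'],
    'text': ['t1','t2','t3','t4','t5','t6','t7','t8','t9','t10','t11','t12','t13'],
}
df = pd.DataFrame(data)

n = 3  # rows to keep per category

# Shuffle every row, then take the first n rows of each category.
# Categories with fewer than n rows simply contribute all their rows.
sampled = (
    df.sample(frac=1, random_state=0)   # seed is an assumption, for repeatability
      .groupby('category')
      .head(n)
      .reset_index(drop=True)           # renumber rows 0..len-1
)
print(sampled)

# Alternative (assumes pandas >= 1.1): sample n rows directly within each group.
# This raises ValueError if any category has fewer than n rows.
sampled_alt = df.groupby('category').sample(n=n, random_state=0)
print(sampled_alt['category'].value_counts())
```

On a 10M-row frame the shuffle touches every row once, so if memory is tight it may be worth selecting only the needed columns before sampling.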