How can I select rows randomly in proportion to the number of unique values for each group in Python?
I would like to randomly select rows in proportion to the number of unique values in column "ID", grouped by column "Team". Further, I would like to retrieve only 8 rows in total.

I have:
| ID | Team | Color |
| ----- | ----- | ------------ |
| 1 | A | Blue |
| 2 | B | Red |
| 2 | B | Green |
| 3 | A | Blue |
| 6 | C | Red |
| 1 | B | Yellow |
| 2 | B | Green |
| 9 | A | Blue |
| 6 | C | Red |
| 1 | B | Yellow |
| 9 | A | Blue |
| 1 | A | Purple |
Only the proportions are based on unique values; the rows pulled do not need to be unique in any way.

Using the table above, the proportions would be:
| Team | Unique IDs | Proportion | Number selected |
| ------ | ---------- | ----------- | ---------------- |
| A | 3 | 0.500 | 4 |
| B | 2 | 0.333 | 3 |
| C | 1 | 0.167 | 1 |
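
For reference, the table above can be reproduced with a short pandas sketch. The `df` literal mirrors the sample data, and the name `summary` is only illustrative. Rounding to the nearest integer is an assumption that happens to match the "Number selected" column here; in general, rounded counts are not guaranteed to add up to the requested total.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "ID":    [1, 2, 2, 3, 6, 1, 2, 9, 6, 1, 9, 1],
    "Team":  ["A", "B", "B", "A", "C", "B", "B", "A", "C", "B", "A", "A"],
    "Color": ["Blue", "Red", "Green", "Blue", "Red", "Yellow",
              "Green", "Blue", "Red", "Yellow", "Blue", "Purple"],
})

n_total = 8

# Number of distinct IDs per team, and each team's share of the total.
summary = df.groupby("Team")["ID"].nunique().to_frame("Unique IDs")
summary["Proportion"] = summary["Unique IDs"] / summary["Unique IDs"].sum()

# Assumption: round to the nearest integer (this matches the table above).
summary["Number selected"] = (summary["Proportion"] * n_total).round().astype(int)

print(summary)
```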
So, since I want 8 rows in total selected proportionately, I should end up with something like the following:
| ID | Team | Color |
| ----- | ----- | ------------ |
| 1 | A | Blue |
| 3 | A | Blue |
| 9 | A | Blue |
| 1 | A | Purple |
| 2 | B | Green |
| 2 | B | Red |
| 1 | B | Yellow |
| 6 | C | Red |
Count the number of unique 'ID's in each 'Team' group, then `.sample` this many elements from each 'Team' group:

```python
import numpy as np

# df is the DataFrame shown in the question
n_total = 8
unique_counts = df.groupby('Team')['ID'].agg('nunique')
nums_selected = np.floor(unique_counts / unique_counts.sum() * n_total).astype(int)  # rounded down
result = df.groupby('Team', group_keys=False).apply(  # for each 'Team' group:
    lambda x: x.sample(n=nums_selected[x.name],       # sample this many rows
                       replace=True)                  # (with replacement)
)
```
Note:

The result can contain fewer elements than `n_total` because `nums_selected` is rounded down when converting from `float` to `int`. With the sample data, rounding down gives 4 + 2 + 1 = 7 rows rather than 8. However, you may use any method for this conversion: `np.ceil`, `pd.Series.round`, or any other method you like.
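
If the total must be exactly `n_total`, one option is a largest-remainder allocation: round every group down, then hand the leftover rows to the groups with the largest fractional parts. This is a swapped-in technique, not part of the answer above, and the helper name `allocate` is hypothetical. A minimal sketch, assuming the sample data from the question:

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "ID":    [1, 2, 2, 3, 6, 1, 2, 9, 6, 1, 9, 1],
    "Team":  ["A", "B", "B", "A", "C", "B", "B", "A", "C", "B", "A", "A"],
    "Color": ["Blue", "Red", "Green", "Blue", "Red", "Yellow",
              "Green", "Blue", "Red", "Yellow", "Blue", "Purple"],
})


def allocate(unique_counts, n_total):
    """Hypothetical helper: integer counts per group that sum to n_total."""
    exact = unique_counts / unique_counts.sum() * n_total
    base = np.floor(exact).astype(int)           # round every group down
    shortfall = n_total - int(base.sum())        # rows still unassigned
    if shortfall > 0:
        # Give one extra row to the groups with the largest fractional parts.
        remainders = (exact - base).sort_values(ascending=False)
        base.loc[remainders.index[:shortfall]] += 1
    return base


unique_counts = df.groupby("Team")["ID"].nunique()
nums_selected = allocate(unique_counts, n_total=8)

# Sample each team separately (with replacement) and stack the pieces.
sampled = pd.concat([
    group.sample(n=int(nums_selected[team]), replace=True, random_state=0)
    for team, group in df.groupby("Team")
])
print(sampled)
```

With the sample data this assigns 4, 3, and 1 rows to teams A, B, and C. The explicit loop with `pd.concat` is used here instead of `groupby(...).apply(...)` only to keep the 'Team' column handling obvious; `random_state=0` is set purely to make the sketch repeatable and can be dropped.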