Randomly select unique row from dataframe in Pandas
Say I have a dataframe of the following form, where rn is the row index:
     A1 | A2 | A3
-----------------
r1   x  | 0  | t
r2   y  | 1  | u
r3   z  | 1  | v
r4   x  | 2  | w
r5   z  | 2  | v
r6   x  | 2  | w
If I wanted to subset this dataframe so that column A2 has only unique values, I'd use df.drop_duplicates('A2'). However, that keeps only the first row for each unique value and deletes the rest. For this example, only r2 and r4 would be kept for the duplicated values (along with r1, whose A2 value is already unique).
What I want is for one of the rows sharing a duplicate value to be selected at random, rather than always the first row. So for this example, for A2 == 1, either r2 or r3 is selected randomly, and for A2 == 2, any one of r4, r5 or r6 is selected randomly. How would I go about implementing this?
Shuffle the DataFrame first and then drop the duplicates:
df.sample(frac=1).drop_duplicates(subset='A2')
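To try this end to end, here is a minimal sketch that rebuilds the example frame from the question and applies the shuffle-then-deduplicate step. The variable names (shuffled, subset) and the random_state value are assumptions added for illustration; random_state is only there to make the run repeatable and can be dropped.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame(
    {
        "A1": ["x", "y", "z", "x", "z", "x"],
        "A2": [0, 1, 1, 2, 2, 2],
        "A3": ["t", "u", "v", "w", "v", "w"],
    },
    index=["r1", "r2", "r3", "r4", "r5", "r6"],
)

# frac=1 returns every row, in random order. random_state is optional and
# is an assumption made here so the shuffle can be repeated.
shuffled = df.sample(frac=1, random_state=0)

# drop_duplicates keeps the first row it encounters for each A2 value;
# after the shuffle, that first row is a randomly chosen one.
subset = shuffled.drop_duplicates(subset="A2")
print(subset)
```

The result has one row per distinct A2 value. r1 is always present because it is the only row with A2 == 0, while the rows chosen for A2 == 1 and A2 == 2 depend on the shuffle.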
If the order of the rows is important, you can use sort_index, as @cᴏʟᴅsᴘᴇᴇᴅ suggested:
df.sample(frac=1).drop_duplicates(subset='A2').sort_index()
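The effect of sort_index can be seen with a small sketch, again assuming the example frame from the question. The names picked and ordered, and the random_state value, are made up for illustration.

```python
import pandas as pd

# Example frame from the question, with r1..r6 as the row index.
df = pd.DataFrame(
    {
        "A1": ["x", "y", "z", "x", "z", "x"],
        "A2": [0, 1, 1, 2, 2, 2],
        "A3": ["t", "u", "v", "w", "v", "w"],
    },
    index=["r1", "r2", "r3", "r4", "r5", "r6"],
)

# One random row per A2 value; the rows come back in shuffled order.
picked = df.sample(frac=1, random_state=1).drop_duplicates(subset="A2")

# Sorting by index label puts the surviving rows back in ascending order.
ordered = picked.sort_index()

print(list(picked.index))   # order depends on the shuffle
print(list(ordered.index))  # ascending by index label
```

Note that sort_index sorts by label, so it reproduces the original row order only when the original index was already sorted, as it is here.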