Randomly select unique row from dataframe in Pandas

Say I have a DataFrame of the following form, where rn is the row index:

       A1  |  A2 |  A3 
      -----------------
r1     x   |  0  |  t
r2     y   |  1  |  u
r3     z   |  1  |  v
r4     x   |  2  |  w
r5     z   |  2  |  v
r6     x   |  2  |  w

If I wanted to subset this DataFrame so that column A2 has only unique values, I'd use df.drop_duplicates('A2'). However, that keeps only the first row for each unique value and deletes the rest. For this example, only r1, r2 and r4 would be in the subset.

What I want is for the kept row to be chosen at random from each group of duplicates, rather than always being the first one. So for this example, for A2 == 1, either r2 or r3 is selected randomly, and for A2 == 2, any one of r4, r5 or r6 is selected randomly. How would I go about implementing this?

Shuffle the DataFrame first and then drop the duplicates:

df.sample(frac=1).drop_duplicates(subset='A2')
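To see the one-liner in action, here is a minimal sketch that rebuilds the example frame from the question and applies it. The helper name pick_random_unique and its seed argument are my own additions for illustration (they are not part of pandas); seed is simply passed to the random_state parameter of DataFrame.sample so that a run can be repeated.

```python
import pandas as pd

# Example frame from the question; r1..r6 are the row index labels.
df = pd.DataFrame(
    {
        "A1": ["x", "y", "z", "x", "z", "x"],
        "A2": [0, 1, 1, 2, 2, 2],
        "A3": ["t", "u", "v", "w", "v", "w"],
    },
    index=["r1", "r2", "r3", "r4", "r5", "r6"],
)


def pick_random_unique(frame, column, seed=None):
    """Hypothetical helper: shuffle all rows, then keep the first row
    encountered for each distinct value of `column`."""
    # frac=1 returns every row in random order; seed=None means a fresh shuffle.
    shuffled = frame.sample(frac=1, random_state=seed)
    # drop_duplicates keeps the first occurrence, which is now a random one.
    return shuffled.drop_duplicates(subset=column)


subset = pick_random_unique(df, "A2", seed=0)
print(subset)
```

Because drop_duplicates still keeps the first occurrence, the randomness comes entirely from the shuffle. Leaving seed as None should give a different selection on different runs.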

If the order of the rows is important, you can use sort_index, as @cᴏʟᴅsᴘᴇᴇᴅ suggested:

df.sample(frac=1).drop_duplicates(subset='A2').sort_index()
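A different technique, not the one used in this answer, is to sample one row per group directly. The sketch below assumes pandas 1.1 or later, where groupby objects have a sample method; the frame is the example from the question, and the random_state value is only there to make the run repeatable.

```python
import pandas as pd

# Example frame from the question; r1..r6 are the row index labels.
df = pd.DataFrame(
    {
        "A1": ["x", "y", "z", "x", "z", "x"],
        "A2": [0, 1, 1, 2, 2, 2],
        "A3": ["t", "u", "v", "w", "v", "w"],
    },
    index=["r1", "r2", "r3", "r4", "r5", "r6"],
)

# Draw exactly one random row from each group of equal A2 values.
# Assumes pandas >= 1.1 (GroupBy.sample); drop random_state for a fresh draw.
picked = df.groupby("A2").sample(n=1, random_state=0)
print(picked)
```

The selected rows keep their original index labels, so sort_index() can still be chained on if the original row order matters.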
