Randomly select unique rows from a pandas DataFrame

Question

Say I have a dataframe of the following form, where rn is the row index:

       A1  |  A2 |  A3 
      -----------------
r1     x   |  0  |  t
r2     y   |  1  |  u
r3     z   |  1  |  v
r4     x   |  2  |  w
r5     z   |  2  |  v
r6     x   |  2  |  w

If I wanted to subset this dataframe such that the column A2 has only unique values, I'd use df.drop_duplicates('A2'). However, that would keep only the first row for each unique value and delete the rest. For this example, only r1, r2 and r4 would be in the subset.
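
To make that concrete, here is a minimal sketch that rebuilds the table above and shows the default behaviour. The construction code is an assumption about how the frame was created; only drop_duplicates('A2') comes from the question.

```python
import pandas as pd

# Assumed reconstruction of the example table, with r1..r6 as the row index.
df = pd.DataFrame(
    {
        "A1": ["x", "y", "z", "x", "z", "x"],
        "A2": [0, 1, 1, 2, 2, 2],
        "A3": ["t", "u", "v", "w", "v", "w"],
    },
    index=["r1", "r2", "r3", "r4", "r5", "r6"],
)

# By default (keep='first') the first row of each distinct A2 value survives.
deduped = df.drop_duplicates("A2")
print(list(deduped.index))  # ['r1', 'r2', 'r4']
```

r1 is kept because its value 0 occurs only once; r3, r5 and r6 are dropped every time, which is the behaviour I want to randomize.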

What I want is for one of the rows with a duplicated value to be selected at random, rather than always the first row. So for this example, for A2 == 1, either r2 or r3 is selected randomly, and for A2 == 2, any one of r4, r5 or r6 is selected randomly. How would I go about implementing this?

Answer 1

Shuffle the DataFrame first and then drop the duplicates:

df.sample(frac=1).drop_duplicates(subset='A2')
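
As a rough, self-contained illustration of the line above, the sketch below rebuilds the frame from the question (the construction is an assumption) and applies the shuffle-then-deduplicate step. The printed rows will vary from run to run.

```python
import pandas as pd

# Assumed reconstruction of the example table, with r1..r6 as the row index.
df = pd.DataFrame(
    {
        "A1": ["x", "y", "z", "x", "z", "x"],
        "A2": [0, 1, 1, 2, 2, 2],
        "A3": ["t", "u", "v", "w", "v", "w"],
    },
    index=["r1", "r2", "r3", "r4", "r5", "r6"],
)

# sample(frac=1) returns every row in random order; drop_duplicates then keeps
# whichever row of each A2 value happens to come first after the shuffle.
subset = df.sample(frac=1).drop_duplicates(subset="A2")
print(subset)
```

Because the shuffle is uniform, each row within a group of equal A2 values should have the same chance of being the one that survives.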

If the order of the rows is important, you can use sort_index to restore the original index order, as @cᴏʟᴅsᴘᴇᴇᴅ suggested:

df.sample(frac=1).drop_duplicates(subset='A2').sort_index()
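
If you also need repeatable results, one possible sketch is to wrap the same chain in a small helper and pass a seed through the random_state parameter of sample. The helper name pick_random_unique and its seed argument are hypothetical, introduced here only for illustration, and the frame construction is an assumption based on the question.

```python
import pandas as pd


def pick_random_unique(df, column, seed=None):
    """Keep one randomly chosen row per distinct value of `column`.

    Hypothetical helper: shuffles, drops duplicates, then sorts by index.
    """
    # random_state makes the shuffle reproducible when a seed is given.
    shuffled = df.sample(frac=1, random_state=seed)
    return shuffled.drop_duplicates(subset=column).sort_index()


# Assumed reconstruction of the example table, with r1..r6 as the row index.
df = pd.DataFrame(
    {
        "A1": ["x", "y", "z", "x", "z", "x"],
        "A2": [0, 1, 1, 2, 2, 2],
        "A3": ["t", "u", "v", "w", "v", "w"],
    },
    index=["r1", "r2", "r3", "r4", "r5", "r6"],
)

first = pick_random_unique(df, "A2", seed=0)
second = pick_random_unique(df, "A2", seed=0)
print(first)
```

With the same seed, both calls should return the same rows; leaving seed as None gives a fresh random pick on each call.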

Answer 1: 3, accepted, 2017-11-13 19:25:54
