Keep only the unique duplicates from a pandas DataFrame

Question

Edit: the desired output for the example below:

first second third fourth fifth
1     2      3     4      5

Edit 2: changed count() to size().

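While analyzing data I have run into several cases where I want to return all duplicated rows, but only one row per duplicate. I am trying to do this in pandas with Python 3.

Using groupby and a count I can get the output I want, but it is not intuitive. The pandas "duplicated" function does not return the desired output, because it returns multiple rows when a row is duplicated more than twice.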

    import pandas as pd

    data = [[1,2,3,4,5],
            [1,2,3,4,5],
            [1,2,3,4,5],
            [4,5,6,7,8]]

    x = pd.DataFrame(data)
    x.columns = ['first','second','third','fourth','fifth']

    # True for each distinct row that occurs more than once
    x.groupby(list(x.columns)).size() > 1

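The groupby approach gives me the output I want, whereas using

    x[x.duplicated(keep='first')]

still returns repeated rows. Is there a more Pythonic way to return only the unique duplicates?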
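For reference, the groupby idea can be taken one step further so that it yields a DataFrame rather than a boolean Series. This is only a minimal sketch on the example data; the names `counts` and `unique_dups` are made up for illustration.

```python
import pandas as pd

data = [[1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5],
        [4, 5, 6, 7, 8]]
x = pd.DataFrame(data, columns=['first', 'second', 'third', 'fourth', 'fifth'])

# Count how many times each distinct row occurs.
counts = x.groupby(list(x.columns)).size()

# Keep only the groups seen more than once, then turn the group keys
# (stored in the index) back into ordinary columns.
# `unique_dups` is an illustrative name, not part of pandas.
unique_dups = counts[counts > 1].index.to_frame(index=False)

print(unique_dups)
```

The original row labels are lost with this approach, since the result is rebuilt from the group keys.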

Answer 1

Use

x.drop_duplicates()


   first  second  third  fourth  fifth
0      1       2      3       4      5
3      4       5      6       7      8

Answer 2

You can take what you already select with duplicated and then chain drop_duplicates, for example:

print (x[x.duplicated()].drop_duplicates())
   first  second  third  fourth  fifth
1      1       2      3       4      5
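If this is needed in several places, the same chain could be wrapped in a small helper. The sketch below is an assumption-laden variant, not part of pandas: `unique_duplicates` is a hypothetical name, and it passes `keep=False` to `duplicated` so that the first occurrence of each group is marked too, which means the row that survives keeps the label of the first occurrence (0 here) rather than the second (1 above).

```python
import pandas as pd


def unique_duplicates(df, subset=None):
    """Return one row per group of rows that occurs more than once.

    Hypothetical helper for illustration; not a pandas function.
    """
    # keep=False marks every member of a duplicated group, including the first.
    mask = df.duplicated(subset=subset, keep=False)
    # drop_duplicates then keeps the first row of each marked group.
    return df[mask].drop_duplicates(subset=subset)


x = pd.DataFrame([[1, 2, 3, 4, 5],
                  [1, 2, 3, 4, 5],
                  [1, 2, 3, 4, 5],
                  [4, 5, 6, 7, 8]],
                 columns=['first', 'second', 'third', 'fourth', 'fifth'])

result = unique_duplicates(x)
print(result)
```

The optional `subset` argument is forwarded to both calls, so duplicates can be judged on a few columns only.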

Answer 3

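You can still use .duplicated() to check whether a row is a duplicate. It returns True for a row that repeats an earlier one.

After that, we create a flag and loop over the result to pick out only the first row of each run of duplicates. See my code for the details.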

import pandas as pd

data = [[1,2,3,4,5],
        [1,2,3,4,5],
        [1,2,3,4,5],
        [4,5,6,7,8]]

x = pd.DataFrame(data)
x.columns = ['first','second','third','fourth','fifth']

lastFlag = False  # duplicate flag of the previous row
dupl = x.duplicated()  # True for every row that repeats an earlier one
for i in range(len(dupl)):  # walk the rows in order
    # print only the first row of each run of duplicates
    if lastFlag != dupl[i]:
        lastFlag = dupl[i]
        if dupl[i]:
            print(x.iloc[i, :])  # printed as a pandas.Series

Hope this helps.
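One caveat: the flag only changes when a duplicate follows a non-duplicate, so the loop seems to assume that repeated rows sit next to each other. If that does not hold for your data, a variant that remembers which rows it has already seen may be safer. This is a rough sketch with interleaved example data of my own; `seen`, `reported` and `first_dups` are illustrative names.

```python
import pandas as pd

# Example data where the repeated rows are interleaved (assumed for illustration).
x = pd.DataFrame([[1, 2, 3, 4, 5],
                  [4, 5, 6, 7, 8],
                  [1, 2, 3, 4, 5],
                  [4, 5, 6, 7, 8],
                  [1, 2, 3, 4, 5]],
                 columns=['first', 'second', 'third', 'fourth', 'fifth'])

seen = set()       # rows encountered at least once
reported = set()   # duplicated rows that were already collected
first_dups = []    # positions of the first repeat of each duplicated row

# itertuples(index=False, name=None) yields each row as a plain tuple,
# which is hashable and can therefore be stored in a set.
for i, row in enumerate(x.itertuples(index=False, name=None)):
    if row in seen and row not in reported:
        reported.add(row)
        first_dups.append(i)
    seen.add(row)

result = x.iloc[first_dups]
print(result)
```

For large frames the vectorized `duplicated` / `drop_duplicates` route will usually be faster than any Python-level loop.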

Answer 1: 0, 2019-08-13 01:12:17

Answer 2: 0, accepted, 2019-08-13 01:27:52

Answer 3: 0, 2019-08-13 01:31:22