How do I drop duplicates and keep the first value in pandas?

I want to drop duplicates and keep the first value. The duplicates I want to drop are the rows where A = 'df'. Here's my data:

A   B   C   D   E
qw  1   3   1   1
er  2   4   2   6
ew  4   8   44  4
df  34  34  34  34
df  2   5   2   2
df  3   3   7   3
df  4   4   7   4
we  2   5   5   2
we  4   4   4   4
df  34  9   34  34
df  3   3   9   3
we  4   7   4   4
qw  2   2   7   2

So the result will be:

A   B   C   D   E
qw  1   3   1   1
er  2   4   2   6
ew  4   8   44  4
**df**  34  34  34  34
we  2   5   5   2
we  4   4   4   4
**df**  34  9   34  34
we  4   7   4   4
qw  2   2   7   2

Create a helper Series that distinguishes consecutive runs of values in column A, then filter by boolean indexing with an inverted (~) boolean mask. The mask combines duplicated on the helper Series with a second mask that compares column A against the value df:

s = df['A'].ne(df['A'].shift()).cumsum()
df = df[~((df['A'] == 'df') & (s.duplicated()))]
print(df)
     A   B   C   D   E
0   qw   1   3   1   1
1   er   2   4   2   6
2   ew   4   8  44   4
3   df  34  34  34  34
7   we   2   5   5   2
8   we   4   4   4   4
9   df  34   9  34  34
11  we   4   7   4   4
12  qw   2   2   7   2
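
To see why this works, it can help to print the intermediate values. The following is a minimal sketch that rebuilds only column A from the question; the names a, starts, run_id and to_drop are illustrative and not part of the answer above.

```python
import pandas as pd

a = pd.Series(["qw", "er", "ew", "df", "df", "df", "df",
               "we", "we", "df", "df", "we", "qw"], name="A")

# True wherever the value differs from the previous row, i.e. at the start of a run.
starts = a.ne(a.shift())
# The running total of run starts gives every consecutive run its own id.
run_id = starts.cumsum()
print(run_id.tolist())             # [1, 2, 3, 4, 4, 4, 4, 5, 5, 6, 6, 7, 8]

# Within a run, every row after the first repeats an already seen run id.
to_drop = (a == "df") & run_id.duplicated()
print(a[to_drop].index.tolist())   # [4, 5, 6, 10]
```

Because the run id changes whenever the value changes, the second block of df rows (run id 6) is treated separately from the first (run id 4), so each block keeps its own first row.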

Another idea, which in my opinion is more readable, is to take only the index labels where df.A == "df", compare each label with the previous one, and keep the labels where the difference equals 1. Those rows are then dropped with df.drop(). Note that this relies on the index being consecutive integers, as in the default index.

idx = df[df.A == "df"].index             # [3, 4, 5, 6, 9, 10]
m = idx - np.roll(idx, 1) == 1           # [False, True, True, True, False, True]
df.drop(idx[m], inplace = True)          # [4,5,6,10]                <-- These we drop
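
If the index is not a plain run of consecutive integers, the same idea can be applied to row positions instead of labels. This is only a sketch under that assumption: the helper name drop_repeats and the demo frame are hypothetical, and np.diff with prepend (available in NumPy 1.16 and later) is swapped in for np.roll so that no wrap-around comparison is involved.

```python
import numpy as np
import pandas as pd

def drop_repeats(frame, column, value):
    """Keep only the first row of each consecutive run of `value` in `column`."""
    # Integer positions (not index labels) of the rows holding `value`.
    pos = np.flatnonzero(frame[column].to_numpy() == value)
    # A row is a repeat when the previous matching row sits directly above it.
    # prepend=-2 makes sure the very first match is never flagged.
    repeat = np.diff(pos, prepend=-2) == 1
    keep = np.ones(len(frame), dtype=bool)
    keep[pos[repeat]] = False
    return frame[keep]

demo = pd.DataFrame(
    {"A": ["qw", "df", "df", "we", "df", "df", "df"]},
    index=list("abcdefg"),  # non-integer labels on purpose
)
print(drop_repeats(demo, "A", "df").index.tolist())   # ['a', 'b', 'd', 'e']
```

Unlike the snippet above, this version returns a new frame rather than modifying df in place.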

Time comparison

Runs as fast as jezrael's solution on the test sample below.

1000 loops, best of 3: 1.38 ms per loop

1000 loops, best of 3: 1.38 ms per loop
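
A rough way to reproduce such a comparison is the standard-library timeit module. The sketch below is an assumption about how the measurement could be set up, not the original benchmark: the function names with_cumsum and with_roll and the loop count of 200 are made up for illustration, and the timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Only column A matters for the filtering, so the sample is reduced to it.
sample = pd.DataFrame({"A": ["qw", "er", "ew", "df", "df", "df", "df",
                             "we", "we", "df", "df", "we", "qw"]})

def with_cumsum(frame):
    # Helper Series of run ids, then drop repeated ids where A == 'df'.
    s = frame["A"].ne(frame["A"].shift()).cumsum()
    return frame[~((frame["A"] == "df") & s.duplicated())]

def with_roll(frame):
    # Index labels of the 'df' rows, flagged when they directly follow another one.
    idx = frame[frame["A"] == "df"].index
    m = idx - np.roll(idx, 1) == 1
    return frame.drop(idx[m])

# Both approaches should agree before their timings are compared.
assert with_cumsum(sample).equals(with_roll(sample))

for func in (with_cumsum, with_roll):
    seconds = timeit.timeit(lambda: func(sample), number=200)
    print(f"{func.__name__}: {seconds / 200 * 1e3:.3f} ms per loop")
```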


Full example

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'A': {0: 'qw', 1: 'er', 2: 'ew', 3: 'df', 4: 'df', 5: 'df', 6: 'df', 7: 'we', 
            8: 'we', 9: 'df', 10: 'df', 11: 'we', 12: 'qw'}, 
    'B': {0: 1, 1: 2, 2: 4, 3: 34, 4: 2, 5: 3, 6: 4, 7: 2, 8: 4, 9: 34, 10: 3, 
          11: 4, 12: 2}, 
    'C': {0: 3, 1: 4, 2: 8, 3: 34, 4: 5, 5: 3, 6: 4, 7: 5, 8: 4, 9: 9, 10: 3, 
          11: 7, 12: 2}, 
    'D': {0: 1, 1: 2, 2: 44, 3: 34, 4: 2, 5: 7, 6: 7, 7: 5, 8: 4, 9: 34, 10: 9, 
          11: 4, 12: 7}, 
    'E': {0: 1, 1: 6, 2: 4, 3: 34, 4: 2, 5: 3, 6: 4, 7: 2, 8: 4, 9: 34, 10: 3, 
          11: 4, 12: 2}}
)

idx = df[df.A == "df"].index
m = idx - np.roll(idx, 1) == 1
df.drop(idx[m], inplace = True)

Using cumcount()

import pandas as pd
import numpy as np
df['cum'] = df.groupby(['A']).cumcount()
df['cum2'] = np.append([0],np.diff(df.cum))
df.query("~((A == 'df') & (cum2 == 1))").drop(['cum','cum2'],axis=1)

df looks like:

In [6]: df
Out[6]: 
     A   B   C   D   E  cum
0   qw   1   3   1   1    0
1   er   2   4   2   6    0
2   ew   4   8  44   4    0
3   df  34  34  34  34    0
4   df   2   5   2   2    1
5   df   3   3   7   3    2
6   df   4   4   7   4    3
7   we   2   5   5   2    0
8   we   4   4   4   4    1
9   df  34   9  34  34    4
10  df   3   3   9   3    5
11  we   4   7   4   4    2
12  qw   2   2   7   2    1

np.diff

In [7]: df['cum2'] = np.append([0],np.diff(df.cum))

In [8]: df
Out[8]: 
     A   B   C   D   E  cum  cum2
0   qw   1   3   1   1    0     0
1   er   2   4   2   6    0     0
2   ew   4   8  44   4    0     0
3   df  34  34  34  34    0     0
4   df   2   5   2   2    1     1
5   df   3   3   7   3    2     1
6   df   4   4   7   4    3     1
7   we   2   5   5   2    0    -3
8   we   4   4   4   4    1     1
9   df  34   9  34  34    4     3
10  df   3   3   9   3    5     1
11  we   4   7   4   4    2    -3
12  qw   2   2   7   2    1    -1

output

In [12]: df.query("~((A == 'df') & (cum2 == 1))").drop(['cum','cum2'],axis=1)
Out[12]: 
     A   B   C   D   E
0   qw   1   3   1   1
1   er   2   4   2   6
2   ew   4   8  44   4
3   df  34  34  34  34
7   we   2   5   5   2
8   we   4   4   4   4
9   df  34   9  34  34
11  we   4   7   4   4
12  qw   2   2   7   2
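
One thing worth checking with this approach: cumcount() counts per value of A across the whole frame, so a difference of 1 between neighbouring rows does not always mean the two rows belong to the same run. The small frame below is a made-up edge case (not from the question) in which the second df row starts a new run but still gets cum2 == 1. It is compared against the shift/cumsum filter from the first answer.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({"A": ["df", "we", "df"], "B": [1, 2, 3]})

# cumcount-based filter, as in the answer above.
tmp = small.copy()
tmp["cum"] = tmp.groupby("A").cumcount()                        # [0, 0, 1]
tmp["cum2"] = np.append([0], np.diff(tmp["cum"].to_numpy()))    # [0, 0, 1]
by_cumcount = tmp.query("~((A == 'df') & (cum2 == 1))").drop(["cum", "cum2"], axis=1)

# Run-based filter: every change of value in A starts a new run id.
runs = small["A"].ne(small["A"].shift()).cumsum()               # [1, 2, 3]
by_runs = small[~((small["A"] == "df") & runs.duplicated())]

print(by_cumcount.index.tolist())   # [0, 1]
print(by_runs.index.tolist())       # [0, 1, 2]
```

On the question's data both filters give the same rows, so this only matters for inputs where a df row follows a row of another value whose running count happens to be one lower.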

Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.cumcount.html
