How do I drop duplicates and keep the first value in pandas?
I want to drop duplicates and keep the first value. The duplicates to remove are the rows where A == 'df'. Here is my data:
A B C D E
qw 1 3 1 1
er 2 4 2 6
ew 4 8 44 4
df 34 34 34 34
df 2 5 2 2
df 3 3 7 3
df 4 4 7 4
we 2 5 5 2
we 4 4 4 4
df 34 9 34 34
df 3 3 9 3
we 4 7 4 4
qw 2 2 7 2
So the result would be:
A B C D E
qw 1 3 1 1
er 2 4 2 6
ew 4 8 44 4
**df** 34 34 34 34
we 2 5 5 2
we 4 4 4 4
**df** 34 9 34 34
we 4 7 4 4
qw 2 2 7 2
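Create a helper Series that labels each run of consecutive equal values in column A, then filter with boolean indexing. The mask is built by chaining the comparison A == 'df' with duplicated() on the helper Series, and is then inverted with ~: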
s = df['A'].ne(df['A'].shift()).cumsum()
df = df[~((df['A'] == 'df') & (s.duplicated()))]
print (df)
A B C D E
0 qw 1 3 1 1
1 er 2 4 2 6
2 ew 4 8 44 4
3 df 34 34 34 34
7 we 2 5 5 2
8 we 4 4 4 4
9 df 34 9 34 34
11 we 4 7 4 4
12 qw 2 2 7 2
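To see why this works, it can help to print the helper Series itself. The following is a minimal sketch that assumes only column A from the question (the other columns do not affect the mask); the names mask and result are introduced here for illustration.

```python
import pandas as pd

# Column A from the question; the other columns are omitted (assumption).
df = pd.DataFrame({'A': ['qw', 'er', 'ew', 'df', 'df', 'df', 'df',
                         'we', 'we', 'df', 'df', 'we', 'qw']})

# A new run id starts every time the value in A changes.
s = df['A'].ne(df['A'].shift()).cumsum()

# True for 'df' rows that are not the first row of their consecutive run.
mask = (df['A'] == 'df') & s.duplicated()

result = df[~mask]
print(s.tolist())             # [1, 2, 3, 4, 4, 4, 4, 5, 5, 6, 6, 7, 8]
print(result.index.tolist())  # [0, 1, 2, 3, 7, 8, 9, 11, 12]
```

Each run of equal values shares one id in s, so duplicated() flags every row after the first in a run, and the A == 'df' condition limits the removal to the 'df' runs.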
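Another idea, which in my opinion is more readable, is to shift only the index of the rows where df.A == "df" and keep the ids where the difference to the previous one equals 1. Those rows are then dropped with df.drop().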
idx = df[df.A == "df"].index # [3, 4, 5, 6, 9, 10]
m = idx - np.roll(idx, 1) == 1 # [False, True, True, True, False, True]
df.drop(idx[m], inplace = True) # [4,5,6,10] <-- These we drop
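Note that the snippet above subtracts index labels, so it appears to rely on the default integer index (0, 1, 2, ...). A possible variant, sketched here under the assumption that the index may be arbitrary, works on row positions instead. The frame and the names pos, repeat and keep are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a non-default index, to show the positional variant.
df = pd.DataFrame(
    {'A': ['qw', 'df', 'df', 'we', 'df', 'df', 'df'],
     'B': [1, 2, 3, 4, 5, 6, 7]},
    index=list('abcdefg'),
)

# Integer positions (not labels) of the rows where A == 'df'.
pos = np.flatnonzero((df['A'] == 'df').to_numpy())

# A position is a repeat when it directly follows the previous 'df' position.
# prepend=-2 guarantees that the first position is never flagged.
repeat = np.diff(pos, prepend=-2) == 1

# Build a boolean keep-mask over all rows and switch off the repeats.
keep = np.ones(len(df), dtype=bool)
keep[pos[repeat]] = False

result = df[keep]
print(result.index.tolist())  # ['a', 'b', 'd', 'e']
```

Using np.diff with prepend avoids the wrap-around comparison that np.roll performs on the first element.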
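Time comparison
Using the following test sample, it runs at the same speed as jezrael's solution.
1000 loops, best of 3: 1.38 ms per loop
1000 loops, best of 3: 1.38 ms per loop
Full example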
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'A': {0: 'qw', 1: 'er', 2: 'ew', 3: 'df', 4: 'df', 5: 'df', 6: 'df', 7: 'we',
8: 'we', 9: 'df', 10: 'df', 11: 'we', 12: 'qw'},
'B': {0: 1, 1: 2, 2: 4, 3: 34, 4: 2, 5: 3, 6: 4, 7: 2, 8: 4, 9: 34, 10: 3,
11: 4, 12: 2},
'C': {0: 3, 1: 4, 2: 8, 3: 34, 4: 5, 5: 3, 6: 4, 7: 5, 8: 4, 9: 9, 10: 3,
11: 7, 12: 2},
'D': {0: 1, 1: 2, 2: 44, 3: 34, 4: 2, 5: 7, 6: 7, 7: 5, 8: 4, 9: 34, 10: 9,
11: 4, 12: 7},
'E': {0: 1, 1: 6, 2: 4, 3: 34, 4: 2, 5: 3, 6: 4, 7: 2, 8: 4, 9: 34, 10: 3,
11: 4, 12: 2}}
)
idx = df[df.A == "df"].index
m = idx - np.roll(idx, 1) == 1
df.drop(idx[m], inplace = True)
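The benchmark code behind the timings quoted above is not shown. A rough way to run a comparable measurement with the standard timeit module might look like the sketch below. The function names shift_cumsum and roll_drop are made up for this example, the frame is reduced to two columns, and the numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Reduced version of the sample data: column A plus one value column (assumption).
df = pd.DataFrame({'A': ['qw', 'er', 'ew', 'df', 'df', 'df', 'df',
                         'we', 'we', 'df', 'df', 'we', 'qw'],
                   'B': [1, 2, 4, 34, 2, 3, 4, 2, 4, 34, 3, 4, 2]})

def shift_cumsum(frame):
    # Helper-Series approach: flag repeats inside each consecutive run.
    s = frame['A'].ne(frame['A'].shift()).cumsum()
    return frame[~((frame['A'] == 'df') & s.duplicated())]

def roll_drop(frame):
    # Index-difference approach; returns a new frame instead of using inplace,
    # so that repeated timing runs always see the same input.
    idx = frame[frame.A == "df"].index
    m = idx - np.roll(idx, 1) == 1
    return frame.drop(idx[m])

for fn in (shift_cumsum, roll_drop):
    seconds = timeit.timeit(lambda: fn(df), number=1000)
    print(f"{fn.__name__}: {seconds:.3f} ms per loop")
```

Since the total for 1000 calls is measured in seconds, the same number read in milliseconds is the average per call.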
Using cumcount():
import pandas as pd
import numpy as np
df['cum'] = df.groupby(['A']).cumcount()
df['cum2'] = np.append([0],np.diff(df.cum))
df.query("~((A == 'df') & (cum2 == 1))").drop(['cum','cum2'],axis=1)
df looks like this:
In [6]: df
Out[6]:
A B C D E cum
0 qw 1 3 1 1 0
1 er 2 4 2 6 0
2 ew 4 8 44 4 0
3 df 34 34 34 34 0
4 df 2 5 2 2 1
5 df 3 3 7 3 2
6 df 4 4 7 4 3
7 we 2 5 5 2 0
8 we 4 4 4 4 1
9 df 34 9 34 34 4
10 df 3 3 9 3 5
11 we 4 7 4 4 2
12 qw 2 2 7 2 1
The differences:
In [7]: df['cum2'] = np.append([0],np.diff(df.cum))
In [8]: df
Out[8]:
A B C D E cum cum2
0 qw 1 3 1 1 0 0
1 er 2 4 2 6 0 0
2 ew 4 8 44 4 0 0
3 df 34 34 34 34 0 0
4 df 2 5 2 2 1 1
5 df 3 3 7 3 2 1
6 df 4 4 7 4 3 1
7 we 2 5 5 2 0 -3
8 we 4 4 4 4 1 1
9 df 34 9 34 34 4 3
10 df 3 3 9 3 5 1
11 we 4 7 4 4 2 -3
12 qw 2 2 7 2 1 -1
Output:
In [12]: df.query("~((A == 'df') & (cum2 == 1))").drop(['cum','cum2'],axis=1)
Out[12]:
A B C D E
0 qw 1 3 1 1
1 er 2 4 2 6
2 ew 4 8 44 4
3 df 34 34 34 34
7 we 2 5 5 2
8 we 4 4 4 4
9 df 34 9 34 34
11 we 4 7 4 4
12 qw 2 2 7 2
Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.cumcount.html
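One caveat with the cumcount() approach seems worth checking on your own data: a difference of 1 can also occur when a 'df' row follows a row from another group whose counter happens to be one lower. The sketch below uses a small hypothetical frame to show that case, next to a comparison with the previous row via shift(), which is an alternative and not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical input: the two 'df' rows are NOT adjacent, so both should stay.
df = pd.DataFrame({'A': ['df', 'we', 'df'], 'B': [1, 2, 3]})

cum = df.groupby('A').cumcount()
cum2 = np.append([0], np.diff(cum.to_numpy()))

# cumcount/diff approach: row 2 has cum2 == 1 by coincidence and is dropped.
by_cumcount = df[~((df['A'] == 'df') & (cum2 == 1))]

# Comparing each row with its predecessor flags only truly consecutive repeats.
by_shift = df[~((df['A'] == 'df') & (df['A'].shift() == 'df'))]

print(by_cumcount.index.tolist())  # [0, 1]
print(by_shift.index.tolist())     # [0, 1, 2]
```

On the data from the question both variants give the same result; they only diverge for inputs like the one above.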