Creating a Backup Dataframe from Entries Discarded by Pandas drop_duplicates()
I have a pandas dataframe, ds. I want to drop duplicate entries based on a specific column called "Name".
+---------+------+-------+----------+--------+
| Invoice | Name | Price | Date | Coupon |
+---------+------+-------+----------+--------+
| 123412 | Jim | 50 | 12/01/17 | ALBB1 |
| 431311 | Jane | 25 | 12/02/17 | BB2 |
| 134123 | Joe | 70 | 12/03/17 | BB2 |
| 333131 | Jim | 85 | 12/04/17 | ALBB1 |
+---------+------+-------+----------+--------+
Here is my code:
ds = ds.drop_duplicates(subset='Name', keep='first')
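I am using the keep='first' option to keep the first instance found in the dataframe.
What I'd like to do is create a separate dataframe from all of the discarded entries.
So, in this example, the second dataframe ds2 would be equal to: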
+---------+------+-------+----------+--------+
| Invoice | Name | Price | Date | Coupon |
+---------+------+-------+----------+--------+
| 333131 | Jim | 85 | 12/04/17 | ALBB1 |
+---------+------+-------+----------+--------+
Use duplicated to build a boolean mask, then filter with boolean indexing.
Note: keep='first' can be omitted, because it is the default value.
df1 = df[df.duplicated(subset='Name')]
print (df1)
Invoice Name Price Date Coupon
3 333131 Jim 85 12/04/17 ALBB1
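The snippet above assumes df already exists. As a minimal self-contained sketch, the frame below is rebuilt by hand from the table in the question (dates are kept as plain strings, which is an assumption about the original data), and the discarded rows are collected into ds2:

```python
import pandas as pd

# Sample data copied from the table in the question.
# Dates are stored as plain strings here (an assumption, not confirmed by the source).
df = pd.DataFrame({
    "Invoice": [123412, 431311, 134123, 333131],
    "Name": ["Jim", "Jane", "Joe", "Jim"],
    "Price": [50, 25, 70, 85],
    "Date": ["12/01/17", "12/02/17", "12/03/17", "12/04/17"],
    "Coupon": ["ALBB1", "BB2", "BB2", "ALBB1"],
})

# duplicated() flags every repeat of a Name after its first occurrence,
# which are exactly the rows drop_duplicates(subset='Name') would discard.
ds2 = df[df.duplicated(subset="Name")]
print(ds2)
```

The original index labels are preserved, so the discarded row keeps label 3.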
This boolean mask can be used to produce both DataFrames; ~ inverts the boolean mask:
m = df.duplicated(subset='Name')
df1 = df[m]
print (df1)
Invoice Name Price Date Coupon
3 333131 Jim 85 12/04/17 ALBB1
df2 = df[~m]
print (df2)
Invoice Name Price Date Coupon
0 123412 Jim 50 12/01/17 ALBB1
1 431311 Jane 25 12/02/17 BB2
2 134123 Joe 70 12/03/17 BB2
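If both frames are needed often, the mask logic can be wrapped in a small helper. This is only a sketch: split_duplicates is a hypothetical name, not a pandas function, and the sample frame is a trimmed version of the question's data. The last line checks that the kept part matches what drop_duplicates returns for the same arguments.

```python
import pandas as pd


def split_duplicates(frame, subset, keep="first"):
    """Hypothetical helper: return (kept, dropped) using one duplicated() mask."""
    mask = frame.duplicated(subset=subset, keep=keep)
    return frame[~mask], frame[mask]


# Trimmed sample data based on the question's table.
df = pd.DataFrame({
    "Invoice": [123412, 431311, 134123, 333131],
    "Name": ["Jim", "Jane", "Joe", "Jim"],
    "Price": [50, 25, 70, 85],
})

kept, dropped = split_duplicates(df, subset="Name")

# The kept rows should be the same rows drop_duplicates would return.
same_as_drop = kept.equals(df.drop_duplicates(subset="Name", keep="first"))
print(same_as_drop)
```

Computing the mask once keeps the two frames consistent with each other, since every row lands in exactly one of them.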
Details:
print (m)
0 False
1 False
2 False
3 True
dtype: bool
print (~m)
0 True
1 True
2 True
3 False
dtype: bool
EDIT:
It is also possible to use keep='last' to extract all duplicates except the last occurrence, or keep=False to extract all duplicated rows:
print (df)
Invoice Name Price Date Coupon
0 123412 Jim 50 12/01/17 ALBB1
1 431311 Jane 25 12/02/17 BB2
2 134123 Joe 70 12/03/17 BB2
3 333131 Jim 85 12/04/17 ALBB1
4 333131 Jim 86 12/04/17 ALBB2 <- added new dupe row
m = df.duplicated(subset='Name')
df11 = df[m]
print (df11)
Invoice Name Price Date Coupon
3 333131 Jim 85 12/04/17 ALBB1
4 333131 Jim 86 12/04/17 ALBB2
m = df.duplicated(subset='Name', keep='last')
df12 = df[m]
print (df12)
Invoice Name Price Date Coupon
0 123412 Jim 50 12/01/17 ALBB1
3 333131 Jim 85 12/04/17 ALBB1
m = df.duplicated(subset='Name', keep=False)
df13 = df[m]
print (df13)
Invoice Name Price Date Coupon
0 123412 Jim 50 12/01/17 ALBB1
3 333131 Jim 85 12/04/17 ALBB1
4 333131 Jim 86 12/04/17 ALBB2
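As a rough side-by-side of the three keep options, the sketch below records which index labels each mask flags on a five-row frame shaped like the one above (only the Name and Price columns are rebuilt, as a simplification). It also shows that inverting the keep=False mask leaves only the names that never repeat.

```python
import pandas as pd

# Five rows with the same Name pattern as the extended example above.
df = pd.DataFrame({
    "Name": ["Jim", "Jane", "Joe", "Jim", "Jim"],
    "Price": [50, 25, 70, 85, 86],
})

# Collect the index labels flagged as duplicates for each keep option.
flagged = {}
for keep in ("first", "last", False):
    mask = df.duplicated(subset="Name", keep=keep)
    flagged[keep] = df.index[mask].tolist()
    print(keep, flagged[keep])

# Inverting the keep=False mask keeps only names that occur exactly once.
never_repeated = df[~df.duplicated(subset="Name", keep=False)]
print(never_repeated)
```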