
Creating a Backup Dataframe from Entries Discarded by Pandas drop_duplicates()

I have a pandas DataFrame, ds. I want to remove rows that have a duplicate value in a specific column named 'Name'.

+---------+------+-------+----------+--------+
| Invoice | Name | Price |   Date   | Coupon |
+---------+------+-------+----------+--------+
|  123412 | Jim  |    50 | 12/01/17 | ALBB1  |
|  431311 | Jane |    25 | 12/02/17 | BB2    |
|  134123 | Joe  |    70 | 12/03/17 | BB2    |
|  333131 | Jim  |    85 | 12/04/17 | ALBB1  |
+---------+------+-------+----------+--------+

Here is my code:

ds = ds.drop_duplicates(subset='Name', keep='first')

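I am using the keep='first' option to keep the first instance found in the DataFrame.

What I want to do is create a separate DataFrame from all of the discarded entries.

So, in this example, the second DataFrame ds2 would be: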

+---------+------+-------+----------+--------+
| Invoice | Name | Price |   Date   | Coupon |
+---------+------+-------+----------+--------+
|  333131 | Jim  |    85 | 12/04/17 | ALBB1  |
+---------+------+-------+----------+--------+

Use duplicated to build a boolean mask, then filter by boolean indexing.

Note: keep='first' can be omitted because it is the default value.

df1 = df[df.duplicated(subset='Name')]
print (df1)
   Invoice Name  Price      Date Coupon
3   333131  Jim     85  12/04/17  ALBB1
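As a quick check of the note about the default, here is a minimal sketch that rebuilds the sample table and compares the two masks. The DataFrame construction is an assumption, since the question shows only the printed table and not the code that created it.

```python
import pandas as pd

# Sample data re-created from the table in the question (assumed construction).
df = pd.DataFrame({
    "Invoice": [123412, 431311, 134123, 333131],
    "Name": ["Jim", "Jane", "Joe", "Jim"],
    "Price": [50, 25, 70, 85],
    "Date": ["12/01/17", "12/02/17", "12/03/17", "12/04/17"],
    "Coupon": ["ALBB1", "BB2", "BB2", "ALBB1"],
})

# Mask with the default keep, and with keep='first' spelled out.
default_mask = df.duplicated(subset="Name")
explicit_mask = df.duplicated(subset="Name", keep="first")

# Both masks flag only the second 'Jim' row (index 3).
same_mask = default_mask.equals(explicit_mask)
print(same_mask)
```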

This boolean mask can be used to produce both DataFrames, with ~ inverting the mask:

m = df.duplicated(subset='Name')
df1 = df[m]
print (df1)
   Invoice Name  Price      Date Coupon
3   333131  Jim     85  12/04/17  ALBB1

df1 = df[~m]
print (df1)

   Invoice  Name  Price      Date Coupon
0   123412   Jim     50  12/01/17  ALBB1
1   431311  Jane     25  12/02/17    BB2
2   134123   Joe     70  12/03/17    BB2

Details:

print (m)
0    False
1    False
2    False
3     True
dtype: bool

print (~m)

0     True
1     True
2     True
3    False
dtype: bool
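Putting the pieces together for the original question, the sketch below splits ds into the rows that drop_duplicates would keep and the rows it would discard. The helper name split_duplicates and the DataFrame construction are hypothetical, introduced only for illustration.

```python
import pandas as pd

def split_duplicates(frame, column):
    """Hypothetical helper: return (kept, discarded) for duplicates in `column`."""
    mask = frame.duplicated(subset=column)  # True for every repeat after the first
    return frame[~mask], frame[mask]

# Sample data re-created from the table in the question (assumed construction).
ds = pd.DataFrame({
    "Invoice": [123412, 431311, 134123, 333131],
    "Name": ["Jim", "Jane", "Joe", "Jim"],
    "Price": [50, 25, 70, 85],
    "Date": ["12/01/17", "12/02/17", "12/03/17", "12/04/17"],
    "Coupon": ["ALBB1", "BB2", "BB2", "ALBB1"],
})

kept, ds2 = split_duplicates(ds, "Name")

# The kept rows should match what drop_duplicates returns for the same column.
matches = kept.equals(ds.drop_duplicates(subset="Name", keep="first"))
print(matches)
print(ds2)
```

Computing the mask once and indexing with it twice means the backup and the deduplicated frame always partition the original rows, so nothing is lost between them.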

EDIT:

You can also use keep='last' to extract all duplicates except the last occurrence, or keep=False to extract every duplicated row:

print (df)
   Invoice  Name  Price      Date Coupon
0   123412   Jim     50  12/01/17  ALBB1
1   431311  Jane     25  12/02/17    BB2
2   134123   Joe     70  12/03/17    BB2
3   333131   Jim     85  12/04/17  ALBB1
4   333131   Jim     86  12/04/17  ALBB2 <- added new dupe row

m = df.duplicated(subset='Name')
df11 = df[m]
print (df11)
   Invoice Name  Price      Date Coupon
3   333131  Jim     85  12/04/17  ALBB1
4   333131  Jim     86  12/04/17  ALBB2

m = df.duplicated(subset='Name', keep='last')
df12 = df[m]
print (df12)
   Invoice Name  Price      Date Coupon
0   123412  Jim     50  12/01/17  ALBB1
3   333131  Jim     85  12/04/17  ALBB1

m = df.duplicated(subset='Name', keep=False)
df13 = df[m]
print (df13)
   Invoice Name  Price      Date Coupon
0   123412  Jim     50  12/01/17  ALBB1
3   333131  Jim     85  12/04/17  ALBB1
4   333131  Jim     86  12/04/17  ALBB2
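To compare the three keep settings side by side, this sketch collects the index labels that each one flags. The five-row DataFrame is re-created from the printed output above, so its construction is an assumption.

```python
import pandas as pd

# Five-row sample re-created from the printed table above (assumed construction).
df = pd.DataFrame({
    "Invoice": [123412, 431311, 134123, 333131, 333131],
    "Name": ["Jim", "Jane", "Joe", "Jim", "Jim"],
    "Price": [50, 25, 70, 85, 86],
    "Date": ["12/01/17", "12/02/17", "12/03/17", "12/04/17", "12/04/17"],
    "Coupon": ["ALBB1", "BB2", "BB2", "ALBB1", "ALBB2"],
})

# Map each keep option to the index labels of the rows it marks as duplicates.
flagged = {}
for keep in ("first", "last", False):
    mask = df.duplicated(subset="Name", keep=keep)
    flagged[str(keep)] = df[mask].index.tolist()

print(flagged)
# {'first': [3, 4], 'last': [0, 3], 'False': [0, 3, 4]}
```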

