How to drop duplicates but keep the row if a particular other column is not null (Pandas)

Question

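I have a lot of duplicate records, and some of them have a bank account. I want to keep the records that have a bank account.

Basically something like this: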

if there are two Tommy Joes:
     keep the one with a bank account

I tried to dedupe with the code below, but it keeps the dupes that have no bank account.

import numpy as np
import pandas as pd

df = pd.DataFrame({'firstname':['foo Bar','Bar Bar','Foo Bar','jim','john','mary','jim'],
                   'lastname':['Foo Bar','Bar','Foo Bar','ryan','con','sullivan','Ryan'],
                   'email':['Foo bar','Bar','Foo Bar','jim@com','john@com','mary@com','Jim@com'],
                   'bank':[np.nan,'abc','xyz',np.nan,'tge','vbc','dfg']})


df


  firstname  lastname     email bank
0   foo Bar   Foo Bar   Foo bar  NaN  
1   Bar Bar       Bar       Bar  abc
2   Foo Bar   Foo Bar   Foo Bar  xyz
3       jim      ryan   jim@com  NaN
4      john       con  john@com  tge
5      mary  sullivan  mary@com  vbc
6       jim      Ryan   Jim@com  dfg



# get the index of unique values, based on firstname, lastname, email
# convert to lower and remove white space first

uniq_indx = (df.dropna(subset=['firstname', 'lastname', 'email'])
.applymap(lambda s:s.lower() if type(s) == str else s)
.applymap(lambda x: x.replace(" ", "") if type(x)==str else x)
.drop_duplicates(subset=['firstname', 'lastname', 'email'], keep='first')).index


# save unique records
dfiban_uniq = df.loc[uniq_indx]

dfiban_uniq



  firstname  lastname     email bank
0   foo Bar   Foo Bar   Foo bar  NaN # should not be here
1   Bar Bar       Bar       Bar  abc
3       jim      ryan   jim@com  NaN # should not be here
4      john       con  john@com  tge
5      mary  sullivan  mary@com  vbc


# I wanted these duplicates to appear in the result:

  firstname  lastname     email bank
2   Foo Bar   Foo Bar   Foo Bar  xyz  
6       jim      Ryan   Jim@com  dfg

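You can see that index 0 and 3 were kept. The versions of these customers that have bank accounts were removed. My expected result is the other way around: remove the dupes that don't have a bank account.

I have thought about sorting by bank account first, but I have a lot of data, and I am not sure how to "sense check" it to see whether it works.

Any help appreciated.

There are a few similar questions here, but all of them seem to have values that can be sorted, such as age. These hashed bank account numbers are very messy.

EDIT:

Some results from trying the answers on my real dataset.

@Erfan's method, sorting values by subset + bank

58594 records remaining after deduplication: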

subset = ['firstname', 'lastname']

df[subset] = df[subset].apply(lambda x: x.str.lower())
df[subset] = df[subset].apply(lambda x: x.replace(" ", ""))
df.sort_values(subset + ['bank'], inplace=True)
df.drop_duplicates(subset, inplace=True)

print(df.shape[0])

58594

@Adam.Er8's answer, sorting values by bank. 59170 records remaining after deduplication:

uniq_indx = (df.sort_values(by="bank", na_position='last').dropna(subset=['firstname', 'lastname', 'email'])
             .applymap(lambda s: s.lower() if type(s) == str else s)
             .applymap(lambda x: x.replace(" ", "") if type(x) == str else x)
             .drop_duplicates(subset=['firstname', 'lastname', 'email'], keep='first')).index

df.loc[uniq_indx].shape[0]

59170

Not sure why there is a difference, but both are similar enough.
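
A possible explanation for the gap, offered as a guess rather than a diagnosis: the first snippet deduplicates on firstname and lastname only while the second also uses email, and in the first snippet x.replace(" ", "") is applied to a whole column, where Series.replace matches entire cell values rather than substrings. The toy values below are made up for illustration and are not from the real dataset.

```python
import pandas as pd

# Made-up sample values, not taken from the real dataset
names = pd.Series(["foo bar", " ", "foobar"])

# Series.replace compares whole cell values, so only the cell that is exactly " " changes
whole_value = names.replace(" ", "")

# Series.str.replace works on substrings, so spaces inside each string are removed
substring = names.str.replace(" ", "", regex=False)

print(whole_value.tolist())
print(substring.tolist())
```

With whole-value matching, "foo bar" and "foobar" stay distinct and would not be treated as duplicates of each other.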

Answer 1

You should sort the values by the bank column with na_position='last' (so that .drop_duplicates(..., keep='first') keeps a value that is not NaN).

Try this:

import pandas as pd
import numpy as np

df = pd.DataFrame({'firstname': ['foo Bar', 'Bar Bar', 'Foo Bar'],
                   'lastname': ['Foo Bar', 'Bar', 'Foo Bar'],
                   'email': ['Foo bar', 'Bar', 'Foo Bar'],
                   'bank': [np.nan, 'abc', 'xyz']})

uniq_indx = (df.sort_values(by="bank", na_position='last').dropna(subset=['firstname', 'lastname', 'email'])
             .applymap(lambda s: s.lower() if type(s) == str else s)
             .applymap(lambda x: x.replace(" ", "") if type(x) == str else x)
             .drop_duplicates(subset=['firstname', 'lastname', 'email'], keep='first')).index

# save unique records
dfiban_uniq = df.loc[uniq_indx]

print(dfiban_uniq)

Output:

  bank    email firstname lastname
1  abc      Bar   Bar Bar      Bar
2  xyz  Foo Bar   Foo Bar  Foo Bar

(This is just your original code with .sort_values(by="bank", na_position='last') added at the beginning of the uniq_indx = ... chain.)
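
The question also asks how to sense check the result on a large dataset. One way this could be sketched, assuming the same normalization (lower case, spaces removed) and the question's sample data: flag any kept row that has no bank value even though some row in its duplicate group does. The names keys, has_bank and bad are illustrative, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'firstname': ['foo Bar', 'Bar Bar', 'Foo Bar', 'jim', 'john', 'mary', 'jim'],
    'lastname': ['Foo Bar', 'Bar', 'Foo Bar', 'ryan', 'con', 'sullivan', 'Ryan'],
    'email': ['Foo bar', 'Bar', 'Foo Bar', 'jim@com', 'john@com', 'mary@com', 'Jim@com'],
    'bank': [np.nan, 'abc', 'xyz', np.nan, 'tge', 'vbc', 'dfg'],
})
key_cols = ['firstname', 'lastname', 'email']

# Normalized copy of the key columns: lower case, spaces removed
keys = df[key_cols].apply(lambda col: col.str.lower().str.replace(' ', '', regex=False))

# Sort so rows with a bank value come first, then keep the first row per key group
sorted_idx = df.sort_values(by='bank', na_position='last').index
kept = df.loc[keys.loc[sorted_idx].drop_duplicates(keep='first').index]

# For every row: does any row in its duplicate group have a bank value?
has_bank = (keys.assign(has=df['bank'].notna())
                .groupby(key_cols)['has']
                .transform('any'))

# A kept row may only lack a bank value if no row in its group has one
bad = kept[kept['bank'].isna() & has_bank.loc[kept.index]]
print(len(bad))
print(sorted(kept.index.tolist()))
```

If bad is empty, no customer with a bank account was replaced by a duplicate without one.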

Answer 2

You can sort by bank account before drop_duplicates so that the duplicates with NaN are placed last:

uniq_indx = (df.dropna(subset=['firstname', 'lastname', 'email'])
.applymap(lambda s:s.lower() if type(s) == str else s)
.applymap(lambda x: x.replace(" ", "") if type(x)==str else x)
.sort_values(by='bank')  # here we sort values by bank column
.drop_duplicates(subset=['firstname', 'lastname', 'email'], keep='first')).index

Answer 3

Method 1: str.lower, sort & drop_duplicates

This works with many columns

subset = ['firstname', 'lastname']

df[subset] = df[subset].apply(lambda x: x.str.lower())
df.sort_values(subset + ['bank'], inplace=True)
df.drop_duplicates(subset, inplace=True)

  firstname lastname    email bank
1   bar bar      bar      Bar  abc
2   foo bar  foo bar  Foo Bar  xyz

Method 2: groupby, agg, first

Does not generalize easily to many columns

df.groupby([df['firstname'].str.lower(), df['lastname'].str.lower()], sort=False)\
  .agg({'email':'first','bank':'first'})\
  .reset_index()

  firstname lastname    email bank
0   foo bar  foo bar  Foo bar  xyz
1   bar bar      bar      Bar  abc
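
One detail of Method 2 worth seeing in isolation: the 'first' aggregation takes the first non-null value of each column separately, so a result row can combine values from different source rows. A small sketch on the three-row sample, with out as an illustrative name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'firstname': ['foo Bar', 'Bar Bar', 'Foo Bar'],
    'lastname': ['Foo Bar', 'Bar', 'Foo Bar'],
    'email': ['Foo bar', 'Bar', 'Foo Bar'],
    'bank': [np.nan, 'abc', 'xyz'],
})

# Group on lower-cased names; 'first' skips nulls independently for each column
out = (df.groupby([df['firstname'].str.lower(), df['lastname'].str.lower()], sort=False)
         .agg({'email': 'first', 'bank': 'first'})
         .reset_index())

# The "foo bar" group takes its email from row 0 and its bank from row 2
print(out)
```

If each kept record has to be one intact original row, the sort and drop_duplicates approach of Method 1 is the safer choice.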

Answer 4

Sort the values in descending order before dropping duplicates. This ensures that NaNs do not end up on top.
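
This answer gives no code. A minimal sketch of what it appears to describe, assuming the name columns are already normalized and using deduped as an illustrative name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'firstname': ['foo bar', 'bar bar', 'foo bar'],
    'lastname': ['foo bar', 'bar', 'foo bar'],
    'bank': [np.nan, 'abc', 'xyz'],
})

# Descending sort on bank; rows with NaN in bank end up at the bottom
deduped = (df.sort_values(by='bank', ascending=False)
             .drop_duplicates(subset=['firstname', 'lastname'], keep='first'))
print(deduped)
```

Note that sort_values defaults to na_position='last' in both sort directions, so an ascending sort would also leave the NaN rows at the bottom.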

4 answers

Answer 1
1 Accepted 2019-07-02 12:40:39

Answer 2
1 2019-07-02 12:48:00

Answer 3
1 2019-07-02 12:49:08

Answer 4
0 2019-07-02 12:43:58
