Python/Pandas: How to consolidate repeated rows with NaN in different columns?
There must be a better way to do this, please help me.
Here's an extract of some of the data I have to clean, which has several kinds of "duplicate" rows (the rows are not duplicated in full):
df =
LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
100 | ABC | Paid | NaN | 34200 |
100 | ABC | Paid | 724 | 34200 |
200 | DEF | Write Off | 611 | 9800 |
200 | DEF | Write Off | 611 | NaN |
300 | GHI | Paid | NaN | 247112 |
300 | GHI | Paid | 799 | NaN |
400 | JKL | Paid | NaN | NaN |
500 | MNO | Paid | 444 | NaN |
So the duplicate cases look like this: the same LoanID appears in two rows, and either row may have NaN where the other row has a value.
What I want is a dataframe without the duplicates, like:
LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
100 | ABC | Paid | 724 | 34200 |
200 | DEF | Write Off | 611 | 9800 |
300 | GHI | Paid | 799 | 247112 |
400 | JKL | Paid | NaN | NaN |
500 | MNO | Paid | 444 | NaN |
Here is how I have solved this so far:
# Get the repeated keys (LoanIDs that occur more than once):
rep = df['LoanID'].value_counts()
rep = rep[rep > 1]
# Now we get the valid number (we overwrite the NaNs)
for i in rep.keys():
    mask = df['LoanID'] == i
    df.loc[mask, 'CreditScore'] = df.loc[mask, 'CreditScore'].max()
    df.loc[mask, 'AnnualIncome'] = df.loc[mask, 'AnnualIncome'].max()
# Drop duplicates
df.drop_duplicates(inplace=True)
This works and does exactly what I need. The problem is that the dataframe has several hundred thousand records, so this method takes "forever". There must be some way to do it better, right?
Grouping by loan ID, filling in missing values both forward and backward within each group, and removing duplicates seems to work:
# ffill()/bfill() are the method forms of fillna(method='ffill') / fillna(method='bfill')
(df.groupby('LoanID')
   .apply(lambda x: x.ffill().bfill().drop_duplicates())
   .reset_index(drop=True)
   .set_index('LoanID'))
# CustomerID LoanStatus CreditScore AnnualIncome
#LoanID
#100 ABC Paid 724.0 34200.0
#200 DEF Write Off 611.0 9800.0
#300 GHI Paid 799.0 247112.0
#400 JKL Paid NaN NaN
#500 MNO Paid 444.0 NaN
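As a possible alternative to the per-group apply above, here is a minimal sketch that relies on `GroupBy.first()`, which returns the first non-null value of each column within each group. The sample frame is rebuilt from the extract in the question. It assumes that duplicate rows for one LoanID never hold conflicting non-null values. If they do, this keeps only the first one, whereas the fill-and-drop approach would keep both rows.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the extract in the question.
df = pd.DataFrame({
    "LoanID": [100, 100, 200, 200, 300, 300, 400, 500],
    "CustomerID": ["ABC", "ABC", "DEF", "DEF", "GHI", "GHI", "JKL", "MNO"],
    "LoanStatus": ["Paid", "Paid", "Write Off", "Write Off",
                   "Paid", "Paid", "Paid", "Paid"],
    "CreditScore": [np.nan, 724, 611, 611, np.nan, 799, np.nan, 444],
    "AnnualIncome": [34200, 34200, 9800, np.nan,
                     247112, np.nan, np.nan, np.nan],
})

# first() picks, per LoanID and per column, the first non-null value.
# A column stays NaN only if every row of that group is NaN there.
consolidated = df.groupby("LoanID").first()
print(consolidated)
```

Because this is a single built-in aggregation rather than a Python lambda called once per group, it is likely to scale better to several hundred thousand rows, though that is worth timing on the real data.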