Pandas: How to groupby a dataframe and convert the rows to columns and consolidate the rows
Python/Pandas: How to consolidate repeated rows with NaN in different columns?
There must be a better way to do this; please help.
Here is an excerpt of some data I have to clean, which contains several kinds of "duplicate" rows (not all rows are duplicated):
df =
LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
100 | ABC | Paid | NaN | 34200 |
100 | ABC | Paid | 724 | 34200 |
200 | DEF | Write Off | 611 | 9800 |
200 | DEF | Write Off | 611 | NaN |
300 | GHI | Paid | NaN | 247112 |
300 | GHI | Paid | 799 | NaN |
400 | JKL | Paid | NaN | NaN |
500 | MNO | Paid | 444 | NaN |
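For reproducibility, the sample above can be reconstructed as a DataFrame (column names taken from the table; this construction is my own, not part of the original post):

```python
import numpy as np
import pandas as pd

# Rebuild the sample data shown in the table above
df = pd.DataFrame({
    'LoanID':       [100, 100, 200, 200, 300, 300, 400, 500],
    'CustomerID':   ['ABC', 'ABC', 'DEF', 'DEF', 'GHI', 'GHI', 'JKL', 'MNO'],
    'LoanStatus':   ['Paid', 'Paid', 'Write Off', 'Write Off',
                     'Paid', 'Paid', 'Paid', 'Paid'],
    'CreditScore':  [np.nan, 724, 611, 611, np.nan, 799, np.nan, 444],
    'AnnualIncome': [34200, 34200, 9800, np.nan, 247112, np.nan,
                     np.nan, np.nan],
})
```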
So I have several kinds of duplication here. What I want, obviously, is a dataframe without duplicates, for example:
LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
100 | ABC | Paid | 724 | 34200 |
200 | DEF | Write Off | 611 | 9800 |
300 | GHI | Paid | 799 | 247112 |
400 | JKL | Paid | NaN | NaN |
500 | MNO | Paid | 444 | NaN |
So, this is how I solved it:
# Get the keys that appear more than once:
rep = df['LoanID'].value_counts()
rep = rep[rep > 1]
# For each repeated key, overwrite the NaNs with the valid value:
for i in rep.keys():
    df.loc[df['LoanID'] == i, 'CreditScore'] = df[df['LoanID'] == i]['CreditScore'].max()
    df.loc[df['LoanID'] == i, 'AnnualIncome'] = df[df['LoanID'] == i]['AnnualIncome'].max()
# Drop the now-identical duplicate rows
df.drop_duplicates(inplace=True)
This works and does exactly what I need. The problem is that the dataframe has several hundred thousand records, so this approach takes "forever". There must be a better way to do this, right?
Grouping by LoanID, filling the missing values forward and backward, and then dropping duplicates seems to do it:
df.groupby('LoanID').apply(lambda x: x.\
    fillna(method='ffill').\
    fillna(method='bfill').\
    drop_duplicates()).\
    reset_index(drop=True).\
    set_index('LoanID')
# CustomerID LoanStatus CreditScore AnnualIncome
#LoanID
#100 ABC Paid 724.0 34200.0
#200 DEF Write Off 611.0 9800.0
#300 GHI Paid 799.0 247112.0
#400 JKL Paid NaN NaN
#500 MNO Paid 444.0 NaN
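A possibly even faster alternative (my own sketch, not from the original thread): `GroupBy.first()` returns the first non-null value per column within each group, so it collapses the duplicate rows in a single vectorized pass without any `apply`:

```python
import numpy as np
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    'LoanID':       [100, 100, 200, 200, 300, 300, 400, 500],
    'CustomerID':   ['ABC', 'ABC', 'DEF', 'DEF', 'GHI', 'GHI', 'JKL', 'MNO'],
    'LoanStatus':   ['Paid', 'Paid', 'Write Off', 'Write Off',
                     'Paid', 'Paid', 'Paid', 'Paid'],
    'CreditScore':  [np.nan, 724, 611, 611, np.nan, 799, np.nan, 444],
    'AnnualIncome': [34200, 34200, 9800, np.nan, 247112, np.nan,
                     np.nan, np.nan],
})

# first() skips NaN, so each group keeps its first valid value per column;
# groups that are all-NaN in a column (e.g. LoanID 400) stay NaN.
out = df.groupby('LoanID', as_index=False).first()
```

Because the heavy lifting happens inside a single grouped aggregation rather than a Python-level loop or per-group lambda, this tends to scale much better to hundreds of thousands of rows.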