
Python/Pandas: How to consolidate repeated rows with NaN in different columns?

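There must be a better way to do this; please help.

Here is an excerpt of some data I have to clean. It contains several kinds of "repeated" rows (not every row is repeated):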

df =

LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
   100 | ABC        | Paid       |         NaN |        34200 |
   100 | ABC        | Paid       |         724 |        34200 |
   200 | DEF        | Write Off  |         611 |         9800 |
   200 | DEF        | Write Off  |         611 |          NaN |
   300 | GHI        | Paid       |         NaN |       247112 |
   300 | GHI        | Paid       |         799 |          NaN |
   400 | JKL        | Paid       |         NaN |          NaN |
   500 | MNO        | Paid       |         444 |          NaN |
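
To reproduce the problem, the excerpt above could be built roughly as follows. This is only a sketch: the columns hidden behind the "..." are left out, and the dtypes are whatever pandas infers, which is an assumption about the real data.

```python
import numpy as np
import pandas as pd

# Rows copied from the excerpt above; np.nan marks a missing value.
df = pd.DataFrame({
    "LoanID": [100, 100, 200, 200, 300, 300, 400, 500],
    "CustomerID": ["ABC", "ABC", "DEF", "DEF", "GHI", "GHI", "JKL", "MNO"],
    "LoanStatus": ["Paid", "Paid", "Write Off", "Write Off",
                   "Paid", "Paid", "Paid", "Paid"],
    "CreditScore": [np.nan, 724, 611, 611, np.nan, 799, np.nan, 444],
    "AnnualIncome": [34200, 34200, 9800, np.nan,
                     247112, np.nan, np.nan, np.nan],
})
```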

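So I have the following kinds of repetition:

  1. NaN and a valid value in the CreditScore column (LoanID = 100)
  2. NaN and a valid value in the AnnualIncome column (LoanID = 200)
  3. NaN and a valid value in the CreditScore column, plus NaN and a valid value in the AnnualIncome column (LoanID = 300)
  4. LoanID 400 and 500 are "normal" cases

What I want, obviously, is a DataFrame without the duplicates, like this: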

LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
   100 | ABC        | Paid       |         724 |        34200 |
   200 | DEF        | Write Off  |         611 |         9800 |
   300 | GHI        | Paid       |         799 |       247112 |
   400 | JKL        | Paid       |         NaN |          NaN |
   500 | MNO        | Paid       |         444 |          NaN |

Here is how I solved it:

# Get the repeated keys (LoanIDs that occur more than once):
rep = df['LoanID'].value_counts()
rep = rep[rep > 1]

# Now we get the valid number (we overwrite the NaNs);
# max() skips NaN, so it returns the non-missing value of each group
for i in rep.keys():
    df.loc[df['LoanID'] == i, 'CreditScore']  = df[df['LoanID'] == i]['CreditScore'].max()
    df.loc[df['LoanID'] == i, 'AnnualIncome'] = df[df['LoanID'] == i]['AnnualIncome'].max()

# Drop duplicates
df.drop_duplicates(inplace=True)

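This works and does exactly what I need. The problem is that the DataFrame has several hundred thousand records, so this approach takes "forever". There must be a better way to do this, right?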

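Grouping by LoanID, filling the missing values forward and backward within each group, and then dropping the duplicates seems to work: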

# ffill()/bfill() replace the deprecated fillna(method='ffill'/'bfill')
df.groupby('LoanID').apply(lambda x: \
                             x.ffill().\
                             bfill().\
                             drop_duplicates()).\
                     reset_index(drop=True).\
                     set_index('LoanID')
#       CustomerID LoanStatus  CreditScore  AnnualIncome  
#LoanID                                                             
#100           ABC       Paid        724.0       34200.0       
#200           DEF  Write Off        611.0        9800.0       
#300           GHI       Paid        799.0      247112.0       
#400           JKL       Paid          NaN           NaN       
#500           MNO       Paid        444.0           NaN       
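
Since the data has several hundred thousand rows, a per-group `apply` may still be slow. A possible alternative, sketched here on the sample rows only, is `GroupBy.first()`, which takes the first non-null value of each column within a group. It is not the method used above, and it assumes that the non-null values within one LoanID never conflict: if they do, `first()` silently keeps the earliest one, whereas the loop in the question keeps the largest.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; np.nan marks a missing value.
df = pd.DataFrame({
    "LoanID": [100, 100, 200, 200, 300, 300, 400, 500],
    "CustomerID": ["ABC", "ABC", "DEF", "DEF", "GHI", "GHI", "JKL", "MNO"],
    "LoanStatus": ["Paid", "Paid", "Write Off", "Write Off",
                   "Paid", "Paid", "Paid", "Paid"],
    "CreditScore": [np.nan, 724, 611, 611, np.nan, 799, np.nan, 444],
    "AnnualIncome": [34200, 34200, 9800, np.nan,
                     247112, np.nan, np.nan, np.nan],
})

# first() picks, column by column, the first non-null value in each group,
# so partially filled duplicate rows collapse into one row per LoanID.
# Groups with no valid value at all (LoanID 400) keep NaN.
result = df.groupby("LoanID", as_index=False).first()
print(result)
```

With `as_index=False`, LoanID stays a regular column; call `result.set_index("LoanID")` afterwards to get the same layout as the output above.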
