
Python/Pandas: How to consolidate repeated rows with NaN in different columns?

There must be a better way to do this; please help me.

Here's an extract of the data I have to clean. It contains several kinds of "duplicate" rows, where only part of the row is duplicated:

df =

LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
   100 | ABC        | Paid       |         NaN |        34200 |
   100 | ABC        | Paid       |         724 |        34200 |
   200 | DEF        | Write Off  |         611 |         9800 |
   200 | DEF        | Write Off  |         611 |          NaN |
   300 | GHI        | Paid       |         NaN |       247112 |
   300 | GHI        | Paid       |         799 |          NaN |
   400 | JKL        | Paid       |         NaN |          NaN |
   500 | MNO        | Paid       |         444 |          NaN |
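
For anyone who wants to run the snippets, here is a minimal sketch that rebuilds the extract above. It assumes only the five columns shown (the elided "..." columns are left out) and that the missing values are stored as np.nan:

```python
import numpy as np
import pandas as pd

# Rebuild the extract; only the five visible columns are included.
df = pd.DataFrame({
    "LoanID": [100, 100, 200, 200, 300, 300, 400, 500],
    "CustomerID": ["ABC", "ABC", "DEF", "DEF", "GHI", "GHI", "JKL", "MNO"],
    "LoanStatus": ["Paid", "Paid", "Write Off", "Write Off",
                   "Paid", "Paid", "Paid", "Paid"],
    "CreditScore": [np.nan, 724, 611, 611, np.nan, 799, np.nan, 444],
    "AnnualIncome": [34200, 34200, 9800, np.nan,
                     247112, np.nan, np.nan, np.nan],
})

print(df)
```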

So I have the following types of duplicate cases:

  1. A NaN and a valid value in column CreditScore (LoanID = 100)
  2. A NaN and a valid value in column AnnualIncome (LoanID = 200)
  3. A NaN and a valid value in column CreditScore AND a NaN and a valid value in column AnnualIncome (LoanID = 300)
  4. LoanID 400 and 500 are "normal" cases

What I want is a dataframe without the duplicates, like this:

LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
   100 | ABC        | Paid       |         724 |        34200 |
   200 | DEF        | Write Off  |         611 |         9800 |
   300 | GHI        | Paid       |         799 |       247112 |
   400 | JKL        | Paid       |         NaN |          NaN |
   500 | MNO        | Paid       |         444 |          NaN |

Here is how I have solved it so far:

# Get the repeated keys (LoanIDs that occur more than once):
rep = df['LoanID'].value_counts()
rep = rep[rep > 1]

# Now we get the valid number (we overwrite the NaNs)
for i in rep.keys():
    df.loc[df['LoanID'] == i, 'CreditScore']  = df[df['LoanID'] == i]['CreditScore'].max()
    df.loc[df['LoanID'] == i, 'AnnualIncome'] = df[df['LoanID'] == i]['AnnualIncome'].max()

# Drop duplicates   
df.drop_duplicates(inplace=True)

This works and does exactly what I need. The problem is that the dataframe has several hundred thousand records, so this method takes "forever". There must be a better way to do it, right?
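
The per-key loop can probably be expressed as a single grouped transform, which applies the same "overwrite with the group maximum" idea without iterating in Python. This is only a sketch on the sample extract, not a measured benchmark; the names `cols` and `out` are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "LoanID": [100, 100, 200, 200, 300, 300, 400, 500],
    "CustomerID": ["ABC", "ABC", "DEF", "DEF", "GHI", "GHI", "JKL", "MNO"],
    "LoanStatus": ["Paid", "Paid", "Write Off", "Write Off",
                   "Paid", "Paid", "Paid", "Paid"],
    "CreditScore": [np.nan, 724, 611, 611, np.nan, 799, np.nan, 444],
    "AnnualIncome": [34200, 34200, 9800, np.nan,
                     247112, np.nan, np.nan, np.nan],
})

# Columns whose NaNs should be overwritten (hypothetical name).
cols = ["CreditScore", "AnnualIncome"]

out = df.copy()
# transform("max") returns, for every row, the maximum of its LoanID group;
# NaNs are skipped, and a group with no valid value stays NaN.
out[cols] = out.groupby("LoanID")[cols].transform("max")
# Rows of the same loan are now identical, so the duplicates can be dropped.
out = out.drop_duplicates().reset_index(drop=True)

print(out)
```

Like the loop, this assumes that the maximum is the right value to keep when a group holds more than one valid number.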

Grouping by loan ID, filling in missing values both forward and backward within each group, and removing duplicates seems to work:

# Within each LoanID group, fill NaNs from the rows above and below,
# then drop the rows that have become identical.
(df.groupby('LoanID')
   .apply(lambda x: x.ffill().bfill().drop_duplicates())
   .reset_index(drop=True)
   .set_index('LoanID'))
#       CustomerID LoanStatus  CreditScore  AnnualIncome
#LoanID
#100           ABC       Paid        724.0       34200.0
#200           DEF  Write Off        611.0        9800.0
#300           GHI       Paid        799.0      247112.0
#400           JKL       Paid          NaN           NaN
#500           MNO       Paid        444.0           NaN
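
As a possible alternative to the apply above, GroupBy.first() returns the first non-null entry of each column per group, so it may give the same result without calling a Python lambda for every group. This is a sketch on the sample extract only; `result` is a name chosen for illustration, and no timing has been measured on the real data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "LoanID": [100, 100, 200, 200, 300, 300, 400, 500],
    "CustomerID": ["ABC", "ABC", "DEF", "DEF", "GHI", "GHI", "JKL", "MNO"],
    "LoanStatus": ["Paid", "Paid", "Write Off", "Write Off",
                   "Paid", "Paid", "Paid", "Paid"],
    "CreditScore": [np.nan, 724, 611, 611, np.nan, 799, np.nan, 444],
    "AnnualIncome": [34200, 34200, 9800, np.nan,
                     247112, np.nan, np.nan, np.nan],
})

# first() skips NaN column by column, so each LoanID collapses to one row
# holding the first valid value found in every column (NaN if there is none).
result = df.groupby("LoanID", as_index=False).first()

print(result)
```

One difference to keep in mind: if a loan has two rows with different valid values in the same column, first() silently keeps the first one, whereas the fill-and-drop_duplicates approach leaves both rows in place.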
