
Loop or iterate through columns in a DataFrame to replace null values

I have a dataframe that is 762106 rows x 79 columns. It contains 14 'sets' of three columns; within each set, each column indicates a different level of 'intensity' for a given feature and is NaN wherever one of the set's other columns holds the value. They are already encoded, and I want to condense each set into a single column so that instead of 42 of these columns I have 14.

A subset can be recreated like this:

import pandas as pd
import numpy as np    
df = pd.DataFrame([[np.nan, 2, np.nan, 1, np.nan, np.nan, np.nan, np.nan, 3],
                    [1, np.nan, np.nan, np.nan, 2, np.nan, 1, np.nan, np.nan],
                    [np.nan, np.nan, 3, 1, np.nan, np.nan, np.nan, 2, np.nan]],
                   columns=['a','aa','aaa','b','bb','bbb','c','cc','ccc'])

Output:

    a       aa      aaa     b       bb      bbb     c       cc      ccc
0   NaN     2.0     NaN     1.0     NaN     NaN     NaN     NaN     3.0
1   1.0     NaN     NaN     NaN     2.0     NaN     1.0     NaN     NaN
2   NaN     NaN     3.0     1.0     NaN     NaN     NaN     2.0     NaN

I want them to look like this:

    a   b   c
0   2   1   3
1   1   2   1
2   3   1   2

My current solution is to pull values from aa, aaa, etc. with .fillna() and then drop the superfluous columns with .drop():

df['a'] = df['a'].fillna(df['aa']).fillna(df['aaa'])
df = df.drop(['aa', 'aaa'], axis=1)

df['b'] = df['b'].fillna(df['bb']).fillna(df['bbb'])
df = df.drop(['bb', 'bbb'], axis=1)

And this works, but I want to know if there is a more elegant way to accomplish this without copy pasting this code block 14 times.
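One way to avoid the copy-paste is to wrap the fillna/drop pattern in a loop over the column prefixes. A minimal sketch, assuming (as in the example) every set shares a single-letter prefix repeated two and three times:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([[np.nan, 2, np.nan, 1, np.nan, np.nan, np.nan, np.nan, 3],
                   [1, np.nan, np.nan, np.nan, 2, np.nan, 1, np.nan, np.nan],
                   [np.nan, np.nan, 3, 1, np.nan, np.nan, np.nan, 2, np.nan]],
                  columns=['a', 'aa', 'aaa', 'b', 'bb', 'bbb', 'c', 'cc', 'ccc'])

# One fillna/drop pass per set instead of 14 copy-pasted blocks;
# prefix * 2 gives 'aa', prefix * 3 gives 'aaa', and so on.
for prefix in ['a', 'b', 'c']:  # extend the list to all 14 prefixes
    df[prefix] = df[prefix].fillna(df[prefix * 2]).fillna(df[prefix * 3])
    df = df.drop([prefix * 2, prefix * 3], axis=1)
```

This keeps the original logic intact; the groupby answers below are more concise, but the loop makes the fallback order (base column first, then the doubled, then the tripled name) explicit.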

You can use pandas.DataFrame.groupby with axis = 1 ("columns"):

df.groupby(lambda x: x[0], axis=1).sum()
     a    b    c
0  2.0  1.0  3.0
1  1.0  2.0  1.0
2  3.0  1.0  2.0

If groupby is given a function, it is called on each value of the object's index, in this case the column names.

Since you can group by any function, it can be a really flexible solution.
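Note that passing axis=1 to groupby is deprecated in recent pandas releases, so this code may warn or fail there. Grouping the transpose and transposing back gives the same result on any version; a sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([[np.nan, 2, np.nan, 1, np.nan, np.nan, np.nan, np.nan, 3],
                   [1, np.nan, np.nan, np.nan, 2, np.nan, 1, np.nan, np.nan],
                   [np.nan, np.nan, 3, 1, np.nan, np.nan, np.nan, 2, np.nan]],
                  columns=['a', 'aa', 'aaa', 'b', 'bb', 'bbb', 'c', 'cc', 'ccc'])

# Equivalent to df.groupby(lambda x: x[0], axis=1).sum():
# transpose so the column names become the index, group on that
# index's first character, sum (which skips NaN), transpose back.
out = df.T.groupby(lambda x: x[0]).sum().T
```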

You can group by the first letter of the column names and take GroupBy.first:

df = df.groupby(df.columns.str[0], axis=1).first()
print (df)
     a    b    c
0  2.0  1.0  3.0
1  1.0  2.0  1.0
2  3.0  1.0  2.0
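The difference from sum is that first returns the first non-null value per group, so it won't silently add values together if two columns of a set happen to be non-null on the same row. A toy sketch (the overlapping row here is invented for illustration, and the transpose is used to stay compatible with pandas versions that deprecate axis=1):

```python
import pandas as pd
import numpy as np

# Row 1 deliberately has two non-null values in the 'a' set.
df = pd.DataFrame([[np.nan, 2.0, np.nan],
                   [1.0, np.nan, 5.0]],
                  columns=['a', 'aa', 'aaa'])

out_first = df.T.groupby(lambda c: c[0]).first().T  # keeps the first non-null
out_sum = df.T.groupby(lambda c: c[0]).sum().T      # adds the two values
```

If the NaN pattern really is mutually exclusive, as described in the question, both give the same answer; first is simply the safer choice if that assumption ever breaks.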
