简体   繁体   中英

Python pandas dataframe fillna with column in different dataframe

I want to fill Nan values in 'gvkey' column with another column values in different dataframe.

df
     wrds_id    isin_code   gvkey   gvkey_new
 0  1004    US0003611052    1004.0  1004.0
 1  1005    US1948302047    Nan     1000.0
 2  1006    US1948302047    Nan     1004.0
 3  1007    US0309541011    Nan     1004.0
 4  1007    US0003611052    1004.0  1004.0
 5  1008    IL0006046119    Nan     1004.0
 6  1008    US0003611052    1004.0  1004.0
 7  1009    US4448591028    Nan     1004.0
 8  1004    US4448591028    Nan     1004.0
 9  1004    US4448591028    Nan     1004.0
 10 1013    US0008861017    1013.0  1013.0
 11 1013    BE0003755692    Nan     1013.0
 12 1013    BE0003755692    Nan     1013.0

Using this frame, based on isin_code values, replace Nan value with gvkey_ciq_new values in first dataframe

df1
    isin_code   gvkey_ciq_new
0   US0309541011    1485.0
1   IL0006046119    2018.0
3   US1948302047    3176.0
4   US2376881064    3760.0
5   BE0003755692    5150.0
7   US4448591028    5776.0
9   GB0004544929    5898.0

Final dataframe I want to make,

finaldf
     wrds_id    isin_imp    gvkey   gvkey_new
 0  1004    US0003611052    1004.0  1004.0
 1  1005    US1948302047    3176.0  1004.0
 2  1006    US1948302047    3176.0  1004.0
 3  1007    US0309541011    1485.0  1004.0
 4  1007    US0003611052    1004.0  1004.0
 5  1008    IL0006046119    2018.0  1004.0
 6  1008    US0003611052    1004.0  1004.0
 7  1009    US4448591028    5776.0  1004.0
 8  1004    US4448591028    5776.0  1004.0
 9  1004    US4448591028    5776.0  1004.0
 10 1013    US0008861017    1013.0  1013.0
 11 1013    BE0003755692    5150.0  1013.0
 12 1013    BE0003755692    5150.0  1013.0

How can I use map function to make the dataframe?

Don't use any loops. Merge the dataframes and use numpy.where with pandas.inull

Set things up

from io import StringIO

import numpy
import pandas

d1 = StringIO("""\
     wrds_id    isin_code   gvkey   gvkey_new
 0  1004    US0003611052    1004.0  1004.0
 1  1005    US1948302047    Nan     1000.0
 2  1006    US1948302047    Nan     1004.0
 3  1007    US0309541011    Nan     1004.0
 4  1007    US0003611052    1004.0  1004.0
 5  1008    IL0006046119    Nan     1004.0
 6  1008    US0003611052    1004.0  1004.0
 7  1009    US4448591028    Nan     1004.0
 8  1004    US4448591028    Nan     1004.0
 9  1004    US4448591028    Nan     1004.0
 10 1013    US0008861017    1013.0  1013.0
 11 1013    BE0003755692    Nan     1013.0
 12 1013    BE0003755692    Nan     1013.0
 """)

d2 = StringIO("""\
    isin_code   gvkey_ciq_new
0   US0309541011    1485.0
1   IL0006046119    2018.0
3   US1948302047    3176.0
4   US2376881064    3760.0
5   BE0003755692    5150.0
7   US4448591028    5776.0
9   GB0004544929    5898.0
""")
df1 = pandas.read_table(d1, sep='\s+', na_values=['Nan'])
df2 = pandas.read_table(d2, sep='\s+', na_values=['Nan'])

Merge and compute the final column

result = (
    df1.merge(df2, on='isin_code', how='left')
       .assign(gvkey=lambda df: numpy.where(
           pandas.isnull(df['gvkey']),  # condition
           df['gvkey_ciq_new'],         # values if true
           df['gvkey']                  # values if false
       ))[df1.columns]                  # only keep original columns
)

print(result)

And that gives me:

    wrds_id     isin_code   gvkey  gvkey_new
0      1004  US0003611052  1004.0     1004.0
1      1005  US1948302047  3176.0     1000.0
2      1006  US1948302047  3176.0     1004.0
3      1007  US0309541011  1485.0     1004.0
4      1007  US0003611052  1004.0     1004.0
5      1008  IL0006046119  2018.0     1004.0
6      1008  US0003611052  1004.0     1004.0
7      1009  US4448591028  5776.0     1004.0
8      1004  US4448591028  5776.0     1004.0
9      1004  US4448591028  5776.0     1004.0
10     1013  US0008861017  1013.0     1013.0
11     1013  BE0003755692  5150.0     1013.0
12     1013  BE0003755692  5150.0     1013.0

First, make a temporary df without NaNs. Then you can use boolean indexing:

df_tmp = df[df.gvkey != pd.np.nan]
for code, gv in zip(df_tmp["isin_code"], df_tmp["gvkey"]):
    df1.at[df1.isin_code == code,"gvkey"] = gv

Maybe not the most elegant solution, but it should work.

EDIT : Or you add in the loop something like

if gv != pd.np.nan:

Then you don't need the df_tmp

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM