
How to map values from multiple columns using fillna() to fill 'nan' values after merging two tables together in pandas?

I have two dataframes about building property assessments. One holds several columns of financial information; the other holds location information for the same buildings. The two dataframes do not have the same number of rows or columns (the financial dataframe has over 60,000 rows, while the location dataframe has just under 50,000). Because the financial dataframe has the longer index, merging the two leaves 'nan' values in the location columns, and I would like to fill those with the correct values mapped from the location dataframe. That might be confusing, so I will draw it out.

fin_df:                                        loc_df:

BldgID  | Assmnt Phase | Funding Amt           BldgID  | State |   City   
-------------------------------------          --------------------------
  1         Phase 1       $$$$$$$$                1       CO      Denver
  2         Phase 1       $$$$$$$$                2       MN      St. Paul
  2         Phase 2       $$$$$$$$                3       NV      Reno 
  3         Phase 1       $$$$$$$$                4       FL      Miami 
  3         Phase 2       $$$$$$$$ 
  4         Phase 2       $$$$$$$$
  4         Phase 3       $$$$$$$$

As you can see, some building IDs repeat in the financial dataframe because assessments happen in different phases. The actual dataframe is on a much larger scale. The location dataframe shows the location information for each building ID.

To start off, when I merged the two dataframes together, I made sure to take only the columns from the location dataframe that were not in the financial dataframe like so:

use_cols = loc_df.columns.difference(fin_df.columns)

Then I joined the two dataframes based on their indexes for a clean merge (both dataframes are sorted the same way):

test_mrg = pd.merge(fin_df, loc_df[use_cols], how='left', left_index=True, right_index=True)

When looking at the dataframe, the merging looks good until I reach the point where the location dataframe's index ends. I did a left join because I want to preserve the rows in the left dataframe (financial dataframe) and match what is available on the right dataframe (location dataframe).

Merged dataframe:

 BldgID  | Assmnt Phase | Funding Amt | State |  City           
--------------------------------------------------------          
  1         Phase 1       $$$$$$$$       CO     Denver       
  2         Phase 1       $$$$$$$$       MN     St. Paul       
  2         Phase 2       $$$$$$$$       MN     St. Paul        
  3         Phase 1       $$$$$$$$       NV     Reno       
  3         Phase 2       $$$$$$$$       nan    nan
  4         Phase 2       $$$$$$$$       nan    nan
  4         Phase 3       $$$$$$$$       nan    nan

I know fillna() is a powerful method for filling in nan values. I want to replace the nans with the location information that corresponds to each row's building ID.

I tried doing it this way at first:

#store column information in a variable
x = loc_df[use_cols] 

#merge dataframes 
#add 'x' as argument for value parameter in fillna along with iloc to access rows
test_mrg_2 = pd.merge(fin_df, loc_df[use_cols], how='left', left_index=True, right_index=True).fillna(value=x.iloc[0])

Unfortunately, this is not filling the nan values with the correct information. Is there a way to map the correct values to replace the missing nan values with the correct location information?

Edit -- Adding what I would like:

 BldgID  | Assmnt Phase | Funding Amt | State |  City           
--------------------------------------------------------          
  1         Phase 1       $$$$$$$$       CO     Denver       
  2         Phase 1       $$$$$$$$       MN     St. Paul       
  2         Phase 2       $$$$$$$$       MN     St. Paul        
  3         Phase 1       $$$$$$$$       NV     Reno       
  3         Phase 2       $$$$$$$$       NV     Reno
  4         Phase 2       $$$$$$$$       FL     Miami
  4         Phase 3       $$$$$$$$       FL     Miami

The nan values should be replaced with the correct location information.
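One possible way to do this kind of fill, as a minimal sketch: look up each location column by BldgID with Series.map and pass the result to fillna. The sample values and the name lookup are illustrative assumptions, and the sketch assumes BldgID is unique in loc_df (map raises an error on a lookup Series with a duplicated index).

```python
import numpy as np
import pandas as pd

# Sample data shaped like the tables above (values are illustrative).
fin_df = pd.DataFrame({
    "BldgID": [1, 2, 2, 3, 3, 4, 4],
    "Assmnt Phase": ["Phase 1", "Phase 1", "Phase 2", "Phase 1",
                     "Phase 2", "Phase 2", "Phase 3"],
})
loc_df = pd.DataFrame({
    "BldgID": [1, 2, 3, 4],
    "State": ["CO", "MN", "NV", "FL"],
    "City": ["Denver", "St. Paul", "Reno", "Miami"],
})

# A merged frame with gaps, like the one shown in the question.
merged = fin_df.copy()
merged["State"] = ["CO", "MN", "MN", "NV", np.nan, np.nan, np.nan]
merged["City"] = ["Denver", "St. Paul", "St. Paul", "Reno",
                  np.nan, np.nan, np.nan]

# Index the location table by BldgID so each column can act as a lookup.
lookup = loc_df.set_index("BldgID")

# For each location column, fill only the missing cells with the value
# that belongs to the row's BldgID.
for col in ["State", "City"]:
    merged[col] = merged[col].fillna(merged["BldgID"].map(lookup[col]))

print(merged)
```

Only cells that are already nan are touched, so values that were matched correctly during the merge stay as they are.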

If I understand correctly what you're trying to do, you're going the long way around to get there.

With pandas.merge, you can merge on a specific column instead of on the index. So, given your two DataFrames as shown, you could do:

pd.merge(fin_df, loc_df, on='BldgID')

Which results in:

   BldgID Assmnt Phase Funding Amt State     City
0       1      Phase 1   $$$$$$$$    CO   Denver
1       2      Phase 1   $$$$$$$$    MN  St.Paul
2       2      Phase 2   $$$$$$$$    MN  St.Paul
3       3      Phase 1   $$$$$$$$    NV     Reno
4       3      Phase 2   $$$$$$$$    NV     Reno
5       4      Phase 2   $$$$$$$$    FL    Miami
6       4      Phase 3   $$$$$$$$    FL    Miami
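Note that pd.merge defaults to an inner join, which drops financial rows whose BldgID has no match in loc_df. Here is a minimal sketch of a variant that keeps every financial row, assuming sample data like the tables above (the funding amounts are made up for illustration). The validate argument is an optional check that BldgID is unique in loc_df.

```python
import pandas as pd

# Sample data shaped like the tables above; funding amounts are hypothetical.
fin_df = pd.DataFrame({
    "BldgID": [1, 2, 2, 3, 3, 4, 4],
    "Assmnt Phase": ["Phase 1", "Phase 1", "Phase 2", "Phase 1",
                     "Phase 2", "Phase 2", "Phase 3"],
    "Funding Amt": [100, 200, 250, 300, 350, 400, 450],
})
loc_df = pd.DataFrame({
    "BldgID": [1, 2, 3, 4],
    "State": ["CO", "MN", "NV", "FL"],
    "City": ["Denver", "St. Paul", "Reno", "Miami"],
})

# how="left" keeps every row of fin_df, matched or not.
# validate="many_to_one" raises an error if BldgID repeats in loc_df,
# which would otherwise silently duplicate financial rows.
merged = pd.merge(fin_df, loc_df, how="left", on="BldgID",
                  validate="many_to_one")

print(merged)
```

Because the match is made on BldgID rather than on row position, it does not matter that the two frames have different lengths.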

Try an outer join instead. A full outer join returns all the rows from the left dataframe and all the rows from the right dataframe, with nan wherever one side has no match:

result = pd.merge(fin_df, loc_df, how='outer', on='BldgID')



   BldgID Assmnt Phase Funding Amt State     City
0       1      Phase 1   $$$$$$$$    CO   Denver
1       2      Phase 1   $$$$$$$$    MN  St.Paul
2       2      Phase 2   $$$$$$$$    MN  St.Paul
3       3      Phase 1   $$$$$$$$    NV     Reno
4       3      Phase 2   $$$$$$$$    NV     Reno
5       4      Phase 2   $$$$$$$$    FL    Miami
6       4      Phase 3   $$$$$$$$    FL    Miami
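With the sample tables the outer join gives the same rows as a left join, because every building in loc_df also appears in fin_df. The difference shows up only when loc_df has buildings with no financial rows. The sketch below adds a hypothetical building 5 (not in the original data) to illustrate this, and uses indicator=True to label where each row came from.

```python
import pandas as pd

fin_df = pd.DataFrame({
    "BldgID": [1, 2, 2, 3, 3, 4, 4],
    "Assmnt Phase": ["Phase 1", "Phase 1", "Phase 2", "Phase 1",
                     "Phase 2", "Phase 2", "Phase 3"],
})
# Building 5 is a made-up row with no matching financial record.
loc_df = pd.DataFrame({
    "BldgID": [1, 2, 3, 4, 5],
    "State": ["CO", "MN", "NV", "FL", "TX"],
    "City": ["Denver", "St. Paul", "Reno", "Miami", "Austin"],
})

# indicator=True adds a "_merge" column with the values
# "both", "left_only", or "right_only" for each row.
result = pd.merge(fin_df, loc_df, how="outer", on="BldgID", indicator=True)

# Rows that exist only in loc_df; a left join would not include them.
right_only = result[result["_merge"] == "right_only"]

print(result)
print(right_only)
```

If the goal is only to preserve the financial rows, how='left' avoids these extra location-only rows.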
