
Python pandas: Setting index value of dataframe to another dataframe as a column using multiple column conditions

I have two dataframes: data_df and geo_dimension_df.

I would like to take the index of geo_dimension_df, which I renamed to id, and make it a column on data_df called geo_id.

I'll be inserting both of these dataframes as tables into a database. The id columns will be their primary keys, while geo_id is a foreign key that will link data_df to geo_dimension_df.

As can be seen below, the name attached to a cbsa value can change over time (for example, Yuba City, CA -> Yuba City-Marysville, CA). Therefore, geo_dimension_df holds all the unique combinations of cbsa and name.

I need to compare the cbsa and name values on both dataframes and, where they match, set geo_dimension_df.id as the data_df.geo_id value.

I tried using merge for a bit, but got confused, so now I'm trying with apply and looking at it like an Excel vlookup across multiple column values, but having no luck. The following is my attempt, but it's a bit gibberish...

data_df['geo_id'] = data_df[['cbsa', 'name']]
                        .apply(
                        lambda x, y: 
                        geo_dimension_df
                            .index[geo_dimension_df[['cbsa', 'name]]
                            .to_list()
                        == [x,y])

Below are the two original dataframes followed by the desired result. Thank you.

geo_dimension_df:

       cbsa                               name
id                           
  1   10180                        Abilene, TX
  2   10420                          Akron, OH
  3   10500                         Albany, GA
  4   10540                         Albany, OR
  5   10540                 Albany-Lebanon, OR
                     ...
519   49620                   York-Hanover, PA
520   49660  Youngstown-Warren-Boardman, OH-PA
521   49700                      Yuba City, CA
522   49700           Yuba City-Marysville, CA
523   49740                           Yuma, AZ

data_df:

             cbsa         name  month  year units_total
        id                                             
        1   10180  Abilene, TX      1  2004          22
        2   10180  Abilene, TX      2  2004          12
        3   10180  Abilene, TX      3  2004          44
        4   10180  Abilene, TX      4  2004          32
        5   10180  Abilene, TX      5  2004          21
                                 ...
    67145   49740  Yuma, AZ        12  2018          68
    67146   49740  Yuma, AZ         1  2019          86
    67147   49740  Yuma, AZ         2  2019          99
    67148   49740  Yuma, AZ         3  2019          99
    67149   49740  Yuma, AZ         4  2019          94

Desired Outcome:
data_df (with geo_id foreign key column added):

             cbsa         name  month  year units_total geo_id
        id                                             
        1   10180  Abilene, TX      1  2004          22      1
        2   10180  Abilene, TX      2  2004          12      1
        3   10180  Abilene, TX      3  2004          44      1
        4   10180  Abilene, TX      4  2004          32      1
        5   10180  Abilene, TX      5  2004          21      1
                                 ...
    67145   49740  Yuma, AZ        12  2018          68    523
    67146   49740  Yuma, AZ         1  2019          86    523
    67147   49740  Yuma, AZ         2  2019          99    523
    67148   49740  Yuma, AZ         3  2019          99    523
    67149   49740  Yuma, AZ         4  2019          94    523

Note: I'll be dropping cbsa and name from data_df after this, in case anybody is curious as to why I'm duplicating data.

First, because the index is not a proper column, make it a column so that it can be used in a later merge:

geo_dimension_df['geo_id'] = geo_dimension_df.index

Next, left-join data_df to geo_dimension_df on cbsa and name:

# Select the lookup columns with a list (double brackets); a bare tuple key raises KeyError.
data_df = pd.merge(data_df,
                   geo_dimension_df[['cbsa', 'name', 'geo_id']],
                   on=['cbsa', 'name'],
                   how='left')

Finally, drop the column you added to the geo_dimension_df at the start:

geo_dimension_df.drop('geo_id', axis=1, inplace=True)
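One thing to watch for: merging on columns returns a fresh 0..n-1 index, so data_df's own id index is not carried through by the merge above. Below is a minimal sketch of a variant that keeps it, assuming both indexes are named id as in the question. The toy frames, the units_total values, and the lookup name are illustrative, not taken from the original data. It also avoids adding and then dropping a temporary column on geo_dimension_df.

```python
import pandas as pd

# Toy stand-ins for the two frames in the question (values are illustrative).
geo_dimension_df = pd.DataFrame(
    {"cbsa": [10180, 49700, 49700],
     "name": ["Abilene, TX", "Yuba City, CA", "Yuba City-Marysville, CA"]},
    index=pd.Index([1, 521, 522], name="id"),
)
data_df = pd.DataFrame(
    {"cbsa": [10180, 49700, 49700],
     "name": ["Abilene, TX", "Yuba City, CA", "Yuba City-Marysville, CA"],
     "units_total": [22, 5, 7]},
    index=pd.Index([1, 2, 3], name="id"),
)

# Copy the dimension table with its index exposed as a regular geo_id column.
# geo_dimension_df itself is left untouched.
lookup = geo_dimension_df.reset_index().rename(columns={"id": "geo_id"})

# Move data_df's index into a column before merging and restore it afterwards,
# because merge() on columns does not keep the left frame's index.
# validate="many_to_one" raises if (cbsa, name) is not unique in lookup.
data_df = (
    data_df.reset_index()
    .merge(lookup, on=["cbsa", "name"], how="left", validate="many_to_one")
    .set_index("id")
)
print(data_df)
```

The validate argument is optional, but it fails fast if the dimension table ever contains duplicate (cbsa, name) pairs, which would otherwise silently duplicate rows in data_df.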

After doing this, geo_dimension_df's index column, id, will now appear on data_df under the column geo_id:

data_df:

         cbsa         name  month  year units_total geo_id
    id                                             
    1   10180  Abilene, TX      1  2004          22      1
    2   10180  Abilene, TX      2  2004          12      1
    3   10180  Abilene, TX      3  2004          44      1
    4   10180  Abilene, TX      4  2004          32      1
    5   10180  Abilene, TX      5  2004          21      1
                             ...
67145   49740  Yuma, AZ        12  2018          68    523
67146   49740  Yuma, AZ         1  2019          86    523
67147   49740  Yuma, AZ         2  2019          99    523
67148   49740  Yuma, AZ         3  2019          99    523
67149   49740  Yuma, AZ         4  2019          94    523
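Since geo_id is meant to be a foreign key, it may be worth confirming that every row found a match before dropping cbsa and name. The following is a rough sketch of such a check using the indicator option of merge. The toy data, including the deliberately unmatched "Nowhere, ZZ" row, and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Illustrative toy frames; the third data row has no match in the dimension table.
geo_dimension_df = pd.DataFrame(
    {"cbsa": [10180, 10420], "name": ["Abilene, TX", "Akron, OH"]},
    index=pd.Index([1, 2], name="id"),
)
data_df = pd.DataFrame(
    {"cbsa": [10180, 10420, 99999],
     "name": ["Abilene, TX", "Akron, OH", "Nowhere, ZZ"],
     "units_total": [22, 12, 3]},
    index=pd.Index([1, 2, 3], name="id"),
)

lookup = geo_dimension_df.reset_index().rename(columns={"id": "geo_id"})

# indicator=True adds a _merge column saying where each row was found.
merged = (
    data_df.reset_index()
    .merge(lookup, on=["cbsa", "name"], how="left", indicator=True)
    .set_index("id")
)

# Rows marked left_only had no matching (cbsa, name) pair, so geo_id is missing.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["cbsa", "name"]])

# Keep only matched rows, then drop the helper and the now-redundant columns.
result = (
    merged[merged["_merge"] == "both"]
    .drop(columns=["_merge", "cbsa", "name"])
    .astype({"geo_id": "int64"})  # unmatched rows had made the column float
)
print(result)
```

Whether unmatched rows should be dropped, as here, or fixed upstream depends on the data; the sketch only shows how to find them.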

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint one, please cite the site URL or the original address.