I have two dataframes: `data_df` and `geo_dimension_df`. I would like to take the index of `geo_dimension_df`, which I renamed to `id`, and make it a column on `data_df` called `geo_id`.

I'll be inserting both of these dataframes as tables into a database. The `id` columns will be their primary keys, while `geo_id` is a foreign key linking `data_df` to `geo_dimension_df`.
As can be seen, the `cbsa` and `name` values can change over time (e.g. Yuba City, CA -> Yuba City-Marysville, CA). Therefore, `geo_dimension_df` holds all the unique combinations of `cbsa` and `name`.
I need to compare the `cbsa` and `name` values on both dataframes and, where they match, set `geo_dimension_df.id` as the `data_df.geo_id` value.

I tried using `merge` for a bit but got confused, so now I'm trying `apply`, treating it like an Excel VLOOKUP across multiple column values, but having no luck. The following is my attempt, but it's a bit gibberish...
```python
data_df['geo_id'] = data_df[['cbsa', 'name']].apply(
    lambda x, y:
        geo_dimension_df.index[
            geo_dimension_df[['cbsa', 'name']].to_list() == [x, y]])
```
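For what it's worth, the VLOOKUP intuition can be made to work with a plain dict keyed on `(cbsa, name)` pairs. A minimal sketch on toy stand-ins for the two frames (all values assumed for illustration):

```python
import pandas as pd

# Toy stand-ins for the real frames (values assumed for illustration)
geo_dimension_df = pd.DataFrame(
    {'cbsa': [10180, 49700, 49700],
     'name': ['Abilene, TX', 'Yuba City, CA', 'Yuba City-Marysville, CA']},
    index=pd.Index([1, 521, 522], name='id'))
data_df = pd.DataFrame(
    {'cbsa': [10180, 49700],
     'name': ['Abilene, TX', 'Yuba City, CA'],
     'month': [1, 12], 'year': [2004, 2018], 'units_total': [22, 68]},
    index=pd.Index([1, 2], name='id'))

# Build a (cbsa, name) -> id lookup; itertuples yields (id, cbsa, name)
lookup = {(c, n): i for i, c, n in geo_dimension_df.itertuples()}

# Map each data_df row through the lookup, like a two-column VLOOKUP
data_df['geo_id'] = [lookup[(c, n)]
                     for c, n in zip(data_df['cbsa'], data_df['name'])]
```

That said, a single `merge` avoids the Python-level loop and is the more idiomatic pandas approach.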
Below are the two original dataframes followed by the desired result. Thank you.
geo_dimension_df:
cbsa name
id
1 10180 Abilene, TX
2 10420 Akron, OH
3 10500 Albany, GA
4 10540 Albany, OR
5 10540 Albany-Lebanon, OR
...
519 49620 York-Hanover, PA
520 49660 Youngstown-Warren-Boardman, OH-PA
521 49700 Yuba City, CA
522 49700 Yuba City-Marysville, CA
523 49740 Yuma, AZ
data_df:
cbsa name month year units_total
id
1 10180 Abilene, TX 1 2004 22
2 10180 Abilene, TX 2 2004 12
3 10180 Abilene, TX 3 2004 44
4 10180 Abilene, TX 4 2004 32
5 10180 Abilene, TX 5 2004 21
...
67145 49740 Yuma, AZ 12 2018 68
67146 49740 Yuma, AZ 1 2019 86
67147 49740 Yuma, AZ 2 2019 99
67148 49740 Yuma, AZ 3 2019 99
67149 49740 Yuma, AZ 4 2019 94
Desired Outcome:
data_df (with geo_id foreign key column added):
cbsa name month year units_total geo_id
id
1 10180 Abilene, TX 1 2004 22 1
2 10180 Abilene, TX 2 2004 12 1
3 10180 Abilene, TX 3 2004 44 1
4 10180 Abilene, TX 4 2004 32 1
5 10180 Abilene, TX 5 2004 21 1
...
67145 49740 Yuma, AZ 12 2018 68 523
67146 49740 Yuma, AZ 1 2019 86 523
67147 49740 Yuma, AZ 2 2019 99 523
67148 49740 Yuma, AZ 3 2019 99 523
67149 49740 Yuma, AZ 4 2019 94 523
Note: I'll be dropping `cbsa` and `name` from `data_df` after this, in case anybody is curious as to why I'm duplicating data.
First, because the index is not a proper column, make it a column so that it can be used in a later `merge`:

```python
geo_dimension_df['geo_id'] = geo_dimension_df.index
```
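Alternatively, `reset_index` can produce the same helper column on a copy, so the original frame is never mutated and nothing has to be dropped afterwards. A small sketch on toy data (values assumed for illustration):

```python
import pandas as pd

# Toy stand-in for geo_dimension_df (values assumed for illustration)
geo_dimension_df = pd.DataFrame(
    {'cbsa': [10180, 10420], 'name': ['Abilene, TX', 'Akron, OH']},
    index=pd.Index([1, 2], name='id'))

# reset_index exposes the id index as a column; rename it to geo_id
geo_lookup = geo_dimension_df.reset_index().rename(columns={'id': 'geo_id'})
```

`geo_lookup` can then be merged with `data_df` exactly as below, while `geo_dimension_df` itself stays untouched.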
Next, join `data_df` and `geo_dimension_df` on the shared columns (note the double brackets when selecting multiple columns):

```python
data_df = pd.merge(data_df,
                   geo_dimension_df[['cbsa', 'name', 'geo_id']],
                   on=['cbsa', 'name'],
                   how='left')
```
Finally, drop the helper column you added to `geo_dimension_df` at the start:

```python
geo_dimension_df.drop('geo_id', axis=1, inplace=True)
```
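Run end-to-end on toy stand-ins for the two frames (values assumed for illustration), the three steps look like this. One caveat worth knowing: merging on columns gives the result a fresh RangeIndex, so this sketch carries `data_df`'s `id` index through explicitly with `reset_index`/`set_index`:

```python
import pandas as pd

# Toy versions of the two frames (values assumed for illustration)
geo_dimension_df = pd.DataFrame(
    {'cbsa': [10180, 49740], 'name': ['Abilene, TX', 'Yuma, AZ']},
    index=pd.Index([1, 523], name='id'))
data_df = pd.DataFrame(
    {'cbsa': [10180, 49740], 'name': ['Abilene, TX', 'Yuma, AZ'],
     'month': [1, 12], 'year': [2004, 2018], 'units_total': [22, 68]},
    index=pd.Index([1, 67145], name='id'))

# Step 1: expose the dimension index as a joinable column
geo_dimension_df['geo_id'] = geo_dimension_df.index

# Step 2: left-join on the shared columns, preserving data_df's id index
data_df = pd.merge(data_df.reset_index(),
                   geo_dimension_df[['cbsa', 'name', 'geo_id']],
                   on=['cbsa', 'name'],
                   how='left').set_index('id')

# Step 3: remove the helper column from the dimension table
geo_dimension_df.drop('geo_id', axis=1, inplace=True)
```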
After doing this, `geo_dimension_df`'s index column, `id`, will appear on `data_df` under the column `geo_id`:
data_df:
cbsa name month year units_total geo_id
id
1 10180 Abilene, TX 1 2004 22 1
2 10180 Abilene, TX 2 2004 12 1
3 10180 Abilene, TX 3 2004 44 1
4 10180 Abilene, TX 4 2004 32 1
5 10180 Abilene, TX 5 2004 21 1
...
67145 49740 Yuma, AZ 12 2018 68 523
67146 49740 Yuma, AZ 1 2019 86 523
67147 49740 Yuma, AZ 2 2019 99 523
67148 49740 Yuma, AZ 3 2019 99 523
67149 49740 Yuma, AZ 4 2019 94 523