I'm trying to merge two DataFrames so that the ids match, the categorical variables from each DataFrame are preserved, and the total for each id/category pair is preserved. Sorry if the wording is a bit unclear. Essentially, I am starting with two DataFrames that each assign categories to some number of ids:
>>> print(df_a)
id cat_a sum_a
0 A blue 800
1 B blue 500
2 B green 500
3 C yellow 550
4 D red 1000
>>> print(df_b)
id cat_b sum_b
0 A square 700
1 A triangle 100
2 B circle 700
3 B triangle 300
4 C pentagon 550
5 D line 800
6 D triangle 200
Looking at id=B: in cat_a, 500/1000 are blue and 500/1000 are green; in cat_b, 700/1000 are circles and 300/1000 are triangles.
Both DataFrames have the same total for each id:
>>> print(df_a.groupby('id')['sum_a'].sum() == df_b.groupby('id')['sum_b'].sum())
id
A True
B True
C True
D True
I want to create a new DataFrame, df_c, which combines the categories and distributes the sums into sum_c such that grouping by either category reproduces the sums of the original DataFrames. Here is a handmade example:
>>> print(df_c)
id cat_a cat_b sum_c
0 A blue square 700
1 A blue triangle 100
2 B blue circle 500
3 B green circle 200
4 B green triangle 300
5 C yellow pentagon 550
6 D red line 800
7 D red triangle 200
I can confirm df_c is correct by grouping it back into its constituent DataFrames and checking that each matches the original:
>>> df_c2a = df_c.groupby(['id', 'cat_a'], as_index=False)['sum_c'].sum()
>>> print(np.all(df_a.values == df_c2a.values))
True
>>> df_c2b = df_c.groupby(['id', 'cat_b'], as_index=False)['sum_c'].sum()
>>> print(np.all(df_b.values == df_c2b.values))
True
Currently, I am stumped as to how I might create the third DataFrame, df_c, out of the first two. Any suggestions on the best way to accomplish this?
I've tried doing a left merge on 'id', but I can't get the sums to match:
>>> df_c = df_a.merge(df_b, how='left', on='id')
>>> df_c['sum_c'] = df_c['sum_b']
>>> df_c = df_c.drop(['sum_a', 'sum_b'], axis=1)
>>> df_a_group = df_c.groupby(['id', 'cat_a'], as_index=False)['sum_c'].sum().reset_index(drop=True)
>>> print(df_a)
id cat_a sum_a
0 A blue 800
1 B blue 500
2 B green 500
3 C yellow 550
4 D red 1000
>>> print(df_a_group)
id cat_a sum_c
0 A blue 800
1 B blue 1000
2 B green 1000
3 C yellow 550
4 D red 1000
You can merge both DataFrames:
df_c = df_a.merge(df_b, on='id', how='outer')
df_c['sum_c'] = df_c.apply(lambda x: x['sum_b'] / 2 if x['id'] == 'B' else x['sum_b'], axis=1)
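If your data looks different, you just need to work out how to distribute the sums. In your example that is only necessary for id B, the only id with more than one category in both DataFrames. Halving works there because blue and green each hold half of B's total, so the result is 350/150 for blue and 350/150 for green, which differs from your handmade df_c but still reproduces both sets of original sums.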