
Merging two pandas DataFrames on a numerical column with different values, preserving the unique variables from each

I'm trying to merge two DataFrames so that the ids match, the categorical variables from each DataFrame are preserved, and the sum total for each id/category pair is preserved. Sorry if the wording is a bit unclear. Essentially, I am starting with two DataFrames that each assign a category to some number of ids:

>>> print(df_a)
  id   cat_a  sum_a
0  A    blue    800
1  B    blue    500
2  B   green    500
3  C  yellow    550
4  D     red   1000

>>> print(df_b)
  id     cat_b  sum_b
0  A    square    700
1  A  triangle    100
2  B    circle    700
3  B  triangle    300
4  C  pentagon    550
5  D      line    800
6  D  triangle    200

Looking at id=B, in cat_a 500/1000 are blue, 500/1000 are green, and in cat_b 700/1000 are circles, 300/1000 are triangles.

Both DataFrames have the same totals for each ID:

>>> print(df_a.groupby('id')['sum_a'].sum() == df_b.groupby('id')['sum_b'].sum())
id
A    True
B    True
C    True
D    True

I want to create a new DataFrame, df_c, which combines the categories and distributes the sums in sum_c, such that the original sums are still in accordance with their original DataFrames. Here is a handmade example:

>>> print(df_c)
  id   cat_a     cat_b  sum_c
0  A    blue    square    700
1  A    blue  triangle    100
2  B    blue    circle    500
3  B   green    circle    200
4  B   green  triangle    300
5  C  yellow  pentagon    550
6  D     red      line    800
7  D     red  triangle    200

I can confirm df_c is correct by grouping it back into its constituent DataFrames and checking that each one matches the original:

>>> df_c2a = df_c.groupby(['id', 'cat_a'], as_index=False)['sum_c'].sum()
>>> print(np.all(df_a.values == df_c2a.values))
True

>>> df_c2b = df_c.groupby(['id', 'cat_b'], as_index=False)['sum_c'].sum()
>>> print(np.all(df_b.values == df_c2b.values))
True

Currently, I am stumped as to how I might create the third DataFrame, df_c, from the first two. Any suggestions on the best way to accomplish this?


I've tried a left merge on 'id', but I can't get the sums to match. For id B, each cat_a row ends up with the full 1000 instead of 500:

>>> df_c = df_a.merge(df_b, how='left', on='id')
>>> df_c['sum_c'] = df_c['sum_b']
>>> df_c = df_c.drop(['sum_a', 'sum_b'], axis=1)
>>> df_a_group = df_c.groupby(['id', 'cat_a'], as_index=False)['sum_c'].sum().reset_index(drop=True)
>>> print(df_a)
  id   cat_a  sum_a
0  A    blue    800
1  B    blue    500
2  B   green    500
3  C  yellow    550
4  D     red   1000
>>> print(df_a_group)
  id   cat_a  sum_c
0  A    blue    800
1  B    blue   1000
2  B   green   1000
3  C  yellow    550
4  D     red   1000

You can merge both DataFrames on id and then rescale the sums where needed:

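# Pair every cat_a row with every cat_b row that shares an id
df_c = df_a.merge(df_b, on='id', how='outer')
# id B has two cat_a rows of equal size, so each pairing receives half of sum_b
df_c['sum_c'] = df_c.apply(lambda x: x['sum_b'] / 2 if x['id'] == 'B' else x['sum_b'], axis=1)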

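If your data looks different, you need to work out how to distribute the sums for each id. In your example this is only necessary for id B, the only id with more than one category in both DataFrames. Halving works there because B's sum_a is split evenly between blue and green.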
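One way to avoid hard-coding the id is to split each pairing proportionally. This is a minimal sketch, not the only valid answer: it assumes that, within an id, the cat_a and cat_b breakdowns are independent of each other (which the question does not state) and that every id has a non-zero total. Under that assumption each pairing gets sum_a * sum_b / total, where total is the per-id sum. The DataFrames are rebuilt from the question so the snippet runs on its own.

```python
import pandas as pd

# Input data copied from the question
df_a = pd.DataFrame({
    "id": ["A", "B", "B", "C", "D"],
    "cat_a": ["blue", "blue", "green", "yellow", "red"],
    "sum_a": [800, 500, 500, 550, 1000],
})
df_b = pd.DataFrame({
    "id": ["A", "A", "B", "B", "C", "D", "D"],
    "cat_b": ["square", "triangle", "circle", "triangle",
              "pentagon", "line", "triangle"],
    "sum_b": [700, 100, 700, 300, 550, 800, 200],
})

# Pair every cat_a row with every cat_b row that shares the same id
df_c = df_a.merge(df_b, on="id", how="inner")

# Per-id total, looked up for each merged row
# (the question confirms the totals agree between df_a and df_b)
total = df_c["id"].map(df_a.groupby("id")["sum_a"].sum())

# Proportional split (assumption: the two categorisations are independent)
df_c["sum_c"] = df_c["sum_a"] * df_c["sum_b"] / total
df_c = df_c.drop(columns=["sum_a", "sum_b"])

print(df_c)
```

For id B this gives four rows (350, 150, 350, 150) rather than the three rows of the handmade df_c, but grouping by ['id', 'cat_a'] or ['id', 'cat_b'] still returns the original sum_a and sum_b values. Note that sum_c becomes a float column because of the division.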

