Merge multiple DataFrames Pandas

Question

This might be considered a duplicate of a thorough explanation of various approaches; however, I can't seem to find a solution to my problem there because I have a larger number of DataFrames.

I have multiple DataFrames (more than 10), each differing in one column, VARX. This is just a quick and oversimplified example:

import pandas as pd

df1 = pd.DataFrame({'depth': [0.500000, 0.600000, 1.300000],
       'VAR1': [38.196202, 38.198002, 38.200001],
       'profile': ['profile_1', 'profile_1','profile_1']})

df2 = pd.DataFrame({'depth': [0.600000, 1.100000, 1.200000],
       'VAR2': [0.20440, 0.20442, 0.20446],
       'profile': ['profile_1', 'profile_1','profile_1']})

df3 = pd.DataFrame({'depth': [1.200000, 1.300000, 1.400000],
       'VAR3': [15.1880, 15.1820, 15.1820],
       'profile': ['profile_1', 'profile_1','profile_1']})

Each df has the same or different depths for the same profiles, so I need to create a new DataFrame that merges all the separate ones. The key columns for the operation are depth and profile, and the result should contain every depth value that appears for each profile.

The VARX value should therefore be NaN where there is no depth measurement of that variable for that profile.

The result should thus be a new, compressed DataFrame with all VARX as additional columns to the depth and profile ones, something like this:

name_profile    depth   VAR1        VAR2        VAR3
profile_1   0.500000    38.196202   NaN         NaN
profile_1   0.600000    38.198002   0.20440     NaN
profile_1   1.100000    NaN         0.20442     NaN
profile_1   1.200000    NaN         0.20446     15.1880
profile_1   1.300000    38.200001   NaN         15.1820
profile_1   1.400000    NaN         NaN         15.1820

Note that the actual number of profiles is much, much bigger.

Any ideas?

Answer 1

Consider setting the index on each DataFrame and then running a horizontal merge with pd.concat:

dfs = [df.set_index(['profile', 'depth']) for df in [df1, df2, df3]]

print(pd.concat(dfs, axis=1).reset_index())
#      profile  depth       VAR1     VAR2    VAR3
# 0  profile_1    0.5  38.196202      NaN     NaN
# 1  profile_1    0.6  38.198002  0.20440     NaN
# 2  profile_1    1.1        NaN  0.20442     NaN
# 3  profile_1    1.2        NaN  0.20446  15.188
# 4  profile_1    1.3  38.200001      NaN  15.182
# 5  profile_1    1.4        NaN      NaN  15.182
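
The same idea can be wrapped in a small helper. This is only a sketch: the name concat_on_keys is hypothetical, and it assumes each frame has at most one row per (profile, depth) pair, since pd.concat with axis=1 cannot align on a duplicated index.

```python
import pandas as pd

df1 = pd.DataFrame({'depth': [0.5, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})
df3 = pd.DataFrame({'depth': [1.2, 1.3, 1.4],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1'] * 3})


def concat_on_keys(frames, keys=('profile', 'depth')):
    """Align frames on the key columns and put their value columns side by side."""
    keys = list(keys)
    indexed = [frame.set_index(keys) for frame in frames]
    # axis=1 aligns on the index; the default join is 'outer',
    # so every (profile, depth) pair seen in any frame is kept.
    wide = pd.concat(indexed, axis=1)
    # Keep the level names explicit, sort by the keys,
    # then turn the keys back into ordinary columns.
    return wide.rename_axis(keys).sort_index().reset_index()


result = concat_on_keys([df1, df2, df3])
print(result)
```

Sorting the index is optional; it is included here only so the rows come out ordered by profile and depth, as in the desired output.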

Answer 2

A simple way is to combine functools.partial and functools.reduce.

First, partial "freezes" some of a function's arguments and/or keywords, producing a new callable with a simplified signature. Then reduce applies that callable cumulatively to the items of an iterable (here, the list of DataFrames):

from functools import partial, reduce

dfs = [df1, df2, df3]
merge = partial(pd.merge, on=['depth', 'profile'], how='outer')
reduce(merge, dfs)

   depth       VAR1    profile     VAR2    VAR3
0    0.5  38.196202  profile_1      NaN     NaN
1    0.6  38.198002  profile_1  0.20440     NaN
2    1.3  38.200001  profile_1      NaN  15.182
3    1.1        NaN  profile_1  0.20442     NaN
4    1.2        NaN  profile_1  0.20446  15.188
5    1.4        NaN  profile_1      NaN  15.182
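
To get the layout asked for in the question, the merged frame can be sorted and its columns reordered afterwards. The sketch below is one way to do that. The validate='one_to_one' argument is an addition that is not part of the answer above; it makes pd.merge raise an error if a frame has duplicate (profile, depth) pairs instead of silently multiplying rows.

```python
from functools import partial, reduce

import pandas as pd

df1 = pd.DataFrame({'depth': [0.5, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})
df3 = pd.DataFrame({'depth': [1.2, 1.3, 1.4],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1'] * 3})

keys = ['profile', 'depth']

# Freeze the merge options so reduce only has to pass (left, right).
merge = partial(pd.merge, on=keys, how='outer', validate='one_to_one')
merged = reduce(merge, [df1, df2, df3])

# Put the key columns first, then order rows by profile and depth.
value_cols = [c for c in merged.columns if c not in keys]
merged = merged[keys + value_cols].sort_values(keys).reset_index(drop=True)
print(merged)
```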

Answer 3

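I would use append. (Note that DataFrame.append was removed in pandas 2.0; pd.concat([df1, df2, df3]) stacks the frames the same way, though the column order may differ.)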

>>> df1.append(df2).append(df3).sort_values('depth')

        VAR1     VAR2    VAR3  depth    profile
0  38.196202      NaN     NaN    0.5  profile_1
1  38.198002      NaN     NaN    0.6  profile_1
0        NaN  0.20440     NaN    0.6  profile_1
1        NaN  0.20442     NaN    1.1  profile_1
2        NaN  0.20446     NaN    1.2  profile_1
0        NaN      NaN  15.188    1.2  profile_1
2  38.200001      NaN     NaN    1.3  profile_1
1        NaN      NaN  15.182    1.3  profile_1
2        NaN      NaN  15.182    1.4  profile_1

Obviously if you have a lot of dataframes, just make a list and loop through them.
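
The stacked frame above still has one row per original measurement, so a depth that appears in several frames is repeated. A minimal sketch of one way to collapse those rows follows; the groupby step is an addition that is not part of this answer, and it assumes that each variable has at most one value per (profile, depth) pair.

```python
import pandas as pd

df1 = pd.DataFrame({'depth': [0.5, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})
df3 = pd.DataFrame({'depth': [1.2, 1.3, 1.4],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1'] * 3})

# Stack all frames vertically; columns missing from a frame become NaN.
stacked = pd.concat([df1, df2, df3], ignore_index=True)

# first() takes the first non-null value per column within each group,
# which folds the partially filled rows of a (profile, depth) pair into one.
collapsed = stacked.groupby(['profile', 'depth'], as_index=False).first()
print(collapsed)
```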

Answer 4

Why not concatenate all the Data Frames, melt, then reform them using your ids? There might be a more efficient way to do this, but this works.

df=pd.melt(pd.concat([df1,df2,df3]),id_vars=['profile','depth'])
df_pivot=df.pivot_table(index=['profile','depth'],columns='variable',values='value')

Where df_pivot will be

variable              VAR1     VAR2    VAR3
profile   depth                            
profile_1 0.5    38.196202      NaN     NaN
          0.6    38.198002  0.20440     NaN
          1.1          NaN  0.20442     NaN
          1.2          NaN  0.20446  15.188
          1.3    38.200001      NaN  15.182
          1.4          NaN      NaN  15.182
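
df_pivot keeps profile and depth in a MultiIndex and labels the column axis "variable". If a flat table like the one in the question is wanted, a sketch along these lines should work. Keep in mind that pivot_table aggregates with the mean by default, so duplicate measurements for the same profile and depth would be averaged.

```python
import pandas as pd

df1 = pd.DataFrame({'depth': [0.5, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})
df3 = pd.DataFrame({'depth': [1.2, 1.3, 1.4],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1'] * 3})

# Long format: one row per (profile, depth, variable, value).
long_df = pd.melt(pd.concat([df1, df2, df3]), id_vars=['profile', 'depth'])

# Wide format again, one column per variable.
df_pivot = long_df.pivot_table(index=['profile', 'depth'],
                               columns='variable', values='value')

# Move the index levels back into columns and drop the 'variable' axis label.
flat = df_pivot.reset_index().rename_axis(None, axis=1)
print(flat)
```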

Answer 5

You can also use:

dfs = [df1, df2, df3]
keys = ['depth', 'profile']
# Outer-merge the first two frames, then fold in the remaining ones
df = pd.merge(dfs[0], dfs[1], on=keys, how='outer')
for d in dfs[2:]:
    df = pd.merge(df, d, on=keys, how='outer')

   depth       VAR1    profile     VAR2    VAR3
0    0.5  38.196202  profile_1      NaN     NaN
1    0.6  38.198002  profile_1  0.20440     NaN
2    1.3  38.200001  profile_1      NaN  15.182
3    1.1        NaN  profile_1  0.20442     NaN
4    1.2        NaN  profile_1  0.20446  15.188
5    1.4        NaN  profile_1      NaN  15.182

solution1
16 ACCEPTED 2019-04-12 13:45:36

solution2
14 2019-04-12 13:47:20

solution3
1 2019-04-12 13:52:53

solution4
1 2019-04-12 13:59:55

solution5
1 2019-04-12 14:23:43
