
Joining dataframes whose columns have the same name

I would like to ask how to join (or merge) an arbitrary number of dataframes whose columns may have the same name. I know this has been asked several times, but I could not find a clear answer in any of the questions I have looked at.

import numpy as np
import pandas as pd

np.random.seed(1)
n_cols = 3
col_names  = ["Ci"] + ["C"+str(i) for i in range(n_cols)]
def get_random_df():
    values = np.random.randint(0, 10, size=(4,n_cols))
    index = np.arange(4).reshape([4,-1])
    return pd.DataFrame(np.concatenate([index, values], axis=1), columns=col_names).set_index("Ci")

dfs = []
for i in range(3):
    dfs.append(get_random_df())
    
print(dfs[0])
print(dfs[1])

with output:

    C0  C1  C2
Ci            
0    5   8   9
1    5   0   0
2    1   7   6
3    9   2   4
    C0  C1  C2
Ci            
0    5   2   4
1    2   4   7
2    7   9   1
3    7   0   6

If I try and join two dataframes per iteration:

# join two dataframes per iteration
df = dfs[0]
for df_ in dfs[1:]:
    df = df.join(df_, how="outer", rsuffix="_r")
print("** 1 **")
print(df)

the final dataframe has columns with the same name: for example, C0_r is repeated for each joined dataframe.

** 1 **
    C0  C1  C2  C0_r  C1_r  C2_r  C0_r  C1_r  C2_r
Ci                                                
0    5   8   9     5     2     4     9     9     7
1    5   0   0     2     4     7     6     9     1
2    1   7   6     7     9     1     0     1     8
3    9   2   4     7     0     6     8     3     9

This could easily be solved by providing a different suffix per iteration. However, the documentation for join says: "Efficiently join multiple DataFrame objects by index at once by passing a list." If I try the following:

# concatenate all at once
df = dfs[0].join(dfs[1:], how="outer")
# fails


# concatenate all at once
df = dfs[0].join(dfs[1:], how="outer", rsuffix="_r")
# fails

Both attempts fail due to duplicate columns:

 Indexes have overlapping values: Index(['C0', 'C1', 'C2'], dtype='object')

Question: is there a way to automatically join multiple dataframes without explicitly providing a different suffix every time?
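For comparison, the per-iteration suffix workaround mentioned above could look roughly like the sketch below. The toy `frames` list and the `_r{k}` suffix pattern are assumptions made for illustration, not part of the original code.

```python
import pandas as pd

# Hypothetical stand-ins for the question's `dfs`: three frames with identical column names.
frames = [pd.DataFrame({"C0": [i, i + 1], "C1": [i + 2, i + 3]}) for i in range(3)]

joined = frames[0]
for k, other in enumerate(frames[1:], start=1):
    # A distinct right-hand suffix per iteration keeps every column name unique.
    joined = joined.join(other, how="outer", rsuffix=f"_r{k}")

print(list(joined.columns))
```

This works, but the suffix still has to be built by hand on every iteration, which is what the question is trying to avoid.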

Instead of join, concatenate along the columns:

# concatenate along columns
# use keys to differentiate different dfs
res = pd.concat(dfs, keys=range(len(dfs)), axis=1)
# flatten column names
res.columns = [f"{j}_{i}" for i,j in res.columns]
res
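As a self-contained sketch of the same idea, the snippet below uses two small made-up frames (the names `frames` and `wide` are illustrative only) whose indexes overlap only partially. It assumes that `pd.concat` with its default `join="outer"` is an acceptable replacement for `how="outer"`.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: same column names, partially overlapping indexes.
rng = np.random.default_rng(0)
frames = [
    pd.DataFrame(rng.integers(0, 10, size=(3, 2)), columns=["C0", "C1"], index=[0, 1, 2]),
    pd.DataFrame(rng.integers(0, 10, size=(3, 2)), columns=["C0", "C1"], index=[1, 2, 3]),
]

# join="outer" is the default; it is spelled out here to mirror how="outer".
wide = pd.concat(frames, keys=range(len(frames)), axis=1, join="outer")

# The keys become the first level of a column MultiIndex; flatten to "<column>_<key>".
wide.columns = [f"{col}_{key}" for key, col in wide.columns]

print(list(wide.columns))
print(wide)
```

Rows that are missing from one of the inputs are filled with NaN, just as they would be in an outer join.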


Wouldn't it be more readable to display your data like this?

You can do so by adding this line of code at the end:

pd.concat(dfs, axis=1, keys=[f'DF{i+1}' for i in range(len(dfs))])

#output

   DF1          DF2         DF3
   C0   C1  C2  C0  C1  C2  C0  C1  C2
Ci                                  
0   5   8   9   5   2   4   9   9   7
1   5   0   0   2   4   7   6   9   1
2   1   7   6   7   9   1   0   1   8
3   9   2   4   7   0   6   8   3   9
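If the two-level column header is kept, individual pieces can still be pulled out without flattening. The sketch below uses small made-up frames; the names `frames`, `stacked`, `first` and `c0_all` are illustrative and not from the original post.

```python
import pandas as pd

# Hypothetical frames sharing the same column names.
frames = [
    pd.DataFrame({"C0": [5, 5], "C1": [8, 0]}),
    pd.DataFrame({"C0": [5, 2], "C1": [2, 4]}),
]
labels = [f"DF{i + 1}" for i in range(len(frames))]
stacked = pd.concat(frames, axis=1, keys=labels)

# All columns that came from the first frame.
first = stacked["DF1"]

# Column C0 from every frame, one column per source frame.
c0_all = stacked.xs("C0", axis=1, level=1)

print(first)
print(c0_all)
```

Selecting on the outer level returns the original frame's columns, while `xs` on the inner level gathers the same column across all frames.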

