简体   繁体   中英

How would I combine Pandas DataFrames with slightly different columns

I'm working with a few DataFrames. One of them has some added columns, and some removed columns. The first column of both is a date column, and df2 picks up where the df1 left off, chronologically.

df1:
day     alice  bob
8/11    0      0
8/25    2      5
9/1     2      0

df2:
day     alice  charlie
9/12    1      1
9/25    2      3
9/1     2      1

I would like to combine them, to get one DataFrame with all dates and all columns present.

df3: 
day     alice  bob  charlie
8/11    0      0      0 
8/25    2      5      0 
9/1     2      0      0
9/12    1      0      1
9/25    2      0      3
9/1     2      0      1

When I combine my actual 2 DataFrames with pd.concat or pd append, i get the following error:

AssertionError: Number of manager items must equal union of block items
# manager items: 65, # tot_items: 66

Not sure what the issue could be. In the meantime, thanks and you're awesome. I can provide to you a more fleshed out example if necessary.

Solution is correct:

df = pd.concat([df1,df2]).fillna(0)

But here is error:

AssertionError: Number of manager items must equal union of block items # manager items: 65, # tot_items: 66

It means there are duplicated columns names, you can verify it:

print (df1.loc[:, df1.columns.duplicated(keep=False)])
print (df2.loc[:, df2.columns.duplicated(keep=False)])

If same values in columns is possible remove duplicated by:

df1 = df1.loc[:, ~df1.columns.duplicated()]
df2 = df2.loc[:, ~df2.columns.duplicated()]
df = pd.concat([df1,df2]).fillna(0)

I think this will work:

df3 = pd.concat([df1.set_index("day"), df2.set_index("day")]).fillna(value=0).reset_index()
df1['charlie']=0
df2['bob']=0
df=pd.concat([df1,df2]).reset_index().drop(['index'],axis=1)

Explanation: You can add the columns missing in the respective data frames. Now, if you concat(), no error will be raised.

In order to get a DataFrame with all dates and columns present, you can use DataFrame.combine_first .

import pandas as pd
import numpy as np

# Create df1
df1 = pd.DataFrame({
    'day': ['8/11', '8/25', '9/1'],
    'alice': [0, 2, 2],
    'bob': [0, 5, 0]
}).set_index('day')

# Create df2
df2 = pd.DataFrame({
    'day': ['9/12', '9/25', '9/1'],
    'alice': [1, 2, 2],
    'charlie': [1, 3, 1]
}).set_index('day')

# Do combine_first
df = df2.combine_first(df1).fillna(0)

Initial dataframes

df1

      alice  bob
day             
8/11      0    0
8/25      2    5
9/1       2    0

df2

      alice  charlie
day                 
9/12      1        1
9/25      2        3
9/1       2        1

Final dataframe df

      alice  bob  charlie
day                      
8/11    0.0  0.0      0.0
8/25    2.0  5.0      0.0
9/1     2.0  0.0      1.0
9/12    1.0  0.0      1.0
9/25    2.0  0.0      3.0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM