
How to merge two dataframes with different lengths in python

I am trying to merge two weekly DataFrames, each made up of a single column, but with different lengths.

Could I please know how to merge them, maintaining the 'Week' indexing?

[df1]

Week              Coeff1      
1               -0.456662
1               -0.533774
1               -0.432871
1               -0.144993
1               -0.553376
...                   ...
53              -0.501221
53              -0.025225
53               1.529864
53               0.044380
53              -0.501221
[16713 rows x 1 columns]

[df2]

Week               Coeff    
1                 0.571707
1                 0.086152
1                 0.824832
1                -0.037042
1                 1.167451
...                    ...
53               -0.379374
53                1.076622
53               -0.547435
53               -0.638206
53                0.067848
[63265 rows x 1 columns]

I've tried this code:

df3 = pd.merge(df1, df2, how='inner', on='Week')
df3 = df3.drop_duplicates()
df3

But it gave me a new df (df3) with 13386431 rows × 2 columns

Desired outcome: a new df with 3 columns (week, coeff1, coeff2). As df2 is longer, I expect some NaNs in coeff1 to fill the gaps.

I assume your output should look somewhat like this:

Week Coeff1 Coeff2
1 -0.456662 0.571707
1 -0.533774 0.086152
1 -0.432871 0.824832
2 3 3
2 NaN 3

Don't mind the actual numbers, though. The problem is that you won't get this from a join on Week, whether left or inner, because the Week index is not unique. On a left join, pandas joins every Coeff2 value where df2.Week == 1 onto every single row of df1 where df1.Week == 1. That is why you get millions of rows.
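To make the row explosion concrete, here is a minimal sketch on made-up toy data (the values and frame sizes are assumptions for illustration, not the asker's data). Each week contributes rows-in-df1 times rows-in-df2 to the result:

```python
import pandas as pd

# Toy data, invented for illustration: Week repeats in both frames.
df1 = pd.DataFrame({"Week": [1, 1, 2], "Coeff1": [0.1, 0.2, 0.3]})
df2 = pd.DataFrame({"Week": [1, 1, 1, 2], "Coeff2": [1.0, 2.0, 3.0, 4.0]})

merged = pd.merge(df1, df2, how="inner", on="Week")

# Week 1: 2 rows x 3 rows = 6 combinations; week 2: 1 x 1 = 1.
print(len(merged))  # 7
```

With thousands of rows per week on both sides, the same multiplication quickly reaches millions of rows.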

I will try and give you a workaround later, but maybe this helps you to think about this problem from another perspective!

Now is later:

What you actually want is to concatenate the DataFrames "per week". You can do that by iterating over the weeks: for each week, concatenate df1[week] and df2[week] along axis=1 into a per-week subset, then concatenate all of these subsets along axis=0:

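import pandas as pd

# Assumes Week is a regular column (call reset_index() on df1 and df2 first if it is the index)
# and that df2's value column is named "Coeff2".
weekly_dfs = []
for week in df1.Week.unique():
    # Take this week's values and renumber them from 0 so both sides line up by position.
    sub_df1 = df1.loc[df1.Week == week, "Coeff1"].reset_index(drop=True)
    sub_df2 = df2.loc[df2.Week == week, "Coeff2"].reset_index(drop=True)
    # Put them side by side; the shorter column is padded with NaN.
    concat_df = pd.concat([sub_df1, sub_df2], axis=1)
    concat_df["Week"] = week
    weekly_dfs.append(concat_df)
# Stack the per-week blocks vertically.
df3 = pd.concat(weekly_dfs).reset_index(drop=True)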

The final index reset is optional, but I recommend it anyway!
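If the loop is too slow on larger data, a possible alternative (not part of the answer above, just a sketch of a different technique) is to number the rows within each week with groupby().cumcount() and then merge on both Week and that counter. The toy data and the helper column name pos are assumptions for illustration, and Week is assumed to be a regular column:

```python
import pandas as pd

# Toy data, invented for illustration.
df1 = pd.DataFrame({"Week": [1, 1, 2], "Coeff1": [0.1, 0.2, 0.3]})
df2 = pd.DataFrame({"Week": [1, 1, 1, 2], "Coeff2": [1.0, 2.0, 3.0, 4.0]})

# Number the rows inside each week: 0, 1, 2, ...
# "pos" is a hypothetical helper column name.
left = df1.assign(pos=df1.groupby("Week").cumcount())
right = df2.assign(pos=df2.groupby("Week").cumcount())

# (Week, pos) is unique in each frame, so the outer merge pairs rows one-to-one
# and leaves NaN where one side has fewer rows for that week.
df3 = (
    pd.merge(left, right, how="outer", on=["Week", "pos"], validate="one_to_one")
    .sort_values(["Week", "pos"])
    .drop(columns="pos")
    .reset_index(drop=True)
)
print(df3)
```

Because the merge key is now unique on both sides, the result has at most max(len(df1[week]), len(df2[week])) rows per week instead of their product.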

Based on your last comment on the question, you may want to concatenate instead of merging the two data frames:

df3 = pd.concat([df1,df2], ignore_index=True, axis=1)

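The resulting DataFrame should have 63265 rows and will need some work to get it into the required format (remove the added index columns, rename the remaining columns, etc.), but pd.concat should be a good start. Keep in mind that pd.concat with axis=1 aligns rows on the index, so with a repeated Week index you may need to reset the index on both frames first.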
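As a rough sketch of what such a positional concatenation looks like, here is a toy example (made-up data; Week is assumed to be the index, as in the question, and is moved back into a column first so that both frames get a unique 0..n-1 index):

```python
import pandas as pd

# Toy frames indexed by a non-unique Week, invented for illustration.
df1 = pd.DataFrame({"Week": [1, 1, 2], "Coeff1": [0.1, 0.2, 0.3]}).set_index("Week")
df2 = pd.DataFrame({"Week": [1, 1, 1, 2], "Coeff2": [1.0, 2.0, 3.0, 4.0]}).set_index("Week")

# Move Week out of the index so rows can be paired by position.
a = df1.reset_index()
b = df2.reset_index()

# Columns come out as Week, Coeff1, Week, Coeff2; the shorter frame is padded with NaN.
side_by_side = pd.concat([a, b], axis=1)
print(side_by_side)
```

Note that rows are paired purely by position here: in the third row, week 2 from df1 sits next to week 1 from df2. That is part of the "some work" needed afterwards if the weeks have to line up.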

According to the pandas merge documentation, you can use merge like this:

What you are looking for is a left join. However, the default option is an inner join. You can change this by passing a different how argument:

df2.merge(df1, how='left', left_on='Week', right_on='Week')

Note that this keeps all rows of the bigger df and fills in NaN where the shorter df has no matching row.
