How to merge two DataFrames with different lengths in Python

I am trying to merge two weekly DataFrames, each made up of a single column, but with different lengths.

Could I please know how to merge them while maintaining the 'Week' indexing?

[df1]

Week              Coeff1      
1               -0.456662
1               -0.533774
1               -0.432871
1               -0.144993
1               -0.553376
...                   ...
53              -0.501221
53              -0.025225
53               1.529864
53               0.044380
53              -0.501221
[16713 rows x 1 columns]

[df2]

Week               Coeff    
1                 0.571707
1                 0.086152
1                 0.824832
1                -0.037042
1                 1.167451
...                    ...
53               -0.379374
53                1.076622
53               -0.547435
53               -0.638206
53                0.067848
[63265 rows x 1 columns]

I've tried this code:

df3 = pd.merge(df1, df2, how='inner', on='Week')
df3 = df3.drop_duplicates()
df3

But it gave me a new df (df3) with 13386431 rows × 2 columns.

Desired outcome: a new df with three columns (Week, Coeff1, Coeff2). As df2 is longer, I expect to have some NaNs in Coeff1 to fill the gaps.

I assume your output should look somewhat like this:

Week       Coeff1       Coeff2
1       -0.456662     0.571707
1       -0.533774     0.086152
1       -0.432871     0.824832
2               3            3
2             NaN            3

Don't mind the actual numbers, though. The problem is that you won't achieve that with a join on Week, whether left or inner, because the Week index is not unique. On a left join, pandas pairs every Coeff2 value where df2.Week == 1 with every single row in df1 where df1.Week == 1, and likewise for every other week. That is why you get millions of rows.
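To make the row explosion concrete, here is a minimal sketch on made-up toy data (the frames and values below are hypothetical, and Week is assumed to be a regular column rather than the index). Each week contributes (rows in df1) × (rows in df2) rows to the merged result:

```python
import pandas as pd

# Hypothetical toy data: week 1 appears twice in df1 and three times in df2.
df1 = pd.DataFrame({"Week": [1, 1], "Coeff1": [-0.46, -0.53]})
df2 = pd.DataFrame({"Week": [1, 1, 1], "Coeff2": [0.57, 0.09, 0.82]})

# Every df1 row with Week == 1 is paired with every df2 row with Week == 1.
merged = pd.merge(df1, df2, how="inner", on="Week")

# Predict the merged size: per-week product of the group sizes, summed over weeks.
expected_rows = int((df1["Week"].value_counts() * df2["Week"].value_counts()).sum())

print(len(merged), expected_rows)  # both are 2 * 3 = 6
```

Running the same value_counts product on the real frames should show where the millions of rows come from before the merge is even attempted.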

I will try and give you a workaround later, but maybe this helps you to think about this problem from another perspective!

Now is later:

What you actually want to do is concatenate the DataFrames "per week". You achieve that by iterating over every week, building a subset for that week by concatenating df1[week] and df2[week] along axis=1, and then concatenating all these subsets along axis=0 afterwards:

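import pandas as pd

# Assumes Week is a regular column and df2's value column is named "Coeff2";
# call reset_index() or rename first if that is not the case.
weekly_dfs = []
for week in df1.Week.unique():
    # Renumber each week's rows from 0 so the two subsets align by position
    sub_df1 = df1.loc[df1.Week == week, "Coeff1"].reset_index(drop=True)
    sub_df2 = df2.loc[df2.Week == week, "Coeff2"].reset_index(drop=True)
    # Place them side by side; the shorter subset is padded with NaN
    concat_df = pd.concat([sub_df1, sub_df2], axis=1)
    concat_df["Week"] = week
    weekly_dfs.append(concat_df)
# Stack the weekly pieces on top of each other
df3 = pd.concat(weekly_dfs).reset_index(drop=True)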

The last reset of the index is optional, but I recommend it anyway!
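If the loop turns out to be slow, a possible alternative (a different technique from the loop above, not part of the original answer) is to number the rows within each week with groupby().cumcount() and merge on both Week and that counter. This is only a sketch on hypothetical toy data; the helper column name pos is made up, and Week is again assumed to be a regular column:

```python
import pandas as pd

# Hypothetical toy data with different numbers of rows per week.
df1 = pd.DataFrame({"Week": [1, 1, 2], "Coeff1": [-0.46, -0.53, 0.10]})
df2 = pd.DataFrame({"Week": [1, 1, 1, 2, 2],
                    "Coeff2": [0.57, 0.09, 0.82, -0.38, 1.08]})

# Number the rows within each week: 0, 1, 2, ... ("pos" is an illustrative name).
df1["pos"] = df1.groupby("Week").cumcount()
df2["pos"] = df2.groupby("Week").cumcount()

# (Week, pos) is unique in each frame, so the outer merge pairs rows one-to-one
# and leaves NaN where one frame has fewer rows for a given week.
df3 = (
    pd.merge(df1, df2, how="outer", on=["Week", "pos"])
    .sort_values(["Week", "pos"])
    .drop(columns="pos")
    .reset_index(drop=True)
)
print(df3)
```

The result has as many rows per week as the longer of the two frames has for that week, which matches the shape of the loop's output.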

Based on your last comment on the question, you may want to concatenate instead of merging the two data frames:

df3 = pd.concat([df1,df2], ignore_index=True, axis=1)

The resulting DataFrame should have 63265 rows and will need some work to get it into the required format (remove the added index columns, rename the remaining columns, etc.), but pd.concat should be a good start.
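A rough sketch of that clean-up on hypothetical toy data follows. It assumes Week is the index of both frames, as in the question's printout, and moves it into a column first so that each frame has a unique 0..n-1 index to align on. The column names Week1 and Week2 are made up for illustration. Note that this pairs rows purely by position, not by week:

```python
import pandas as pd

# Hypothetical toy data with Week as a (non-unique) index.
df1 = pd.DataFrame({"Coeff1": [-0.46, -0.53]},
                   index=pd.Index([1, 2], name="Week"))
df2 = pd.DataFrame({"Coeff2": [0.57, 0.09, 0.82]},
                   index=pd.Index([1, 1, 2], name="Week"))

# Turn Week into a column so both frames get a unique positional index.
left = df1.reset_index()
right = df2.reset_index()

# Side-by-side concatenation; ignore_index=True relabels the columns 0..3.
df3 = pd.concat([left, right], axis=1, ignore_index=True)
df3.columns = ["Week1", "Coeff1", "Week2", "Coeff2"]

# Keep the Week column of the longer frame and drop the other one.
df3 = df3[["Week2", "Coeff1", "Coeff2"]].rename(columns={"Week2": "Week"})
print(df3)
```

The result is as long as the longer frame, with NaN in Coeff1 at the bottom. Because the alignment is positional, a Coeff1 value from one week can end up next to a Coeff2 value from another, so it is worth checking whether that is acceptable for your data.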

According to pandas' merge documentation, you can use merge like this:

What you are looking for is a left join. However, the default option is an inner join. You can change this by passing a different how argument:

df2.merge(df1,how='left', left_on='Week', right_on='Week')

Note that this keeps every row of the bigger df and assigns NaN to the rows that find no match in the shorter df.
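A small sketch of how how='left' behaves, on hypothetical toy data with Week as a regular column: rows of the left frame whose week is missing from the right frame are kept and get NaN, while weeks present in both frames with repeated keys are still paired in every combination.

```python
import pandas as pd

# Hypothetical toy data: week 2 exists only in df2.
df1 = pd.DataFrame({"Week": [1, 1], "Coeff1": [-0.46, -0.53]})
df2 = pd.DataFrame({"Week": [1, 1, 2], "Coeff2": [0.57, 0.09, -0.38]})

# df2 is the left frame, so all of its rows survive the join.
left_joined = df2.merge(df1, how="left", on="Week")

# Week 1: 2 rows in df2 x 2 rows in df1 = 4 rows; week 2: 1 row with NaN.
print(left_joined)
print(len(left_joined))
```

Comparing len(left_joined) with len(df2) is a quick way to check whether the join duplicated rows on your own data.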
