
Pandas merge two dataframes with different columns

I'm surely missing something simple here. I'm trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.

>df_may

  id  quantity  attr_1  attr_2
0  1        20       0       1
1  2        23       1       1
2  3        19       1       1
3  4        19       0       0

>df_jun

  id  quantity  attr_1  attr_3
0  5         8       1       0
1  6        13       0       1
2  7        20       1       1
3  8        25       1       1

I've tried joining with an outer join:

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")

But that yields:

Left data columns not unique: Index([....

I've also specified a single column to join on (on="id", for example), but that duplicates every column except id, producing attr_1_x, attr_1_y and so on, which is not ideal. I've also passed the entire list of columns (there are many) to on:

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))

Which yields:

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

What am I missing? I'd like to get a df with all rows appended, with attr_1, attr_2 and attr_3 populated where possible and NaN where they don't show up. This seems like a pretty typical workflow for data munging, but I'm stuck.

Thanks in advance.

I think in this case concat is what you want:

In [12]:

pd.concat([df,df1], axis=0, ignore_index=True)
Out[12]:
   attr_1  attr_2  attr_3  id  quantity
0       0       1     NaN   1        20
1       1       1     NaN   2        23
2       1       1     NaN   3        19
3       0       0     NaN   4        19
4       1     NaN       0   5         8
5       0     NaN       1   6        13
6       1     NaN       1   7        20
7       1     NaN       1   8        25

Passing axis=0 stacks the dataframes on top of each other, which I believe is what you want; NaN is produced wherever a column is absent from one of the dataframes.
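As a minimal sketch, the two frames from the question can be rebuilt and stacked as follows. The names df_may, df_jun and mayjundf come from the question, and the values are copied from its printed tables. Column order and dtypes in the result can differ between pandas versions, so no output is shown here.

```python
import pandas as pd

# Data copied from the tables printed in the question.
df_may = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "quantity": [20, 23, 19, 19],
    "attr_1": [0, 1, 1, 0],
    "attr_2": [1, 1, 1, 0],
})
df_jun = pd.DataFrame({
    "id": [5, 6, 7, 8],
    "quantity": [8, 13, 20, 25],
    "attr_1": [1, 0, 1, 1],
    "attr_3": [0, 1, 1, 1],
})

# Stack the rows; columns are aligned by name, and a column missing
# from one frame is filled with NaN for that frame's rows.
mayjundf = pd.concat([df_may, df_jun], axis=0, ignore_index=True)
print(mayjundf)
```

Because ignore_index=True is passed, the result gets a fresh 0..7 index instead of repeating 0..3 from each input.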

The accepted answer will break if there are duplicate headers:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects.

For example, here A has three trial columns, which prevents concat:

A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
#    id  trial  trial  trial
# 0   3      1      4      1

B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
#    id  trial
# 0   5      9
# 1   2      6

pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
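Before concatenating, it may help to check which frames carry repeated headers. The sketch below uses the public Index.duplicated method; the helper name duplicated_columns is hypothetical and not part of pandas.

```python
import pandas as pd

A = pd.DataFrame([[3, 1, 4, 1]], columns=["id", "trial", "trial", "trial"])
B = pd.DataFrame([[5, 9], [2, 6]], columns=["id", "trial"])

def duplicated_columns(df):
    """Hypothetical helper: return the column labels that occur more than once."""
    return df.columns[df.columns.duplicated()].unique().tolist()

dupes_A = duplicated_columns(A)  # labels repeated within A
dupes_B = duplicated_columns(B)  # empty when all headers are unique
print(dupes_A, dupes_B)
```

A frame whose list comes back empty has unique headers; the same check is available as df.columns.is_unique.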

To fix this, deduplicate the column names before concat:

parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})

for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns) 

pd.concat([A, B], ignore_index=True)
#    id  trial  trial.1  trial.2
# 0   3      1        4        1
# 1   5      9      NaN      NaN
# 2   2      6      NaN      NaN

Or as a one-liner, though less readable:

pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)

Note that for pandas <1.3.0, use: parser = pd.io.parsers.ParserBase({})
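The parser approach relies on a private pandas method (note the leading underscore in _maybe_dedup_names), so it may not be available in every pandas version. As an alternative sketch that avoids pandas internals, the renaming can be done by hand. The function dedup_names below is a hypothetical helper written for this example; it mimics the name, name.1, name.2 pattern but does not guard against a column that is already called, say, trial.1.

```python
from collections import Counter

import pandas as pd

def dedup_names(names):
    """Hypothetical helper: rename repeated labels as name.1, name.2, ..."""
    seen = Counter()
    out = []
    for name in names:
        count = seen[name]
        # The first occurrence keeps its name; later ones get a numeric suffix.
        out.append(f"{name}.{count}" if count else name)
        seen[name] += 1
    return out

A = pd.DataFrame([[3, 1, 4, 1]], columns=["id", "trial", "trial", "trial"])
B = pd.DataFrame([[5, 9], [2, 6]], columns=["id", "trial"])

frames = []
for df in [A, B]:
    renamed = df.copy()  # leave the original frames untouched
    renamed.columns = dedup_names(df.columns)
    frames.append(renamed)

result = pd.concat(frames, ignore_index=True)
print(result)
```

Copying each frame before renaming keeps A and B unchanged, unlike the in-place loop above.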

I had this problem today with each of concat, append and merge, and I got around it by adding a sequentially numbered helper column and then doing an outer join:

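# Number the rows of df1 and then df2 consecutively so that no helper value repeats
df1['helper'] = range(1, len(df1) + 1)
df2['helper'] = range(len(df1) + 1, len(df1) + len(df2) + 1)
# An outer join on the helper column keeps every row from both frames
df1.merge(df2, on='helper', how='outer')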
