
pandas: merge (join) two data frames on multiple columns

I am trying to join two pandas dataframes using two columns:

new_df = pd.merge(A_df, B_df,  how='left', left_on='[A_c1,c2]', right_on = '[B_c1,c2]')

but got the following error:

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4164)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4028)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13166)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13120)()

KeyError: '[B_1, c2]'

Any idea what the right way to do this would be?

Try this:

new_df = pd.merge(A_df, B_df,  how='left', left_on=['A_c1','c2'], right_on = ['B_c1','c2'])

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

left_on : label or list, or array-like. Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns.

right_on : label or list, or array-like. Field names to join on in right DataFrame or vector/list of vectors per left_on docs.

The problem here is that by wrapping the whole list in quotes you are passing a single string, which pandas treats as one column label ('[A_c1,c2]') that does not exist, hence the KeyError. As @Shijo quoted from the documentation, the function expects a label or a list of labels. When you pass a list of column names for the left and right dataframes, each column name must be quoted individually. With that in mind, we can see why this is incorrect:

new_df = pd.merge(A_df, B_df,  how='left', left_on='[A_c1,c2]', right_on = '[B_c1,c2]')

And this is the correct way of using the function:

new_df = pd.merge(A_df, B_df,  how='left', left_on=['A_c1','c2'], right_on = ['B_c1','c2'])
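
The difference can be reproduced with a minimal sketch. The data and the value columns x and y are made up for illustration; only the key column names come from the question.

```python
import pandas as pd

# Made-up data; only the key column names (A_c1, B_c1, c2) come from the question.
A_df = pd.DataFrame({"A_c1": [1, 2, 3], "c2": ["a", "b", "c"], "x": [10, 20, 30]})
B_df = pd.DataFrame({"B_c1": [1, 2, 4], "c2": ["a", "b", "d"], "y": [0.1, 0.2, 0.4]})

# A quoted list is one string, so pandas looks for a single column with that name.
try:
    pd.merge(A_df, B_df, how="left", left_on="[A_c1,c2]", right_on="[B_c1,c2]")
    string_keys_failed = False
except KeyError:
    string_keys_failed = True

# A real list of labels joins on both columns at once.
new_df = pd.merge(A_df, B_df, how="left",
                  left_on=["A_c1", "c2"], right_on=["B_c1", "c2"])

print(string_keys_failed)
print(new_df)
```

Because this is a left join, every row of A_df is kept, and rows without a matching key pair in B_df get missing values in the columns that come from B_df.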

Another way:

new_df = A_df.merge(B_df, left_on=['A_c1','c2'], right_on = ['B_c1','c2'], how='left')

You can use the following, which is short and easy to understand:

merged_data= df1.merge(df2, on=["column1","column2"])
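
As a quick illustration of on=, here is a sketch with two made-up frames that share the key names column1 and column2 (the data and the left_val and right_val columns are assumptions). With the default how='inner', only key pairs present in both frames are kept.

```python
import pandas as pd

# Made-up frames that share the key column names column1 and column2.
df1 = pd.DataFrame({"column1": [1, 1, 2],
                    "column2": ["a", "b", "a"],
                    "left_val": [10, 11, 12]})
df2 = pd.DataFrame({"column1": [1, 2, 2],
                    "column2": ["a", "a", "b"],
                    "right_val": [100, 200, 201]})

# Only the pairs (1, "a") and (2, "a") exist in both frames.
merged_data = df1.merge(df2, on=["column1", "column2"])
print(merged_data)
```
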
  1. It merges according to the ordering of left_on and right_on, i.e., the i-th element of left_on is matched with the i-th element of right_on.

    In the example below, the code on the top matches A_col1 with B_col1 and A_col2 with B_col2, while the code on the bottom matches A_col1 with B_col2 and A_col2 with B_col1. Evidently, the results are different.

    [image: resource 1]

  2. As can be seen from the above example, if the merge keys have different names, all keys will show up as their individual columns in the merged dataframe. In the example above, in the top dataframe, A_col1 and B_col1 are identical and A_col2 and B_col2 are identical. In the bottom dataframe, A_col1 and B_col2 are identical and A_col2 and B_col1 are identical. Since these are duplicate columns, they are most likely not needed. One way to avoid this problem is to make the merge keys identical from the beginning. See bullet point #3 below.

  3. If left_on and right_on are the same col1 and col2, we can use on=['col1', 'col2']. In this case, no merge keys are duplicated.

     df1.merge(df2, on=['col1', 'col2'])

    [image: resource 3]

  4. You can also merge one side on column names and the other side on its index. For example, in the example below, df1's columns are matched with df2's indices. If the indices are named, as in the example below, you can reference them by name, but if not, you can also use right_index=True (or left_index=True if the left dataframe is the one being merged on index).

     df1.merge(df2, left_on=['A_col1', 'A_col2'], right_index=True) # or df1.merge(df2, left_on=['A_col1', 'A_col2'], right_on=['B_col1', 'B_col2'])

    [image: resource 3]

  5. By using the how= parameter, you can perform LEFT JOIN (how='left'), FULL OUTER JOIN (how='outer') and RIGHT JOIN (how='right') as well. The default is INNER JOIN (how='inner'), as in the examples above.

  6. If you have more than 2 dataframes to merge and the merge keys are the same across all of them, then the join method is more efficient than merge because you can pass a list of dataframes and join on indices. Note that the index names are the same across all dataframes in the example below (col1 and col2). The indices don't have to have names; if they don't, then the number of index levels must match (in the case below there are 2 levels). Again, as in bullet point #1, the match occurs according to the ordering of the index levels.

     df1.join([df2, df3], how='inner').reset_index()

    [image: resource 4]
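
The points above can be sketched end to end with small made-up frames. The data, the value columns (A_val, B_val, C_val) and the third frame are assumptions for illustration; only the key column names follow the examples above.

```python
import pandas as pd

# Made-up frames; only the key column names follow the examples above.
df1 = pd.DataFrame({"A_col1": [1, 2], "A_col2": [2, 1], "A_val": ["p", "q"]})
df2 = pd.DataFrame({"B_col1": [1, 2], "B_col2": [2, 1], "B_val": ["r", "s"]})

# Point 1: keys are paired by position, so swapping right_on changes the match.
same_order = df1.merge(df2, left_on=["A_col1", "A_col2"], right_on=["B_col1", "B_col2"])
swapped = df1.merge(df2, left_on=["A_col1", "A_col2"], right_on=["B_col2", "B_col1"])

# Points 2 and 3: differently named keys all appear in the result (see same_order);
# renaming them to common names first and using on= avoids the duplicates.
df2_renamed = df2.rename(columns={"B_col1": "A_col1", "B_col2": "A_col2"})
deduped = df1.merge(df2_renamed, on=["A_col1", "A_col2"])

# Point 4: match df1's columns against the levels of df2's index.
df2_indexed = df2.set_index(["B_col1", "B_col2"])
on_index = df1.merge(df2_indexed, left_on=["A_col1", "A_col2"], right_index=True)

# Point 6: join several frames on a shared two-level index in one call.
left = df1.rename(columns={"A_col1": "col1", "A_col2": "col2"}).set_index(["col1", "col2"])
right = df2.rename(columns={"B_col1": "col1", "B_col2": "col2"}).set_index(["col1", "col2"])
third = pd.DataFrame({"col1": [1], "col2": [2], "C_val": ["t"]}).set_index(["col1", "col2"])
joined = left.join([right, third], how="inner").reset_index()

for name, frame in [("same_order", same_order), ("swapped", swapped),
                    ("deduped", deduped), ("on_index", on_index), ("joined", joined)]:
    print(name)
    print(frame, end="\n\n")
```

Printing same_order next to swapped shows the effect of key ordering, and comparing same_order with deduped shows how common key names remove the redundant key columns.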

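This works for me, for n xls files:

import pandas as pd

# all_reports_paths is a list with the path of every file
# X is a placeholder for the number of header/footer rows to skip in your files
frames = []
for path in all_reports_paths:
    frames.append(pd.read_excel(path, skiprows=X, skipfooter=X))

# columns is the list of expected column names
df_glob = pd.DataFrame(columns=columns)

# stack every file's rows under the (initially empty) global frame
for frame in frames:
    df_glob = pd.concat([df_glob, frame], axis=0)

# finally, df_glob contains all the data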
