Pandas multi-index dataframe merge issue

I want to merge two dataframes, df1 and df2, which have multi-index columns:

import pandas as pd

mi1 = pd.MultiIndex.from_tuples([('id', '0'), ('car', '2018')], names=['variable', 'year'])
mi2 = pd.MultiIndex.from_tuples([('id', '0'), ('car', '2019')], names=['variable', 'year'])
df1 = pd.DataFrame([['alice', 1], ['bob', 2]], columns=mi1)
df2 = pd.DataFrame([['alice', 2], ['bob', 3]], columns=mi2)

In both df1 and df2, the first column level refers to a variable name, while the second level refers to a year. Some variables, like 'id' in this example, are not related to a specific year, hence the placeholder value '0', which has no significance here.

df1
variable     id  car
year          0 2018
0         alice    1
1           bob    2

df2
variable     id  car
year          0 2019
0         alice    2
1           bob    3

I would like to merge df1 and df2 to get:

variable     id  car  car
year          0 2018 2019
0         alice    1    2
1           bob    2    3

The problem is that the merge function using the 'id' column, applied to df1 and df2, returns an error message:

df3 = pd.merge(df1, df2, on=('id', '0'), how="outer")

Traceback (most recent call last):
  File "<input>", line 5, in <module>
  File "C:\Users\AA\AppData\Roaming\Python\Python37\site-packages\pandas\core\reshape\merge.py", line 87, in merge
    validate=validate,
  File "C:\Users\AA\AppData\Roaming\Python\Python37\site-packages\pandas\core\reshape\merge.py", line 652, in __init__
    ) = self._get_merge_keys()
  File "C:\Users\AA\AppData\Roaming\Python\Python37\site-packages\pandas\core\reshape\merge.py", line 1005, in _get_merge_keys
    right_keys.append(right._get_label_or_level_values(rk))
  File "C:\Users\AA\AppData\Roaming\Python\Python37\site-packages\pandas\core\generic.py", line 1580, in _get_label_or_level_values
    f"The {label_axis_name} label '{key}' "
ValueError: The column label 'id' is not unique.
For a multi-index, the label must be a tuple with elements corresponding to each level.

It is very surprising - and frustrating - because the 'on' parameter of the merge function is given a tuple as its argument, so there shouldn't be an issue here. And I need to use merge because, in reality, the dataframes to merge are more complex and don't have the same id columns.

Can you tell me how to solve this and merge two dataframes with multi-index columns?

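The problem here is that on accepts either a single column label or a list of labels, so a bare tuple is ambiguous.

When you pass on=('id', '0'), pandas thinks you want to merge on two separate keys, 'id' and '0'. That is why the error complains about the label 'id': on its own it does not identify a single column in a multi-index. Writing on=[('id', '0')] removes the ambiguity. It is a list containing one key to merge on, and that key is a tuple with one label per level of the multi-index: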

df3 = pd.merge(df1, df2, on=[('id', '0')], how="outer")
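To check the result, here is a minimal self-contained sketch that rebuilds df1 and df2 from the question and applies the fix. It only uses the names already defined in the question; the final lookup by a full (variable, year) tuple is just one way to inspect a merged column.

```python
import pandas as pd

# Rebuild the two frames from the question.
mi1 = pd.MultiIndex.from_tuples([('id', '0'), ('car', '2018')], names=['variable', 'year'])
mi2 = pd.MultiIndex.from_tuples([('id', '0'), ('car', '2019')], names=['variable', 'year'])
df1 = pd.DataFrame([['alice', 1], ['bob', 2]], columns=mi1)
df2 = pd.DataFrame([['alice', 2], ['bob', 3]], columns=mi2)

# A list with ONE key; the key itself is a tuple with one label per column level.
df3 = pd.merge(df1, df2, on=[('id', '0')], how="outer")

print(df3)

# Individual columns are still addressed by their full (variable, year) tuple.
car_2019 = df3[('car', '2019')].tolist()
print(car_2019)
```

The same list-of-tuples form should also apply to left_on and right_on when the key columns are named differently in the two dataframes, although that case is not tested here.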
