Map pandas dataframe on multiple keys as columns or multiIndex

Setup: two pandas dataframes; data from df2 needs to be added to df1, as explained below:

  • df1 and df2 are multiIndexed with the same four levels
  • df1 contains more rows than df2
  • df1 has three copies (in rows) of a value per unique combination of three out of the four levels of the index; that is, each row differs only with respect to the 4th level
  • df2 only partially aligns with df1 on the other 3 levels (df2 contains extraneous rows)
  • df2 contains only one column

I want to add values from the one column of df2 to all three copies of the rows in df1 where the three corresponding levels match.

Having learned that 'merging with more than one level overlap on a multiIndex is not implemented' in pandas, I propose to map the values, but have not found a way to map on multiple index levels, or on multiple columns if the index levels are reset to columns:

import pandas as pd

# Plain lists (rather than np.array) keep x and y numeric instead of casting every value to str.
df1 = pd.DataFrame([['Dec', 'NY', 'Ren', 'Q1', 10],
   ['Dec', 'NY', 'Ren', 'Q2', 12],
   ['Dec', 'NY', 'Ren', 'Q3', 14],
   ['Dec', 'FL', 'Mia', 'Q1', 6],
   ['Dec', 'FL', 'Mia', 'Q2', 8],
   ['Dec', 'FL', 'Mia', 'Q3', 17],
   ['Apr', 'CA', 'SC', 'Q1', 1],
   ['Apr', 'CA', 'SC', 'Q2', 2],
   ['Apr', 'CA', 'SC', 'Q3', 3]], columns=['Date', 'State', 'County', 'Quarter', 'x'])

df1.set_index(['Date', 'State', 'County', 'Quarter'], inplace=True)

df2 = pd.DataFrame([['Dec', 'NY', 'Ren', 0.4],
   ['Dec', 'FL', 'Mia', 0.3]], columns=['Date', 'State', 'County', 'y'])

# Index df2 on the three shared levels only, so that y stays a column.
df2.set_index(['Date', 'State', 'County'], inplace=True)

# Attempted mapping; raises KeyError because Date, State and County are index levels, not columns:
df_combined = df1['Date', 'State', 'County'].map(df2)
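
The mapping described above can be sketched without a merge. This is only an illustration, assuming df2 is indexed on Date, State and County (with y as its single column) and that those three-level keys are unique in df2; the names keys and mapped are made up for the example. The idea is to drop the Quarter level from df1's index and use the remaining keys to look up y with reindex:

```python
import pandas as pd

# Reduced version of the question's data; df2 has one extraneous row (FL).
df1 = pd.DataFrame(
    [['Dec', 'NY', 'Ren', 'Q1', 10],
     ['Dec', 'NY', 'Ren', 'Q2', 12],
     ['Apr', 'CA', 'SC', 'Q1', 1]],
    columns=['Date', 'State', 'County', 'Quarter', 'x'],
).set_index(['Date', 'State', 'County', 'Quarter'])

df2 = pd.DataFrame(
    [['Dec', 'NY', 'Ren', 0.4],
     ['Dec', 'FL', 'Mia', 0.3]],
    columns=['Date', 'State', 'County', 'y'],
).set_index(['Date', 'State', 'County'])

# Three-level key for every row of df1 (the Quarter level is dropped).
keys = df1.index.droplevel('Quarter')

# Look up y for each key; keys that are absent from df2 give NaN.
# to_numpy() discards the three-level index so the values are assigned by position.
mapped = df1.assign(y=df2['y'].reindex(keys).to_numpy())
```

Rows of df2 that have no counterpart in df1 are simply never looked up, so they do not appear in the result.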

You can temporarily move the index level of df1 that df2 lacks into a column, so that the remaining levels line up, and then do the join:

df_combined = df1.reset_index(3).join(df2, how='left')

>>> df_combined
           level_3   x    y
Apr CA SC       Q1   1  NaN
       SC       Q2   2  NaN
       SC       Q3   3  NaN
Dec FL Mia      Q1   6  0.3
       Mia      Q2   8  0.3
       Mia      Q3  17  0.3
    NY Ren      Q1  10  0.4
       Ren      Q2  12  0.4
       Ren      Q3  14  0.4

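# 'level_3' is the column name reset_index gives an unnamed level. If the levels are
# named as in the question, the column is 'Quarter' instead and the rename is not needed.
df_combined.set_index('level_3', append=True, inplace=True)
df_combined.index.set_names(None, level=3, inplace=True)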

>>> df_combined
                x    y
Apr CA SC  Q1   1  NaN
           Q2   2  NaN
           Q3   3  NaN
Dec FL Mia Q1   6  0.3
           Q2   8  0.3
           Q3  17  0.3
    NY Ren Q1  10  0.4
           Q2  12  0.4
           Q3  14  0.4

The reset_index method is used to temporarily turn the index level that isn't in df2 into a column so that you can do a normal join. Then turn the column back into an index level when you're done.
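
Put together, the three steps might look like the sketch below. It assumes the named index levels from the question and a df2 indexed on Date, State and County only, so the reset column is called Quarter rather than level_3 and no rename is required afterwards:

```python
import pandas as pd

df1 = pd.DataFrame(
    [['Dec', 'NY', 'Ren', 'Q1', 10],
     ['Dec', 'NY', 'Ren', 'Q2', 12],
     ['Dec', 'NY', 'Ren', 'Q3', 14],
     ['Dec', 'FL', 'Mia', 'Q1', 6],
     ['Dec', 'FL', 'Mia', 'Q2', 8],
     ['Dec', 'FL', 'Mia', 'Q3', 17],
     ['Apr', 'CA', 'SC', 'Q1', 1],
     ['Apr', 'CA', 'SC', 'Q2', 2],
     ['Apr', 'CA', 'SC', 'Q3', 3]],
    columns=['Date', 'State', 'County', 'Quarter', 'x'],
).set_index(['Date', 'State', 'County', 'Quarter'])

df2 = pd.DataFrame(
    [['Dec', 'NY', 'Ren', 0.4],
     ['Dec', 'FL', 'Mia', 0.3]],
    columns=['Date', 'State', 'County', 'y'],
).set_index(['Date', 'State', 'County'])

df_combined = (
    df1.reset_index('Quarter')              # Quarter becomes an ordinary column
       .join(df2, how='left')               # indexes now share the same three levels
       .set_index('Quarter', append=True)   # restore Quarter as the fourth level
)
```

Because the join is a left join, every row of df1 is kept, and rows with no match in df2 (here the Apr/CA/SC rows) get NaN in y.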
