
How do you merge two Pandas dataframes with different column index levels?

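I want to concatenate two dataframes that share the same index but have different column levels. One dataframe has hierarchical (MultiIndex) columns, the other does not.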

print(df1)

              A_1               A_2               A_3                .....
              Value_V  Value_y  Value_V  Value_y  Value_V  Value_y

instance200   50       0        6500     1        50       0
instance201   100      0        6400     1        50       0

And the other one:

print(df2)

              PV         Estimate

instance200   2002313    1231233
instance201   2134124    1124724

The result should look like this:

             PV        Estimate   A_1               A_2               A_3                .....
                                  Value_V  Value_y  Value_V  Value_y  Value_V  Value_y

instance200  2002313   1231233    50       0        6500     1        50       0
instance201  2134124   1124724    100      0        6400     1        50       0

But a merge or concat on the frames gives me a df with a one-dimensional column index, like this:

             PV        Estimate   (A_1,Value_V) (A_1,Value_y) (A_2,Value_V) (A_2,Value_y)  .....


instance200  2002313   1231233    50             0             6500         1
instance201  2134124   1124724    100            0             6400         1 

How can I keep the hierarchical index from df1?

Maybe use good ole assignment:

df3 = df1.copy()
df3[df2.columns] = df2

yields

                A_1             A_2             A_3               PV Estimate
            Value_V Value_y Value_V Value_y Value_V Value_y                  
instance200      50       0    6500       1      50       0  2002313  1231233
instance201     100       0    6400       1      50       0  2134124  1124724
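
The assignment approach can be tried end to end with the sketch below. The frames are rebuilt by hand from the values printed in the question (only the columns shown there), so the construction code is an assumption and not part of the original answer. When a plain label is assigned into a frame whose columns are a MultiIndex, pandas pads the missing lower level with an empty string, which is why the hierarchy survives.

```python
import pandas as pd

idx = ["instance200", "instance201"]

# df1: two-level columns, rebuilt from the values shown in the question
df1 = pd.DataFrame(
    [[50, 0, 6500, 1, 50, 0], [100, 0, 6400, 1, 50, 0]],
    index=idx,
    columns=pd.MultiIndex.from_product(
        [["A_1", "A_2", "A_3"], ["Value_V", "Value_y"]]
    ),
)

# df2: ordinary single-level columns
df2 = pd.DataFrame(
    [[2002313, 1231233], [2134124, 1124724]],
    index=idx,
    columns=["PV", "Estimate"],
)

# copy first so df1 itself is left untouched, then assign df2's columns
df3 = df1.copy()
df3[df2.columns] = df2

print(df3.columns.nlevels)        # number of column levels after assignment
print(df3.columns.tolist()[-2:])  # the two newly added column keys
```

Note that the new columns land at the end of the frame; this approach does not let you choose where they are placed.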

You can do this by giving df2 the same number of column levels as df1:

In [11]: df1
Out[11]:
                A_1             A_2             A_3
            Value_V Value_y Value_V Value_y Value_V Value_y
instance200      50       0    6500       1      50       0
instance201     100       0    6400       1      50       0

In [12]: df2
Out[12]:
                  PV  Estimate
instance200  2002313   1231233
instance201  2134124   1124724

In [13]: df2.columns = pd.MultiIndex.from_arrays([df2.columns, [None] * len(df2.columns)])

In [14]: df2
Out[14]:
                  PV Estimate
                 NaN      NaN
instance200  2002313  1231233
instance201  2134124  1124724

Now you can concat without the column names being mangled:

In [15]: pd.concat([df1, df2], axis=1)
Out[15]:
                A_1             A_2             A_3               PV Estimate
            Value_V Value_y Value_V Value_y Value_V Value_y      NaN      NaN
instance200      50       0    6500       1      50       0  2002313  1231233
instance201     100       0    6400       1      50       0  2134124  1124724

Note: to have the df2 columns come first, use pd.concat([df2, df1], axis=1).
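
As a variant of the steps above, here is a sketch that pads df2 with an empty string instead of None, so the second header row prints blank rather than NaN. The frames are rebuilt by hand from the printed values, and MultiIndex.from_product is swapped in for the from_arrays call used in the answer; neither is part of the original answer.

```python
import pandas as pd

idx = ["instance200", "instance201"]

# rebuilt by hand from the output shown in the answer (an assumption)
df1 = pd.DataFrame(
    [[50, 0, 6500, 1, 50, 0], [100, 0, 6400, 1, 50, 0]],
    index=idx,
    columns=pd.MultiIndex.from_product(
        [["A_1", "A_2", "A_3"], ["Value_V", "Value_y"]]
    ),
)
df2 = pd.DataFrame(
    [[2002313, 1231233], [2134124, 1124724]],
    index=idx,
    columns=["PV", "Estimate"],
)

# give a copy of df2 a second, empty-string column level
padded = df2.copy()
padded.columns = pd.MultiIndex.from_product([df2.columns, [""]])

# both frames now have two column levels, so concat keeps the hierarchy
result = pd.concat([padded, df1], axis=1)
print(result)
```

Working on a copy leaves df2 with its original flat columns, which is convenient if it is still used elsewhere.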


That said, I'm not sure I can think of a use case for this; keeping them as separate DataFrames may actually be a simpler solution...!

Update (January 2020): I built a function for this purpose, shown below:

import pandas as pd


def concat(df1, df2):
    """
    Concatenate two dataframes df1 and df2 along the columns, even if the two
    dataframes have a different number of hierarchical column levels.

    If one dataframe has fewer column levels than the other, levels of blank
    strings are added above its existing column levels.
    """
    n_levels1 = df1.columns.nlevels
    n_levels2 = df2.columns.nlevels
    diff = abs(n_levels2 - n_levels1)

    if diff == 0:
        # same number of levels: simply concat as normal
        return pd.concat([df1, df2], axis=1)

    def pad_columns(df):
        # `diff` levels of blank strings, followed by the existing levels
        blanks = [[""] * len(df.columns)] * diff
        existing = [
            df.columns.get_level_values(i) for i in range(df.columns.nlevels)
        ]
        padded = df.copy()
        padded.columns = pd.MultiIndex.from_arrays(blanks + existing)
        return padded

    if n_levels1 < n_levels2:
        # df1 has fewer levels: pad it with blank strings, then concat
        return pd.concat([pad_columns(df1), df2], axis=1)

    # otherwise df2 has fewer levels: pad it instead
    return pd.concat([df1, pad_columns(df2)], axis=1)

Now, if we supply the dataframes

gender  f  m
            
n       2  1
y       2  2

gender        f                         m             
age         old        young          old        young
location london paris london paris london paris london
                                                      
n             1     0      1     0      0     1      0
y             0     1      0     1      1     0      1

we get

             f                         m                   
            old        young          old        young      
         london paris london paris london paris london  f  m
                                                            
n             1     0      1     0      0     1      0  2  1
y             0     1      0     1      1     0      1  2  2

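Note that in the future it might be nice to join on the gender category so that f and m end up on the same level, but this function is mainly meant for joining dataframes with completely different columns.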

I made a wrapper for the pandas.concat function that accepts dataframes with an unequal number of levels.

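The empty levels are added from below. The advantage is that this allows accessing a series with df_cols.c (in df_cols below), and that, when printed, it is clear that 'c' is not a sublevel of ('CC', 'one').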

import numpy as np
import pandas as pd


def concat(dfs, axis=0, *args, **kwargs):
    """
    Wrapper for `pandas.concat`; concatenate pandas objects even if they have
    an unequal number of levels on the concatenation axis.

    Levels containing empty strings are added from below (when concatenating along
    columns) or right (when concatenating along rows) to match the maximum number
    found in the dataframes.
    
    Parameters
    ----------
    dfs : Iterable
        Dataframes that must be concatenated.
    axis : int, optional
        Axis along which concatenation must take place. The default is 0.

    Returns
    -------
    pd.DataFrame
        Concatenated Dataframe.
    
    Notes
    -----
    Any additional positional and keyword arguments are passed on to the `pandas.concat` function.
    
    See also
    --------
    pandas.concat
    """
    def index(df):
        return df.columns if axis==1 else df.index
    
    def add_levels(df):
        need = want - index(df).nlevels
        if need > 0:
            df = pd.concat([df], keys=[('',)*need], axis=axis) # prepend empty levels
            for i in range(want-need): # move empty levels to bottom
                df = df.swaplevel(i, i+need, axis=axis) 
        return df
    
    want = np.max([index(df).nlevels for df in dfs])    
    dfs = [add_levels(df) for df in dfs]
    return pd.concat(dfs, axis=axis, *args, **kwargs)
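
The core of add_levels is a two-step idiom: wrapping a single frame in pd.concat with keys prepends a level, and swaplevel then moves that level below the existing one. A minimal sketch for the one-extra-level case, using the values of df2 from the test further down (the variable names on_top and below are illustrative and not from the answer):

```python
import pandas as pd

df = pd.DataFrame({"c": [68, 35, 68], "d": [49, 53, 75], "e": [69, 71, 54]})

# Step 1: passing keys to pd.concat puts an empty-string level on top
on_top = pd.concat([df], keys=[""], axis=1)

# Step 2: swap the two levels so the empty level sits underneath
below = on_top.swaplevel(0, 1, axis=1)

print(on_top.columns.tolist())
print(below.columns.tolist())
```

With more than one missing level, the wrapper passes a tuple of empty strings as the key and repeats the swap once per original level.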

Hope this helps someone.

Test:

df1

   AA      BB      CC    
  one     one     one    
    a   b   a   b   a   b
0  91  63   2  59  26  93
1  34   4  73  55  16  66
2   2   6   9  15  51  95

df2

    c   d   e
0  68  49  69
1  35  53  71
2  68  75  54


df3

       c   d   e
i  x  27  83  53
   y  54  51   9
   z  41   1  24
ii x  44  76  54
   y  76  85  21
   z  83  82   6


df_cols = concat([df1, df2], axis=1)

df_cols

   AA      BB      CC       c   d   e
  one     one     one                
    a   b   a   b   a   b            
0  91  63   2  59  26  93  68  49  69
1  34   4  73  55  16  66  35  53  71
2   2   6   9  15  51  95  68  75  54


df_rows = concat([df2, df3])

df_rows

       c   d   e
0     68  49  69
1     35  53  71
2     68  75  54
i  x  27  83  53
   y  54  51   9
   z  41   1  24
ii x  44  76  54
   y  76  85  21
   z  83  82   6

