简体   繁体   中英

How do I split-apply-combine on a dataframe with multi-index columns?

df = pd.DataFrame([[100,90,80,70,36,45], [101,78,65,88,55,78], [92,77,42,79,43,32], [103,98,76,54,45,65]], index = pd.date_range(start='2022-01-01' ,periods=4))
df.columns = pd.MultiIndex.from_tuples((("mkf", "Open"), ("mkf", "Close"), ("tdf", "Open"), ("tdf","Close"), ("ghi","Open"), ("ghi", "Close")))
df

            mkf        tdf        ghi
           Open Close Open Close Open Close
2022-01-01  100    90   80    70   36    45
2022-01-02  101    78   65    88   55    78
2022-01-03   92    77   42    79   43    32
2022-01-04  103    98   76    54   45    65

The above data frame is composed of three smaller data frames that are identical in structure (ie same columns and indices). You can think of it as a large data frame that has been concatenated from three smaller ones. In reality there are thousands of these smaller df's concatenated together but for simplicity I've kept it at three.

I'd like to be able to use the split-apply-combine principle to create a new column called r in each of the three sub data frames.

The split would be based on df.columns.levels[0] . The apply would be np.log(['Close']).diff() The combined data frame would produce the below:

            mkf                  tdf                  ghi
           Open Close         r Open Close         r Open Close         r
2022-01-01  100    90       NaN   80    70       NaN   36    45       NaN
2022-01-02  101    78 -0.143101   65    88  0.228842   55    78  0.550046
2022-01-03   92    77 -0.012903   42    79 -0.107889   43    32 -0.890973
2022-01-04  103    98  0.241162   76    54 -0.380464   45    65  0.708651

You could use:

(pd.concat([df, 
            (np.log(df.xs('Close', level=1, axis=1, drop_level=False))
               .rename(columns={'Close': 'r'}, level=1).diff())
            ], axis=1)
  .sort_index(axis=1, level=0, sort_remaining=False)
)

or with a groupby approach:

(df.stack(0)
   .assign(r=lambda d:
           d.groupby(level=1)['Close'].apply(lambda s: np.log(s).diff())
           )
   .unstack(1).swaplevel(axis=1)
   .sort_index(axis=1, level=0, sort_remaining=False)
)

output:

            ghi                  mkf                  tdf                
           Open Close         r Open Close         r Open Close         r
2022-01-01   36    45       NaN  100    90       NaN   80    70       NaN
2022-01-02   55    78  0.550046  101    78 -0.143101   65    88  0.228842
2022-01-03   43    32 -0.890973   92    77 -0.012903   42    79 -0.107889
2022-01-04   45    65  0.708651  103    98  0.241162   76    54 -0.380464

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM