Recovering a DataFrame MultiIndex (rows and columns) after groupby

Question

I have a DataFrame with a MultiIndex on both the rows and the columns, like this:

                                  Value              Size
                           A               B      Market Cap
2019-07-01 AAPL         89.583458      9.328360  2.116356e+06
           AMGN         49.828466     10.058943  1.395518e+05
2019-10-01 AAPL         74.297570     11.237253  2.116356e+06
           AMGN         56.841946     10.237481  1.395518e+05
2019-12-31 AAPL         97.435257     14.736749  2.116356e+06
           AMGN         71.400903     12.859612  1.395518e+05

I want to apply a function to each of its columns, for each date (so 89.583458 and 49.828466 go together, 9.328360 and 10.058943 go together, and so forth):

winsorized_df = pipeline_df.groupby(level=0, axis=0).apply(
                lambda level_0_col: level_0_col.groupby(level=1, axis=1).apply(
                    lambda series: mstats.winsorize(a=series, limits=winsorize_bounds))
            )

This gives me:

                                              Market Cap  ...                             B
2019-07-01  [[139551.76568603513], [139551.76568603513]]  ...  [[49.828465616227064], [49.828465616227064]]
2019-10-01  [[139551.76568603513], [139551.76568603513]]  ...    [[56.84194615992103], [56.84194615992103]]
2019-12-31  [[139551.76568603513], [139551.76568603513]]  ...    [[71.40090272484755], [71.40090272484755]]

Now I need to recover the lost indices to get back the same structure as the original. I tried setting as_index=False, unstacking, and using pd.MultiIndex.from_frame, but none of them worked. Any ideas? Perhaps there's a better way to get exactly that from the groupby call?

Answer 1

The problem is that winsorize returns a NumPy array, so you're replacing each DataFrame with a NumPy array (which is why you see [[...]] in your output). Instead, you should replace the values of the DataFrame and keep the DataFrame itself. Here is an example:

import pandas
from scipy.stats.mstats import winsorize

# Recreating your dataframe
data = [
    {"date": "2019-07-01", "group": "AAPL", "A": 89.583458, "B": 9.328360, "Market Cap": 2.116356e+06},
    {"date": "2019-07-01", "group": "AMGN", "A": 49.828466, "B": 10.058943, "Market Cap": 1.395518e+05},
    {"date": "2019-10-01", "group": "AAPL", "A": 74.297570, "B": 11.237253, "Market Cap": 2.116356e+06},
    {"date": "2019-10-01", "group": "AMGN", "A": 56.841946, "B": 10.237481, "Market Cap": 1.395518e+05},
    {"date": "2019-12-31", "group": "AAPL", "A": 97.435257, "B": 14.736749, "Market Cap": 2.116356e+06},
    {"date": "2019-12-31", "group": "AMGN", "A": 71.400903, "B": 12.859612, "Market Cap": 1.395518e+05},
]
index = [
    [pandas.to_datetime(line.get("date")) for line in data],
    [line.get("group") for line in data],
]
columns = [
    ["Value", "Value", "Size"],
    ["A", "B", "Market Cap"]
]
df = pandas.DataFrame(data=[[line.get("A"), line.get("B"), line.get("Market Cap")] for line in data], index=index, columns=columns)


# Your lambda function in a separate definition
def process_group(group):

    # Nested
    def _sub(sub):
        # winsorize returns a NumPy (masked) array, while sub is a DataFrame.
        # Assigning to sub[:] replaces the DataFrame's values, not the DataFrame itself.
        sub[:] = winsorize(a=sub, limits=[0.4, 0.6])  # I didn't know your limits so I've guessed...
        return sub

    # Return the result of the processing on the nested group
    return group.groupby(level=1, axis=1).apply(_sub)

# Process the groups
df = df.groupby(level=0, axis=0).apply(process_group)

Output:

                     Value                  Size
                         A          B Market Cap
2019-07-01 AAPL  49.828466   9.328360   139551.8
           AMGN  49.828466   9.328360   139551.8
2019-10-01 AAPL  56.841946  10.237481   139551.8
           AMGN  56.841946  10.237481   139551.8
2019-12-31 AAPL  71.400903  12.859612   139551.8
           AMGN  71.400903  12.859612   139551.8
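
As a possible alternative, here is a minimal sketch that avoids the nested groupby and leaves both MultiIndexes in place by winsorizing one column at a time with groupby(level=0).transform. The helper name winsorize_per_date and the shortened sample frame are made up for illustration, and the limits are the same guessed values as above. It may also be worth knowing that groupby(axis=1) is deprecated in recent pandas releases (2.1 and later), which this variant does not rely on.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize


def winsorize_per_date(frame, limits):
    """Winsorize every column within each date (row level 0).

    Hypothetical helper: the result keeps the row and column
    MultiIndex of the input frame.
    """
    out = frame.copy()
    for col in frame.columns:
        # transform returns a Series aligned with the original row index
        out[col] = frame[col].groupby(level=0).transform(
            lambda s: pd.Series(
                # winsorize returns a masked array; keep only the plain data
                np.ma.getdata(winsorize(s.to_numpy(), limits=limits)),
                index=s.index,
            )
        )
    return out


# Shortened sample frame (first two dates of the data above)
index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2019-07-01", "2019-10-01"]), ["AAPL", "AMGN"]]
)
columns = pd.MultiIndex.from_tuples(
    [("Value", "A"), ("Value", "B"), ("Size", "Market Cap")]
)
df = pd.DataFrame(
    [
        [89.583458, 9.328360, 2.116356e06],
        [49.828466, 10.058943, 1.395518e05],
        [74.297570, 11.237253, 2.116356e06],
        [56.841946, 10.237481, 1.395518e05],
    ],
    index=index,
    columns=columns,
)

# Guessed limits, as in the example above
result = winsorize_per_date(df, limits=(0.4, 0.6))
print(result)
```

Because transform is designed to return an object indexed like its input, there is nothing to recover afterwards: result.index and result.columns should match those of df.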
