Recovering DataFrame MultiIndex (from both row and column) after groupby
I have a DataFrame with a MultiIndex on both the rows and the columns, like this:
                     Value                     Size
                         A          B    Market Cap
2019-07-01 AAPL  89.583458   9.328360  2.116356e+06
           AMGN  49.828466  10.058943  1.395518e+05
2019-10-01 AAPL  74.297570  11.237253  2.116356e+06
           AMGN  56.841946  10.237481  1.395518e+05
2019-12-31 AAPL  97.435257  14.736749  2.116356e+06
           AMGN  71.400903  12.859612  1.395518e+05
I want to apply a function to each of its columns, separately for each date (so 89.583458 and 49.828466 go together, 9.328360 and 10.058943 go together, and so forth):
winsorized_df = pipeline_df.groupby(level=0, axis=0).apply(
    lambda level_0_col: level_0_col.groupby(level=1, axis=1).apply(
        lambda series: mstats.winsorize(a=series, limits=winsorize_bounds))
)
This gives me:
Market Cap ... B
2019-07-01 [[139551.76568603513], [139551.76568603513]] ... [[49.828465616227064], [49.828465616227064]]
2019-10-01 [[139551.76568603513], [139551.76568603513]] ... [[56.84194615992103], [56.84194615992103]]
2019-12-31 [[139551.76568603513], [139551.76568603513]] ... [[71.40090272484755], [71.40090272484755]]
Now I need to recover the lost indices to get back the same structure as the original, but setting as_index=False, unstacking, and using pd.MultiIndex.from_frame all failed. Any ideas? Perhaps there is a better way to get exactly that from the groupby call?
The problem is that winsorize returns a NumPy array, so you are replacing a DataFrame with a NumPy array (which is why you see [[...]] in your output). Instead, you should replace the values of the DataFrame and keep the DataFrame itself, so its index and columns survive. Here is an example:
import pandas
from scipy.stats.mstats import winsorize

# Recreating your dataframe
data = [
    {"date": "2019-07-01", "group": "AAPL", "A": 89.583458, "B": 9.328360, "Market Cap": 2.116356e+06},
    {"date": "2019-07-01", "group": "AMGN", "A": 49.828466, "B": 10.058943, "Market Cap": 1.395518e+05},
    {"date": "2019-10-01", "group": "AAPL", "A": 74.297570, "B": 11.237253, "Market Cap": 2.116356e+06},
    {"date": "2019-10-01", "group": "AMGN", "A": 56.841946, "B": 10.237481, "Market Cap": 1.395518e+05},
    {"date": "2019-12-31", "group": "AAPL", "A": 97.435257, "B": 14.736749, "Market Cap": 2.116356e+06},
    {"date": "2019-12-31", "group": "AMGN", "A": 71.400903, "B": 12.859612, "Market Cap": 1.395518e+05},
]
# Row MultiIndex: (date, ticker)
index = [
    [pandas.to_datetime(line.get("date")) for line in data],
    [line.get("group") for line in data],
]
# Column MultiIndex: (category, field)
columns = [
    ["Value", "Value", "Size"],
    ["A", "B", "Market Cap"]
]
df = pandas.DataFrame(
    data=[[line.get("A"), line.get("B"), line.get("Market Cap")] for line in data],
    index=index,
    columns=columns,
)

# Your lambda function in a separate definition
def process_group(group):
    # Nested function applied to each column sub-group
    def _sub(sub):
        # winsorize returns a numpy array, sub is a dataframe;
        # sub[:] replaces the "values" of the dataframe, not the dataframe itself
        sub[:] = winsorize(a=sub, limits=[0.4, 0.6])  # I didn't know your limits so I've guessed...
        return sub
    # Return the result of the processing on the nested group
    return group.groupby(level=1, axis=1).apply(_sub)

# Process the groups
df = df.groupby(level=0, axis=0).apply(process_group)
Output:
                     Value                     Size
                         A          B    Market Cap
2019-07-01 AAPL  49.828466   9.328360      139551.8
           AMGN  49.828466   9.328360      139551.8
2019-10-01 AAPL  56.841946  10.237481      139551.8
           AMGN  56.841946  10.237481      139551.8
2019-12-31 AAPL  71.400903  12.859612      139551.8
           AMGN  71.400903  12.859612      139551.8
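For readers on recent pandas releases, where passing axis=1 to groupby is deprecated, the same idea (keep the DataFrame, replace only its values) can probably be expressed with groupby(...).transform. The following is a minimal sketch, not the answerer's method: the helper name winsorize_column is hypothetical, and the limits [0.4, 0.6] are the same guess used in the example above. Because transform returns a result indexed like its input, both the row and the column MultiIndex should survive unchanged.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Rebuild the example frame with a MultiIndex on rows and on columns.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2019-07-01", "2019-10-01", "2019-12-31"]), ["AAPL", "AMGN"]]
)
columns = pd.MultiIndex.from_tuples(
    [("Value", "A"), ("Value", "B"), ("Size", "Market Cap")]
)
values = [
    [89.583458, 9.328360, 2.116356e06],
    [49.828466, 10.058943, 1.395518e05],
    [74.297570, 11.237253, 2.116356e06],
    [56.841946, 10.237481, 1.395518e05],
    [97.435257, 14.736749, 2.116356e06],
    [71.400903, 12.859612, 1.395518e05],
]
df = pd.DataFrame(values, index=index, columns=columns)


def winsorize_column(series):
    """Hypothetical helper: winsorize one column of one date group.

    winsorize returns a NumPy masked array, so the result is wrapped
    back into a Series that carries the original index labels.
    """
    clipped = winsorize(series.to_numpy(), limits=[0.4, 0.6])  # limits are a guess
    return pd.Series(np.asarray(clipped), index=series.index)


# transform calls the helper once per column within each date group
# and reassembles a frame shaped like the input.
result = df.groupby(level=0).transform(winsorize_column)
print(result)
```

With only two tickers per date and these guessed limits, each pair collapses to its smaller value, which matches the output shown above. Real limits would normally be much tighter.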