Python, Pandas: Use the GroupBy.groups description to apply it to another grouping
Consider a DataFrame that contains one row of two values for each day of January 2010:
from datetime import datetime as dt
import numpy as np
import pandas as pd
date_range = pd.date_range(dt(2010, 1, 1), dt(2010, 1, 31), freq='1D')
df = pd.DataFrame(data=np.random.rand(len(date_range), 2), index=date_range)
I split that DataFrame into a list of 5 DataFrames, each containing one week's worth of data from the original:
df_weeks = [g for n, g in df.groupby(pd.TimeGrouper('W'))]
If I type
df.groupby(pd.TimeGrouper('W')).groups
I can see a dict describing how the groups are split:
{Timestamp('2010-01-03 00:00:00', freq='W-SUN'): 3,
Timestamp('2010-01-10 00:00:00', freq='W-SUN'): 10,
Timestamp('2010-01-17 00:00:00', freq='W-SUN'): 17,
Timestamp('2010-01-24 00:00:00', freq='W-SUN'): 24,
Timestamp('2010-01-31 00:00:00', freq='W-SUN'): 31}
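For readers on a recent pandas release, here is a minimal sketch of the same weekly split. It assumes `pd.Grouper(freq='W')` as the stand-in for `pd.TimeGrouper('W')`, which was removed in later pandas versions. The seeded random generator and the names `weekly`, `week_ends` and `sizes` are illustrative only and not part of the original question.

```python
import numpy as np
import pandas as pd

# One row of two random values per day of January 2010.
date_range = pd.date_range("2010-01-01", "2010-01-31", freq="D")
rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
df = pd.DataFrame(rng.random((len(date_range), 2)), index=date_range)

# Assumption: pd.Grouper(freq="W") replaces the older pd.TimeGrouper("W").
weekly = df.groupby(pd.Grouper(freq="W"))

df_weeks = [g for _, g in weekly]           # one DataFrame per week
week_ends = [label for label, _ in weekly]  # week-ending (Sunday) labels
sizes = [len(g) for g in df_weeks]          # rows per week
print(sizes)
```

The integers in the `.groups` dict above (3, 10, 17, 24, 31) appear to be the cumulative end positions of each weekly bin, which is consistent with the per-week row counts computed here.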
Now consider another time series that has sparser data yet overlaps with the first one:
observations = pd.DataFrame(
    data=np.random.rand(5, 2),
    index=(dt(2010, 1, 12), dt(2010, 1, 18), dt(2010, 1, 20), dt(2010, 1, 22), dt(2010, 1, 28)))
If I run the same code
obs_weeks = [g for n, g in observations.groupby(pd.TimeGrouper('W'))]
it obviously returns fewer DataFrames in the list, since the data covers a shorter span.
observations.groupby(pd.TimeGrouper('W')).groups
returns:
{Timestamp('2010-01-17 00:00:00', freq='W-SUN'): 1,
Timestamp('2010-01-24 00:00:00', freq='W-SUN'): 4,
Timestamp('2010-01-31 00:00:00', freq='W-SUN'): 5}
But is there a way to reuse the groups of the first DataFrame's GroupBy and apply them to the second one? That is, in this specific case, ending up with a variable obs_weeks containing 5 DataFrames that span the same time ranges as those in df_weeks, 2 of them being empty?
One simple solution to your problem is to make sure the observations DataFrame contains all the dates that the df DataFrame does. You can do this with the reindex method. You will then have exactly the same groups. You can also use resample('W') instead of groupby(pd.TimeGrouper('W')).
obs2 = observations.reindex(df.index)
obs2.resample('W').groups
{Timestamp('2010-01-03 00:00:00', freq='W-SUN'): 3,
Timestamp('2010-01-10 00:00:00', freq='W-SUN'): 10,
Timestamp('2010-01-17 00:00:00', freq='W-SUN'): 17,
Timestamp('2010-01-24 00:00:00', freq='W-SUN'): 24,
Timestamp('2010-01-31 00:00:00', freq='W-SUN'): 31}
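To get back to the list the question asked for (5 DataFrames, 2 of them empty), one possible sketch is to group the reindexed frame by week and then drop the all-NaN filler rows that reindex introduced. This assumes a recent pandas, where `pd.Grouper(freq='W')` takes the place of `pd.TimeGrouper('W')`. The `dropna(how='all')` step and the name `obs_dates` are additions for illustration, not part of the original answer.

```python
from datetime import datetime as dt

import numpy as np
import pandas as pd

date_range = pd.date_range(dt(2010, 1, 1), dt(2010, 1, 31), freq="D")
df = pd.DataFrame(np.random.rand(len(date_range), 2), index=date_range)

obs_dates = [dt(2010, 1, 12), dt(2010, 1, 18), dt(2010, 1, 20),
             dt(2010, 1, 22), dt(2010, 1, 28)]
observations = pd.DataFrame(np.random.rand(5, 2), index=obs_dates)

# Align observations to df's daily index; days without data become NaN rows.
obs2 = observations.reindex(df.index)

# Group by week, then remove the filler rows so only real observations remain.
# Weeks without any observation end up as empty DataFrames.
obs_weeks = [g.dropna(how="all")
             for _, g in obs2.groupby(pd.Grouper(freq="W"))]

print([len(g) for g in obs_weeks])
```

Note that `dropna(how='all')` would also discard any genuine observation whose values are all NaN, so this only fits data where that cannot happen.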
And if we do a simple aggregation like sum, we can see the results for both frames:
df.resample('W').sum()
                   0         1
2010-01-03  1.990558  2.555191
2010-01-10  2.707777  3.771756
2010-01-17  2.799897  3.353363
2010-01-24  3.165479  2.778870
2010-01-31  4.946577  3.394211
And now with obs2, which has 2 missing groups:
obs2.resample('W').sum()
                   0         1
2010-01-03       NaN       NaN
2010-01-10       NaN       NaN
2010-01-17  0.172341  0.137136
2010-01-24  1.752472  2.375306
2010-01-31  0.711525  0.124271
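As far as I know, recent pandas releases return 0 rather than NaN when summing a bin that holds only NaN values, so reproducing the NaN rows shown above may require passing `min_count=1` to `sum`. The sketch below illustrates that under this assumption. The names `days`, `obs_dates` and `weekly_sum` are illustrative and not from the original answer.

```python
import numpy as np
import pandas as pd

days = pd.date_range("2010-01-01", "2010-01-31", freq="D")
obs_dates = pd.to_datetime(
    ["2010-01-12", "2010-01-18", "2010-01-20", "2010-01-22", "2010-01-28"]
)
observations = pd.DataFrame(np.random.rand(5, 2), index=obs_dates)

# Same reindex step as in the answer: missing days become NaN rows.
obs2 = observations.reindex(days)

# min_count=1 requires at least one non-NaN value per weekly bin;
# bins without any observation then come out as NaN instead of 0.
weekly_sum = obs2.resample("W").sum(min_count=1)
print(weekly_sum)
```

With the default `min_count=0`, the first two weeks would be expected to show 0.0 instead, which makes an empty week indistinguishable from one whose values sum to zero.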