Python, Pandas: Use the GroupBy.groups description to apply it to another grouping

Let's consider a DataFrame that contains 1 row of 2 values for each day of January 2010:

date_range = pd.date_range(dt(2010,1,1), dt(2010,1,31), freq='1D')
df = pd.DataFrame(data = np.random.rand(len(date_range),2), index = date_range)

I split that DataFrame into a list of 5 DataFrames, each containing 1 week's worth of data from the original:

df_weeks = [g for n, g in df.groupby(pd.Grouper(freq='W'))]
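For anyone who wants to reproduce the setup, here is a minimal self-contained sketch with the imports the snippets rely on. It assumes a pandas version where pd.Grouper(freq='W') is available; pd.TimeGrouper was removed in pandas 1.0, so pd.Grouper is used in its place. The printed lengths follow from 1 January 2010 being a Friday and weekly bins ending on Sunday.

```python
from datetime import datetime as dt

import numpy as np
import pandas as pd

# One row of two random values per day of January 2010.
date_range = pd.date_range(dt(2010, 1, 1), dt(2010, 1, 31), freq="D")
df = pd.DataFrame(np.random.rand(len(date_range), 2), index=date_range)

# Weekly bins end on Sunday (W-SUN), so the first bin only holds 1-3 January.
df_weeks = [g for _, g in df.groupby(pd.Grouper(freq="W"))]

print(len(df_weeks))               # 5
print([len(g) for g in df_weeks])  # [3, 7, 7, 7, 7]
```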

If I type df.groupby(pd.Grouper(freq='W')).groups, I can see a dict describing how the groups are split:

{Timestamp('2010-01-03 00:00:00', freq='W-SUN'): 3,
 Timestamp('2010-01-10 00:00:00', freq='W-SUN'): 10,
 Timestamp('2010-01-17 00:00:00', freq='W-SUN'): 17,
 Timestamp('2010-01-24 00:00:00', freq='W-SUN'): 24,
 Timestamp('2010-01-31 00:00:00', freq='W-SUN'): 31}
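Judging from the values above, the integers in this dict look like running row counts (the position at which each weekly bin ends) rather than per-group sizes: 3, then 7 more rows per week. The sketch below, which rebuilds the same daily frame under that assumption, uses size() to get the per-week counts and checks that their cumulative sum matches the numbers shown.

```python
import numpy as np
import pandas as pd

# Same daily frame as in the question (values are random, only the index matters here).
dates = pd.date_range("2010-01-01", "2010-01-31", freq="D")
df = pd.DataFrame(np.random.rand(len(dates), 2), index=dates)

# Number of rows that fall into each weekly bin.
sizes = df.groupby(pd.Grouper(freq="W")).size()

print(sizes.tolist())           # [3, 7, 7, 7, 7]
print(sizes.cumsum().tolist())  # [3, 10, 17, 24, 31]
```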

Let's consider another time series that has sparser data yet overlaps with the first one:

observations = pd.DataFrame(data =np.random.rand(5,2), index = (dt(2010,1,12), dt(2010,1,18), dt(2010,1,20), dt(2010,1,22), dt(2010,1,28)))

If I run the same code on it,

obs_weeks = [g for n, g in observations.groupby(pd.Grouper(freq='W'))]

it obviously returns fewer DataFrames in the list, as the data covers a shorter span. observations.groupby(pd.Grouper(freq='W')).groups returns:

{Timestamp('2010-01-17 00:00:00', freq='W-SUN'): 1,
 Timestamp('2010-01-24 00:00:00', freq='W-SUN'): 4,
 Timestamp('2010-01-31 00:00:00', freq='W-SUN'): 5}

But would there be a way to reuse the groups of the first DataFrame's GroupBy and apply them to the second one? I.e., in this specific case, that would mean ending up with a variable obs_weeks containing 5 DataFrames spanning the same time ranges as df_weeks, 2 of them being empty.

One simple solution to your problem is to make sure the observations DataFrame contains every date that the df DataFrame does. You can do this with the reindex method: dates without an observation become NaN rows, and because every observation date already appears in df.index, no data is lost. You will then have the exact same groups. You can also use resample('W') instead of groupby(pd.Grouper(freq='W')).

obs2 = observations.reindex(df.index)

obs2.resample('W').groups

{Timestamp('2010-01-03 00:00:00', freq='W-SUN'): 3,
 Timestamp('2010-01-10 00:00:00', freq='W-SUN'): 10,
 Timestamp('2010-01-17 00:00:00', freq='W-SUN'): 17,
 Timestamp('2010-01-24 00:00:00', freq='W-SUN'): 24,
 Timestamp('2010-01-31 00:00:00', freq='W-SUN'): 31}
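To get back the list of 5 DataFrames the question asks for, with 2 of them empty, one option is to group the reindexed frame and then drop the NaN filler rows inside each group. This is only a sketch under a few assumptions: the fixed seed and the dropna(how="all") step are illustrative additions, not part of the original answer, and it relies on real observations never being entirely NaN.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, purely for reproducibility

dates = pd.date_range("2010-01-01", "2010-01-31", freq="D")
df = pd.DataFrame(rng.random((len(dates), 2)), index=dates)

obs_dates = pd.to_datetime(
    ["2010-01-12", "2010-01-18", "2010-01-20", "2010-01-22", "2010-01-28"]
)
observations = pd.DataFrame(rng.random((5, 2)), index=obs_dates)

# Align observations on the daily index of df; missing days become NaN rows.
obs2 = observations.reindex(df.index)

# Group by the same weekly bins, then remove the NaN filler rows per group.
obs_weeks = [
    g.dropna(how="all") for _, g in obs2.groupby(pd.Grouper(freq="W"))
]

print(len(obs_weeks))               # 5
print([len(g) for g in obs_weeks])  # [0, 0, 1, 3, 1]
```

The first two entries are empty DataFrames that still carry the original columns, so they can be processed alongside the corresponding entries of df_weeks.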

If we apply a simple aggregation such as sum, we can compare the results for both frames:

df.resample('W').sum()

                 0         1
2010-01-03  1.990558  2.555191
2010-01-10  2.707777  3.771756
2010-01-17  2.799897  3.353363
2010-01-24  3.165479  2.778870
2010-01-31  4.946577  3.394211

And now with obs2, whose first 2 groups have no data:

obs2.resample('W').sum()

                   0         1
2010-01-03       NaN       NaN
2010-01-10       NaN       NaN
2010-01-17  0.172341  0.137136
2010-01-24  1.752472  2.375306
2010-01-31  0.711525  0.124271
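One caveat when reproducing this output: as far as I know, since pandas 0.22 sum() over a bin that contains only NaN returns 0 by default, so the first two rows may show 0.0 instead of NaN. Passing min_count=1 should restore the NaN rows. The sketch below uses constant values of 1.0 instead of random data (an assumption made only so the sums are predictable).

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2010-01-01", "2010-01-31", freq="D")
obs_dates = pd.to_datetime(
    ["2010-01-12", "2010-01-18", "2010-01-20", "2010-01-22", "2010-01-28"]
)
# Constant values so that each weekly sum equals the number of observations.
observations = pd.DataFrame(np.ones((5, 2)), index=obs_dates)
obs2 = observations.reindex(dates)

# Default behaviour in pandas >= 0.22: all-NaN bins sum to 0.
weekly_default = obs2.resample("W").sum()

# min_count=1 requires at least one valid value, otherwise the result is NaN.
weekly_nan = obs2.resample("W").sum(min_count=1)

print(weekly_default[0].tolist())     # [0.0, 0.0, 1.0, 3.0, 1.0]
print(weekly_nan[0].isna().tolist())  # [True, True, False, False, False]
```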
