Pandas Resample Monthly data to Weekly within Groups and Split Values

I have a dataframe, shown below:

ID Date     Volume Sales
1  2020-02   10     4
1  2020-03   8      6
2  2020-02   6      8
2  2020-03   4      10

Is there an easy way to convert this to weekly data using resampling, and to divide the Volume and Sales columns by the number of weeks in the month?

I have started with code that looks like this:

import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
grouped = df.groupby('ID').resample('W').ffill().reset_index()
print(grouped)

After this step, I get an error message: cannot insert ID, already exists

Also, is there code I can use to find the number of weeks in a month, so that I can divide the Volume and Sales columns by it?

The expected output is:

ID      Volume  Sales      Weeks
0   1      2.5    1.0     2020-02-02
0   1      2.5    1.0     2020-02-09
0   1      2.5    1.0     2020-02-16
0   1      2.5    1.0     2020-02-23
1   1      1.6    1.2     2020-03-01
1   1      1.6    1.2     2020-03-08
1   1      1.6    1.2     2020-03-15
1   1      1.6    1.2     2020-03-22
1   1      1.6    1.2     2020-03-29
2   2      1.5    2       2020-02-02
2   2      1.5    2       2020-02-09
2   2      1.5    2       2020-02-16
2   2      1.5    2       2020-02-23
3   2      0.8    2       2020-03-01
3   2      0.8    2       2020-03-08
3   2      0.8    2       2020-03-15
3   2      0.8    2       2020-03-22
3   2      0.8    2       2020-03-29

After review, a much simpler solution can be used. Please refer to the subsection labeled New Solution in Part 1 below.

This task requires multiple steps. Let's break it down as follows:

Part 1: Transform Date & Resample

New Solution

Because the required weekly frequency is Sunday-based (i.e. freq='W-SUN') and each month is handled independently of its adjacent months, we can use the year-month values in column Date to generate weekly date ranges directly in one step. This avoids the 2-step approach of first generating daily date ranges from the year-month values and then resampling them to weekly.

The new logic only needs pd.date_range() with freq='W' (which is anchored on Sunday by default), together with pd.offsets.MonthEnd(), to generate the weekly dates for a month. It does not need to call .resample() or .asfreq() as other solutions do. Effectively, pd.date_range() with freq='W' does the resampling for us.

Here is the code:

df['Weeks'] = df['Date'].map(lambda x: 
                             pd.date_range(
                                 start=pd.to_datetime(x), 
                                 end=(pd.to_datetime(x) + pd.offsets.MonthEnd()),
                                 freq='W'))

df = df.explode('Weeks')

Result:

print(df)


   ID     Date  Volume  Sales      Weeks
0   1  2020-02      10      4 2020-02-02
0   1  2020-02      10      4 2020-02-09
0   1  2020-02      10      4 2020-02-16
0   1  2020-02      10      4 2020-02-23
1   1  2020-03       8      6 2020-03-01
1   1  2020-03       8      6 2020-03-08
1   1  2020-03       8      6 2020-03-15
1   1  2020-03       8      6 2020-03-22
1   1  2020-03       8      6 2020-03-29
2   2  2020-02       6      8 2020-02-02
2   2  2020-02       6      8 2020-02-09
2   2  2020-02       6      8 2020-02-16
2   2  2020-02       6      8 2020-02-23
3   2  2020-03       4     10 2020-03-01
3   2  2020-03       4     10 2020-03-08
3   2  2020-03       4     10 2020-03-15
3   2  2020-03       4     10 2020-03-22
3   2  2020-03       4     10 2020-03-29

With the 2 lines of code above, we already have the required result for Part 1. We don't need the more complicated .groupby() and .resample() code of the old solution.
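For anyone who wants to try this without an existing dataframe, here is a self-contained sketch of the same two steps on the sample data from the question. The helper name sundays_in_month is only for illustration and is not part of pandas; it assumes the Date values are 'YYYY-MM' strings.

```python
import pandas as pd

# Sample data from the question; Date holds 'YYYY-MM' strings.
df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Date": ["2020-02", "2020-03", "2020-02", "2020-03"],
    "Volume": [10, 8, 6, 4],
    "Sales": [4, 6, 8, 10],
})

def sundays_in_month(year_month):
    """Illustrative helper: every Sunday that falls inside a 'YYYY-MM' month."""
    start = pd.to_datetime(year_month)     # first day of the month
    end = start + pd.offsets.MonthEnd()    # last day of the same month
    return pd.date_range(start=start, end=end, freq="W")  # 'W' is Sunday-anchored

# What a single month expands to.
feb_weeks = sundays_in_month("2020-02")
print(feb_weeks)

# Same two steps as above: one date range per row, then one row per date.
df["Weeks"] = df["Date"].map(sundays_in_month)
df = df.explode("Weeks")
print(df)
```

February 2020 yields 4 Sundays and March 2020 yields 5, so the 4 monthly rows become 18 weekly rows.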

We can now continue to Part 2. As we have not created the grouped object, we can either replace grouped with df in the code in Part 2, or add a new line grouped = df before continuing.

Old Solution

We use pd.date_range() with freq='D', together with pd.offsets.MonthEnd(), to produce daily entries for the full month. We then set these full-month dates as the index before resampling to weekly frequency. We resample with closed='left' to exclude the unwanted week of 2020-04-05 that is produced under the default resample() parameters.

df['Weeks'] = df['Date'].map(lambda x: 
                             pd.date_range(
                                 start=pd.to_datetime(x), 
                                 end=(pd.to_datetime(x) + pd.offsets.MonthEnd()),
                                 freq='D'))

df = df.explode('Weeks').set_index('Weeks')

grouped = (df.groupby(['ID', 'Date'], as_index=False)
             .resample('W', closed='left')
             .ffill().dropna().reset_index(-1))

Result:

print(grouped)


       Weeks   ID     Date  Volume  Sales
0 2020-02-02  1.0  2020-02    10.0    4.0
0 2020-02-09  1.0  2020-02    10.0    4.0
0 2020-02-16  1.0  2020-02    10.0    4.0
0 2020-02-23  1.0  2020-02    10.0    4.0
1 2020-03-01  1.0  2020-03     8.0    6.0
1 2020-03-08  1.0  2020-03     8.0    6.0
1 2020-03-15  1.0  2020-03     8.0    6.0
1 2020-03-22  1.0  2020-03     8.0    6.0
1 2020-03-29  1.0  2020-03     8.0    6.0
2 2020-02-02  2.0  2020-02     6.0    8.0
2 2020-02-09  2.0  2020-02     6.0    8.0
2 2020-02-16  2.0  2020-02     6.0    8.0
2 2020-02-23  2.0  2020-02     6.0    8.0
3 2020-03-01  2.0  2020-03     4.0   10.0
3 2020-03-08  2.0  2020-03     4.0   10.0
3 2020-03-15  2.0  2020-03     4.0   10.0
3 2020-03-22  2.0  2020-03     4.0   10.0
3 2020-03-29  2.0  2020-03     4.0   10.0

Here, we retain the column Date for later use.

Part 2: Divide Volume and Sales by number of weeks in month

Here, the number of weeks used to divide the Volume and Sales figures should be the number of resampled weeks within the month, as shown in the interim result above.

If we used the actual number of calendar weeks instead, the results would be inconsistent. For example, Feb 2020 has 29 days because of the leap year and spans 5 calendar weeks, rather than the 4 resampled weeks in the interim result above. Dividing each Volume and Sales figure by 5 while there are only 4 weekly entries would leave the weekly values summing to less than the monthly total.
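The difference between the two counts can be checked with a small sketch. It assumes Sunday-to-Saturday calendar weeks (hence the 'W-SAT' period, which ends on Saturday) and uses the Volume of 10 from the sample data.

```python
import pandas as pd

days = pd.date_range("2020-02-01", "2020-02-29", freq="D")

# Calendar weeks (Sunday to Saturday) containing at least one day of Feb 2020.
calendar_weeks = days.to_period("W-SAT").nunique()

# Weekly rows actually generated: the Sundays that fall inside the month.
resampled_weeks = len(pd.date_range("2020-02-01", "2020-02-29", freq="W"))

monthly_volume = 10
print(calendar_weeks, resampled_weeks)                     # 5 4
print(resampled_weeks * monthly_volume / resampled_weeks)  # 10.0, total preserved
print(resampled_weeks * monthly_volume / calendar_weeks)   # 8.0, total understated
```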

Let's go to the code:

We group by columns ID and Date, then divide each value in columns Volume and Sales by the group size (i.e. the number of resampled weeks).

grouped[['Volume', 'Sales']] = (grouped.groupby(['ID', 'Date'])[['Volume', 'Sales']]
                                       .transform(lambda x: x / x.count()))

or, in simplified form using /=, as follows:

grouped[['Volume', 'Sales']] /= (grouped.groupby(['ID', 'Date'])[['Volume', 'Sales']]
                                        .transform('count'))

Result:

print(grouped)


       Weeks   ID     Date  Volume  Sales
0 2020-02-02  1.0  2020-02     2.5    1.0
0 2020-02-09  1.0  2020-02     2.5    1.0
0 2020-02-16  1.0  2020-02     2.5    1.0
0 2020-02-23  1.0  2020-02     2.5    1.0
1 2020-03-01  1.0  2020-03     1.6    1.2
1 2020-03-08  1.0  2020-03     1.6    1.2
1 2020-03-15  1.0  2020-03     1.6    1.2
1 2020-03-22  1.0  2020-03     1.6    1.2
1 2020-03-29  1.0  2020-03     1.6    1.2
2 2020-02-02  2.0  2020-02     1.5    2.0
2 2020-02-09  2.0  2020-02     1.5    2.0
2 2020-02-16  2.0  2020-02     1.5    2.0
2 2020-02-23  2.0  2020-02     1.5    2.0
3 2020-03-01  2.0  2020-03     0.8    2.0
3 2020-03-08  2.0  2020-03     0.8    2.0
3 2020-03-15  2.0  2020-03     0.8    2.0
3 2020-03-22  2.0  2020-03     0.8    2.0
3 2020-03-29  2.0  2020-03     0.8    2.0
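As a sanity check, Part 1 (New Solution) and Part 2 can be combined into one function and the weekly values summed back per month. This is only a sketch: the function name monthly_to_weekly and its value_cols parameter are made up for illustration, and it resets the index after explode(), which the code above does not do.

```python
import pandas as pd

def monthly_to_weekly(df, value_cols=("Volume", "Sales")):
    """Illustrative helper: expand 'YYYY-MM' rows into Sunday-dated weekly rows
    and spread each value evenly over the weeks generated for that month."""
    out = df.copy()
    out["Weeks"] = out["Date"].map(
        lambda ym: pd.date_range(
            start=pd.to_datetime(ym),
            end=pd.to_datetime(ym) + pd.offsets.MonthEnd(),
            freq="W",
        )
    )
    # One row per generated week; a fresh index keeps row labels unique.
    out = out.explode("Weeks").reset_index(drop=True)
    cols = list(value_cols)
    # Number of weekly rows in each (ID, Date) group, aligned row by row.
    n_weeks = out.groupby(["ID", "Date"])[cols].transform("count")
    out[cols] = out[cols] / n_weeks
    return out

monthly = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Date": ["2020-02", "2020-03", "2020-02", "2020-03"],
    "Volume": [10, 8, 6, 4],
    "Sales": [4, 6, 8, 10],
})

weekly = monthly_to_weekly(monthly)
print(weekly)

# Summing the weekly values per (ID, Date) should give back the monthly
# figures, up to floating-point rounding.
totals = weekly.groupby(["ID", "Date"], as_index=False)[["Volume", "Sales"]].sum()
print(totals)
```

Because the divisor is the number of rows actually generated for each month, the weekly figures always add back up to the monthly ones, whichever months are involved.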

Optionally, you can do some cosmetic work: drop the column Date and move the column Weeks to your desired position.
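One possible way to do this tidy-up is sketched below. The two-row grouped frame is a stand-in built from the first rows of the result above, and casting ID back to int is an extra step that only matters if ID became float, as it did in the old solution.

```python
import pandas as pd

# Two rows standing in for the Part 2 result (values copied from the output above).
grouped = pd.DataFrame({
    "Weeks": pd.to_datetime(["2020-02-02", "2020-02-09"]),
    "ID": [1.0, 1.0],
    "Date": ["2020-02", "2020-02"],
    "Volume": [2.5, 2.5],
    "Sales": [1.0, 1.0],
})

tidy = (grouped.drop(columns="Date")     # Date is no longer needed
               .astype({"ID": int})      # optional: undo the float ID
               [["ID", "Volume", "Sales", "Weeks"]])  # column order of the expected output
print(tidy)
```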

Edit: (Similarities to and differences from other questions on resampling from month to week)

In this review, I searched for other questions with similar titles and compared the questions and solutions.

There is another question with a similar requirement: to split the monthly values equally into weekly values according to the number of weeks in the resampled month. In that question, the months are represented by the first date of each month, in datetime format, and used as the index of the dataframe. In this question, the months are represented as YYYY-MM, which can be of string type.

A big and critical difference is that in that question, the last month period index 2018-05-01, with value 22644, was actually not processed. That is, the month of 2018-05 is not resampled into weeks in May 2018, and the value 22644 is never split into weekly proportions. The accepted solution using .asfreq() does not show any entry for 2018-05 at all, while the other solution using .resample() still keeps one (un-resampled) entry for 2018-05 whose value 22644 is not split into weekly proportions.

However, in our question here, the last month listed in each group still needs to be resampled into weeks, with values split equally across the resampled weeks.

As for the solution, my new solution calls neither .resample() nor .asfreq(). It just uses pd.date_range() with freq='W', together with pd.offsets.MonthEnd(), to generate the weekly dates for a month from the 'YYYY-MM' values. This is something I had not thought of when I worked on the old solution using .resample().
