Pandas groupby and then count the occurrence of 0
From this table, I am trying to fill in missing dates using the min/max weekly dates available in the dataframe. Then I want to count the occurrences of 0 sales for each category.
import pandas as pd

df = pd.DataFrame({'category_id': ['aaa','aaa','aaa','aaa','bbb','bbb','bbb','ccc','ccc'],
                   'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26','2015-01-12', '2015-01-19', '2015-01-26','2015-01-05', '2015-01-12'],
                   'sales': [0,20,30,10,45,0,47,0,10]})
First step: Add the missing weekly dates to all categories and fill the sales for those dates with 0 (Q1: I'm not sure how to get this df_add_missing_dates result)
# expected dates interpolation output
df_add_missing_dates=pd.DataFrame({'category_id': ['aaa','aaa','aaa','aaa','bbb','bbb','bbb','bbb','ccc','ccc','ccc','ccc'],
'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26',
'2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26',
'2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26'],
'sales': [0,20,30,10,
0,45,0,47,
0,10,0,0]})
Second step: Count the occurrences of 0 weekly sales (Q2: How do I aggregate the sales = 0 rows for each category?)
# expected final output
category_id | sales_0_count
aaa | 1
bbb | 2
ccc | 3
Current code and logic:
# convert string to datetime and set as index
df['week'] = pd.to_datetime(df['week'], format='%Y-%m-%d')
# find min/max weekly dates in the dataframe --> I couldn't add missing dates with 0 sales though
idx = pd.period_range(start=df.week.min(),end=df.week.max(),freq='W')
df = df.reindex(idx, fill_value=0).reset_index(drop=True)
df_add_missing_dates = df
# group by category to count how many times weekly sales is 0
IIUC, you can use pd.MultiIndex.from_product with reindex and fill_value=0, then use a boolean mask and groupby with sum:
idx = pd.MultiIndex.from_product([df['category_id'].unique(),
df['week'].unique()],
names=['category_id', 'week'])
df_missing = (df.set_index(['category_id', 'week'])
.reindex(idx, fill_value=0)
.reset_index())
df_missing
Output:
category_id week sales
0 aaa 2015-01-05 0
1 aaa 2015-01-12 20
2 aaa 2015-01-19 30
3 aaa 2015-01-26 10
4 bbb 2015-01-05 0
5 bbb 2015-01-12 45
6 bbb 2015-01-19 0
7 bbb 2015-01-26 47
8 ccc 2015-01-05 0
9 ccc 2015-01-12 10
10 ccc 2015-01-19 0
11 ccc 2015-01-26 0
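Building the index from df['week'].unique() only covers weeks that appear for at least one category. If a week could be missing from every category, a minimal sketch of a variant is to generate the full weekly range from the min/max dates instead. This assumes the weeks are labelled by their Monday, which is true for the sample data; the name all_weeks is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'category_id': ['aaa', 'aaa', 'aaa', 'aaa', 'bbb', 'bbb', 'bbb', 'ccc', 'ccc'],
    'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26',
             '2015-01-12', '2015-01-19', '2015-01-26',
             '2015-01-05', '2015-01-12'],
    'sales': [0, 20, 30, 10, 45, 0, 47, 0, 10],
})
df['week'] = pd.to_datetime(df['week'], format='%Y-%m-%d')

# Assumption: every week is labelled by its Monday, so 'W-MON' yields
# each Monday between the earliest and latest week in the data.
all_weeks = pd.date_range(df['week'].min(), df['week'].max(), freq='W-MON')

# Every (category, week) combination, whether or not it occurs in df.
idx = pd.MultiIndex.from_product(
    [df['category_id'].unique(), all_weeks],
    names=['category_id', 'week'],
)

# Rows missing from df are created with sales = 0.
df_missing = (
    df.set_index(['category_id', 'week'])
      .reindex(idx, fill_value=0)
      .reset_index()
)
print(df_missing)
```

For the sample data this should produce the same 12 rows as the version based on unique().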
Now, group and sum:
(df_missing == 0).groupby(df_missing['category_id'])['sales'].sum()
Output:
category_id
aaa 1.0
bbb 2.0
ccc 3.0
Name: sales, dtype: float64
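To get closer to the table asked for in the question (integer counts in a column named sales_0_count), a possible sketch is to compare only the sales column and rename the result. The df_missing frame below is rebuilt from the output shown earlier so that the snippet runs on its own.

```python
import pandas as pd

# df_missing rebuilt from the reindexed output shown earlier.
df_missing = pd.DataFrame({
    'category_id': ['aaa'] * 4 + ['bbb'] * 4 + ['ccc'] * 4,
    'week': pd.to_datetime(['2015-01-05', '2015-01-12',
                            '2015-01-19', '2015-01-26'] * 3),
    'sales': [0, 20, 30, 10, 0, 45, 0, 47, 0, 10, 0, 0],
})

sales_0_count = (
    df_missing['sales'].eq(0)             # boolean mask: True where sales == 0
    .groupby(df_missing['category_id'])   # group the mask by category
    .sum()                                # True counts as 1
    .astype(int)                          # force integer counts
    .rename('sales_0_count')
    .reset_index()
)
print(sales_0_count)
```

Comparing just the sales column avoids running the == 0 test against the category_id and week columns as well.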
Not sure what the reindex part is for, but after converting the week column with
df['week'] = pd.to_datetime(df['week'], format='%Y-%m-%d')
you could do:
groupedDf = df.groupby(['category_id', pd.Grouper(key='week', freq='W-MON')])['sales'].sum().reset_index().sort_values('week')
zeroSalesWeek = groupedDf[groupedDf.sales == 0]
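Output (the row labels indicate that df here already contains the filled-in weeks, i.e. df_add_missing_dates):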
zeroSalesWeek
category_id week sales
0 aaa 2015-01-05 0
4 bbb 2015-01-05 0
8 ccc 2015-01-05 0
6 bbb 2015-01-19 0
10 ccc 2015-01-19 0
11 ccc 2015-01-26 0
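On the original df, before the missing weeks are added, a groupby on category_id and week only returns the category/week pairs that actually occur in the data. As a rough sketch of one way to combine the grouping with the fill step, the grouped sums can be unstacked into a category-by-week table with fill_value=0 and the zeros counted per row. The names wide and zero_counts are illustrative, and weeks absent from every category would still not appear as columns.

```python
import pandas as pd

df = pd.DataFrame({
    'category_id': ['aaa', 'aaa', 'aaa', 'aaa', 'bbb', 'bbb', 'bbb', 'ccc', 'ccc'],
    'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26',
             '2015-01-12', '2015-01-19', '2015-01-26',
             '2015-01-05', '2015-01-12'],
    'sales': [0, 20, 30, 10, 45, 0, 47, 0, 10],
})
df['week'] = pd.to_datetime(df['week'], format='%Y-%m-%d')

# One row per category, one column per week; category/week pairs that
# never occur in df are filled with 0.
wide = (
    df.groupby(['category_id', pd.Grouper(key='week', freq='W-MON')])['sales']
      .sum()
      .unstack(fill_value=0)
)

# Count the zero cells in each category's row.
zero_counts = wide.eq(0).sum(axis=1)
print(zero_counts)
```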
To select a particular category_id you could try:
df[(df.sales == 0) & (df.category_id=='bbb')]
which would give you:
category_id week sales
4 bbb 2015-01-05 0
6 bbb 2015-01-19 0
Furthermore, if you think this may be a little too time consuming, you can always create a quick function to select a particular category_id, like:
def zeroGroupedDf(df, category_id):
    # return the rows of df with zero sales for the given category_id
    category_id = str(category_id)
    tempDf = df[(df.sales == 0) & (df.category_id==category_id)]
    return tempDf
and call it with any category_id you want to make a new df, such as:
test = zeroGroupedDf(df, 'bbb')
test
category_id week sales
4 bbb 2015-01-05 0
6 bbb 2015-01-19 0
This will give you the expected output in a crude way:
df_add_missing_dates[df_add_missing_dates.sales.eq(0)].groupby('category_id')['sales'].count()
If you want the actual dataframe you expected (though this could be done much better):
expected_output = df_add_missing_dates[df_add_missing_dates.sales.eq(0)].\
groupby('category_id',as_index=False)['sales'].count().\
rename({'sales':'sales_0_count'},axis=1)
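One thing to watch with filtering before grouping: a category that never has zero sales disappears from the result instead of showing a count of 0. The sample data has no such category, so the sketch below uses a small hypothetical frame (data, with an invented category 'ddd') to show one way of adding those categories back with reindex.

```python
import pandas as pd

# Hypothetical data: category 'ddd' never has zero sales.
data = pd.DataFrame({
    'category_id': ['aaa', 'aaa', 'ddd', 'ddd'],
    'sales': [0, 20, 5, 7],
})

# Filtering first keeps only the zero-sales rows, so 'ddd' is dropped here.
zero_rows = data[data['sales'].eq(0)]

counts = (
    zero_rows.groupby('category_id')['sales']
    .count()
    # Add back every category seen in the data, with a count of 0.
    .reindex(data['category_id'].unique(), fill_value=0)
    .rename_axis('category_id')
    .rename('sales_0_count')
    .reset_index()
)
print(counts)
```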
I did it like this:
dfz = df_add_missing_dates[df_add_missing_dates['sales']==0]
g = dfz.groupby(pd.Grouper(key='category_id'))
g['sales'].count()
category_id
aaa 1
bbb 2
ccc 3
Name: sales, dtype: int64