Pandas dataframe, group by date/month and count by categories
I have a dataframe with this sort of structure:
import pandas as pd

df = pd.DataFrame({
    "name": ["Victor Hugo", "Emile Zola", "Paul Verlaine", "Charles Baudelaire"],
    "date_enrolled": ["2020-05-20 08:48:21+00:00", "2020-05-05 17:30:11+00:00",
                      "2020-05-22 01:11:24+00:00", "2020-07-29 09:32:10+00:00"],
    "cursus": ["AAA", "AAA", "BBB", "AAA"],
})
I am trying to obtain something like this:
period | AAA | BBB |
---|---|---|
2020-05 | 2 | 1 |
2020-06 | 0 | 0 |
2020-07 | 1 | 0 |
In short: one column per cursus holding the count of enrolled names, one row per period of time (YYYY-MM, or potentially another date grouping/format), and every period present, including the empty ones (like 2020-06 in my example).
I have tried many approaches, but none of them gives a satisfactory result.
Thank you for any assistance.
Convert date_enrolled to YYYY-MM with Series.dt.to_period, build the counts with df.pivot_table, and then add the missing months with df.reindex:
In [937]: df.date_enrolled = pd.to_datetime(df.date_enrolled).dt.to_period('M')
In [947]: ans = df.pivot_table(index='date_enrolled', columns='cursus', aggfunc='count', fill_value=0)
In [979]: ans = ans.reindex(pd.period_range(ans.index[0], ans.index[-1], freq='M'), fill_value=0)
In [980]: ans
Out[980]:
        name
cursus   AAA BBB
2020-05    2   1
2020-06    0   0
2020-07    1   0
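For reference, the three steps can be put together as one self-contained script. This is only a sketch, not part of the original session: the names stamps, counts, and full_range are my own, stripping the timezone assumes that grouping by UTC month is what is wanted, and passing values="name" is an addition that keeps the columns flat (AAA, BBB) rather than nested under name.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Victor Hugo", "Emile Zola", "Paul Verlaine", "Charles Baudelaire"],
    "date_enrolled": ["2020-05-20 08:48:21+00:00", "2020-05-05 17:30:11+00:00",
                      "2020-05-22 01:11:24+00:00", "2020-07-29 09:32:10+00:00"],
    "cursus": ["AAA", "AAA", "BBB", "AAA"],
})

# Assumption: parse as UTC, then drop the timezone before converting to
# monthly periods (periods carry no timezone information).
stamps = pd.to_datetime(df["date_enrolled"], utc=True).dt.tz_localize(None)
df["period"] = stamps.dt.to_period("M")

# Count names per (period, cursus); values="name" keeps the columns flat.
counts = df.pivot_table(index="period", columns="cursus", values="name",
                        aggfunc="count", fill_value=0)

# Reindex over every month between the first and last one, so that
# months without any enrolment appear as rows of zeros.
full_range = pd.period_range(counts.index.min(), counts.index.max(),
                             freq="M", name="period")
counts = counts.reindex(full_range, fill_value=0)
print(counts)
```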
Use crosstab, converting date_enrolled to month periods with Series.dt.to_period, and then add the missing months with DataFrame.reindex:
df['date_enrolled'] = pd.to_datetime(df['date_enrolled'])
# 'M' is the documented alias for monthly periods
df = pd.crosstab(df['date_enrolled'].dt.to_period('M'), df['cursus'])
df = df.reindex(pd.period_range(df.index.min(), df.index.max(), name='period'), fill_value=0)
print(df)
cursus   AAA  BBB
period
2020-05    2    1
2020-06    0    0
2020-07    1    0
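Since the question also asks about other date groupings, the crosstab approach can be wrapped in a small function that takes the period frequency as a parameter. The helper below is hypothetical (count_by_period and its argument names are not from the answer), and dropping the timezone after parsing as UTC is an assumption on my part.

```python
import pandas as pd

def count_by_period(frame, date_col, cat_col, freq="M"):
    """Hypothetical helper: rows per (period, category), empty periods kept."""
    # Assumption: group by UTC wall time, so parse as UTC and drop the timezone.
    stamps = pd.to_datetime(frame[date_col], utc=True).dt.tz_localize(None)
    periods = stamps.dt.to_period(freq)
    table = pd.crosstab(periods, frame[cat_col])
    # Every period between the first and the last one, including empty ones.
    full_range = pd.period_range(periods.min(), periods.max(),
                                 freq=freq, name="period")
    return table.reindex(full_range, fill_value=0)

df = pd.DataFrame({
    "name": ["Victor Hugo", "Emile Zola", "Paul Verlaine", "Charles Baudelaire"],
    "date_enrolled": ["2020-05-20 08:48:21+00:00", "2020-05-05 17:30:11+00:00",
                      "2020-05-22 01:11:24+00:00", "2020-07-29 09:32:10+00:00"],
    "cursus": ["AAA", "AAA", "BBB", "AAA"],
})

monthly = count_by_period(df, "date_enrolled", "cursus", freq="M")
quarterly = count_by_period(df, "date_enrolled", "cursus", freq="Q")
print(monthly)
print(quarterly)
```

The input frame is left unmodified, which avoids overwriting df with the result as the snippet above does.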
Or with DataFrame.asfreq:
df['date_enrolled'] = pd.to_datetime(df['date_enrolled'])
# periods -> month-start timestamps, fill the gaps with asfreq, then back to periods
df = (pd.crosstab(df['date_enrolled'].dt.to_period('M').dt.to_timestamp(), df['cursus'])
        .asfreq('MS', fill_value=0)
        .to_period('M'))
print(df)
cursus         AAA  BBB
date_enrolled
2020-05          2    1
2020-06          0    0
2020-07          1    0
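To see what asfreq contributes on its own, here is a minimal sketch on a toy Series (the values and the names s, filled, and as_periods are made up for illustration). With a DatetimeIndex of month-start timestamps, asfreq('MS') inserts the month starts that are missing and fill_value supplies their value; to_period('M') then turns the timestamps back into monthly periods.

```python
import pandas as pd

# Toy data: counts for May and July only, indexed by month-start timestamps.
s = pd.Series([3, 1], index=pd.to_datetime(["2020-05-01", "2020-07-01"]))

# asfreq conforms the index to a regular month-start grid;
# the newly created June row gets fill_value.
filled = s.asfreq("MS", fill_value=0)

# Convert the month-start timestamps to monthly periods.
as_periods = filled.to_period("M")
print(as_periods)
```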
Finally, if you need the period as a regular column instead of the index, use:
df = df.reset_index().rename_axis(None, axis=1)
print(df)
    period  AAA  BBB
0  2020-05    2    1
1  2020-06    0    0
2  2020-07    1    0
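If the period column should hold plain YYYY-MM strings rather than Period objects (for export, for instance), one option is Series.dt.strftime after resetting the index. This is a sketch under that assumption; the counts table is rebuilt by hand here so the snippet runs on its own, and the name flat is mine.

```python
import pandas as pd

# Hand-built stand-in for the reindexed count table.
counts = pd.DataFrame(
    {"AAA": [2, 0, 1], "BBB": [1, 0, 0]},
    index=pd.period_range("2020-05", "2020-07", freq="M", name="period"),
)
counts.columns.name = "cursus"

# Move the period index into a column and drop the "cursus" columns label.
flat = counts.reset_index().rename_axis(None, axis=1)

# Format the Period values as plain strings.
flat["period"] = flat["period"].dt.strftime("%Y-%m")
print(flat)
```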