Pandas dataframe, group by date/month and count by categories

I have a dataframe with this sort of structure:

import pandas as pd

df = pd.DataFrame({
    "name": ["Victor Hugo", "Emile Zola", "Paul Verlaine", "Charles Baudelaire"],
    "date_enrolled": ["2020-05-20 08:48:21+00:00", "2020-05-05 17:30:11+00:00",
                      "2020-05-22 01:11:24+00:00", "2020-07-29 09:32:10+00:00"],
    "cursus": ["AAA", "AAA", "BBB", "AAA"],
})

I am trying to obtain something like this:

period   AAA  BBB
2020-05    2    1
2020-06    0    0
2020-07    1    0

In short: one column per cursus holding the count of enrolled names, one row per period of time (YYYY-MM, or potentially another date grouping/format), covering every period in the range, including those that are empty (like 2020-06 in my example).

I have tried many approaches, but none of them gave me a satisfactory result.

Thank you for any assistance.

Convert date_enrolled to YYYY-MM periods with Series.dt.to_period, count with df.pivot_table, then add the missing months with df.reindex:

In [937]: df.date_enrolled = pd.to_datetime(df.date_enrolled).dt.to_period('M')

In [947]: ans = df.pivot_table(index='date_enrolled', columns='cursus', aggfunc='count', fill_value=0)

In [979]: ans = ans.reindex(pd.period_range(ans.index[0], ans.index[-1],freq='M'), fill_value=0)

In [980]: ans
Out[980]: 
        name    
cursus   AAA BBB
2020-05    2   1
2020-06    0   0
2020-07    1   0
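
The output above carries an extra "name" level in the column header, because pivot_table counts every remaining column when no values argument is given. As a minimal sketch (not part of the original answer), passing values='name' should yield a single-level header. The variable names months and flat are illustrative, and the timezone is dropped explicitly on the assumption that UTC wall-clock months are what is wanted:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Victor Hugo", "Emile Zola", "Paul Verlaine", "Charles Baudelaire"],
    "date_enrolled": ["2020-05-20 08:48:21+00:00", "2020-05-05 17:30:11+00:00",
                      "2020-05-22 01:11:24+00:00", "2020-07-29 09:32:10+00:00"],
    "cursus": ["AAA", "AAA", "BBB", "AAA"],
})

# Parse as UTC, then drop the timezone so to_period does not warn about losing it.
months = (pd.to_datetime(df["date_enrolled"], utc=True)
            .dt.tz_localize(None)
            .dt.to_period("M"))

# values="name" restricts the count to one column, so the header has one level.
flat = df.assign(date_enrolled=months).pivot_table(
    index="date_enrolled", columns="cursus", values="name",
    aggfunc="count", fill_value=0,
)

# Fill in months with no enrolments at all.
flat = flat.reindex(
    pd.period_range(flat.index.min(), flat.index.max(), freq="M"),
    fill_value=0,
)
print(flat)
```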

Use crosstab after converting date_enrolled to monthly periods with Series.dt.to_period, then add the missing months with DataFrame.reindex:

df['date_enrolled'] = pd.to_datetime(df['date_enrolled'])

# Count rows per (month, cursus) pair
df = pd.crosstab(df['date_enrolled'].dt.to_period('M'), df['cursus'])

# Reindex over the full month range so empty months appear as rows of zeros
df = df.reindex(pd.period_range(df.index.min(), df.index.max(), name='period'), fill_value=0)
print(df)
cursus   AAA  BBB
period           
2020-05    2    1
2020-06    0    0
2020-07    1    0

Or with DataFrame.asfreq:

df['date_enrolled'] = pd.to_datetime(df['date_enrolled'])

# asfreq needs a DatetimeIndex, so convert the periods to month-start timestamps first
df = (pd.crosstab(df['date_enrolled'].dt.to_period('M').dt.to_timestamp(), df['cursus'])
        .asfreq('MS', fill_value=0)
        .to_period('M'))
print(df)

cursus         AAA  BBB
date_enrolled          
2020-05          2    1
2020-06          0    0
2020-07          1    0

Finally, if you need the period as a regular column instead of the index, use:

df = df.reset_index().rename_axis(None, axis=1)
print(df)

    period  AAA  BBB
0  2020-05    2    1
1  2020-06    0    0
2  2020-07    1    0
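
The question also mentions "potentially other date grouping/format". The crosstab and reindex steps above can be wrapped in a small helper that takes the period frequency as a parameter. This is only a sketch: the function name count_by_period and its parameters are hypothetical, and it assumes that dropping the timezone after parsing as UTC is acceptable for the data.

```python
import pandas as pd

def count_by_period(df, date_col, category_col, freq="M"):
    """Count rows per (period, category), including empty periods.

    Hypothetical helper; freq is any pandas period alias such as "M" or "Q".
    """
    # Parse as UTC and drop the timezone before converting to periods.
    dates = pd.to_datetime(df[date_col], utc=True).dt.tz_localize(None)
    periods = dates.dt.to_period(freq)
    counts = pd.crosstab(periods, df[category_col])
    # Every period between the first and last observed one, inclusive.
    full_range = pd.period_range(periods.min(), periods.max(),
                                 freq=freq, name="period")
    return counts.reindex(full_range, fill_value=0)

df = pd.DataFrame({
    "name": ["Victor Hugo", "Emile Zola", "Paul Verlaine", "Charles Baudelaire"],
    "date_enrolled": ["2020-05-20 08:48:21+00:00", "2020-05-05 17:30:11+00:00",
                      "2020-05-22 01:11:24+00:00", "2020-07-29 09:32:10+00:00"],
    "cursus": ["AAA", "AAA", "BBB", "AAA"],
})

monthly = count_by_period(df, "date_enrolled", "cursus", freq="M")
quarterly = count_by_period(df, "date_enrolled", "cursus", freq="Q")
print(monthly)
print(quarterly)
```

Calling reset_index() on the result turns the period index into an ordinary column, as in the last step above.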

    
