
Group by year/month/day in pandas

Suppose I have the following DataFrame:

rng = pd.date_range('1/1/2011', periods=72, freq='H')
np.random.seed(10)
n = 10
df = pd.DataFrame(
    {
        "datetime": np.random.choice(rng,n),
        "cat": np.random.choice(['a','b','b'], n),
        "val": np.random.randint(0,5, size=n)
        }
    )

If I now do a groupby:

gb = df.groupby(['cat','datetime']).sum()

I get the total for each cat per hour:

cat datetime            val
a   2011-01-01 00:00:00 1
    2011-01-01 09:00:00 3
    2011-01-02 16:00:00 1
    2011-01-03 16:00:00 1
b   2011-01-01 08:00:00 4
    2011-01-01 15:00:00 3
    2011-01-01 16:00:00 3
    2011-01-02 04:00:00 4
    2011-01-02 05:00:00 1
    2011-01-02 12:00:00 4

However, I would like to get something like this:

cat datetime   val
a   2011-01-01 4
    2011-01-02 1
    2011-01-03 1
b   2011-01-01 10
    2011-01-02 9

I can get the desired result by adding another column named date:

# pd.datetime is no longer available in recent pandas; call .date() on each Timestamp instead
df['date'] = df.datetime.apply(lambda ts: ts.date())

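and then doing a similar groupby: df.groupby(['cat','date']).sum(). But I'm wondering, is there a more pythonic way to do this? Also, I may want to look at the month or year level. So, what is the right way to do it?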

You can try set_index and then groupby on cat and the date:

import pandas as pd
import numpy as np

rng = pd.date_range('1/1/2011', periods=72, freq='H')
np.random.seed(10)
n = 10
df = pd.DataFrame(
    {
        "datetime": np.random.choice(rng,n),
        "cat": np.random.choice(['a','b','b'], n),
        "val": np.random.randint(0,5, size=n)
        }
    )
print(df)
  cat            datetime  val
0   a 2011-01-01 09:00:00    3
1   b 2011-01-01 15:00:00    3
2   a 2011-01-03 16:00:00    1
3   b 2011-01-02 04:00:00    4
4   b 2011-01-02 05:00:00    1
5   b 2011-01-01 08:00:00    4
6   a 2011-01-01 00:00:00    1
7   a 2011-01-02 16:00:00    1
8   b 2011-01-02 12:00:00    4
9   b 2011-01-01 16:00:00    3
df = df.set_index('datetime')
# The lambda receives each index value (a Timestamp); .date() returns its calendar day
gb = df.groupby(['cat', lambda ts: ts.date()]).sum()
print(gb)
                val
cat                
a   2011-01-01    4
    2011-01-02    1
    2011-01-03    1
b   2011-01-01   10
    2011-01-02    9
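
As a possible alternative that avoids both set_index and a helper column, here is a minimal sketch using pd.Grouper on the datetime column. It assumes a pandas version that provides pd.Grouper, and it rebuilds the ten rows printed above by hand instead of relying on the random seed.

```python
import pandas as pd

# The same ten rows as the DataFrame printed above, typed in by hand.
df = pd.DataFrame({
    "cat": list("ababbbaabb"),
    "datetime": pd.to_datetime([
        "2011-01-01 09:00", "2011-01-01 15:00", "2011-01-03 16:00",
        "2011-01-02 04:00", "2011-01-02 05:00", "2011-01-01 08:00",
        "2011-01-01 00:00", "2011-01-02 16:00", "2011-01-02 12:00",
        "2011-01-01 16:00",
    ]),
    "val": [3, 3, 1, 4, 1, 4, 1, 1, 4, 3],
})

# Group by category and by calendar day of the "datetime" column.
daily = df.groupby(["cat", pd.Grouper(key="datetime", freq="D")])["val"].sum()
print(daily)
```

The result should carry the same daily totals per cat as the set_index version, with the second index level keeping the name datetime.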

From your intermediate structure, you can use .unstack to separate the categories, do a .resample, and then .stack again to get back to the original form:

In [126]: gb = df.groupby(['cat', 'datetime']).sum()

In [127]: gb.unstack(0)
Out[127]:
                     val
cat                    a    b
datetime
2011-01-01 00:00:00  1.0  NaN
2011-01-01 08:00:00  NaN  4.0
2011-01-01 09:00:00  3.0  NaN
2011-01-01 15:00:00  NaN  3.0
2011-01-01 16:00:00  NaN  3.0
2011-01-02 04:00:00  NaN  4.0
2011-01-02 05:00:00  NaN  1.0
2011-01-02 12:00:00  NaN  4.0
2011-01-02 16:00:00  1.0  NaN
2011-01-03 16:00:00  1.0  NaN

In [128]: gb.unstack(0).resample("D").sum().stack()
Out[128]:
                 val
datetime   cat
2011-01-01 a     4.0
           b    10.0
2011-01-02 a     1.0
           b     9.0
2011-01-03 a     1.0

Edit: For other resampling frequencies (month, year, etc.), the pandas resample documentation has a good list of options.
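
For the month and year levels asked about in the question, one option that sidesteps frequency strings is to group on values derived through the .dt accessor. The sketch below is only an illustration: the data is made up (not taken from the question) so that it spans several months and years.

```python
import pandas as pd

# Hypothetical data, invented for this example, spanning months and years.
df = pd.DataFrame({
    "cat": ["a", "a", "b", "b", "a", "b"],
    "datetime": pd.to_datetime([
        "2011-01-05 10:00", "2011-01-20 12:00", "2011-01-07 08:00",
        "2011-02-03 09:00", "2012-03-01 00:00", "2012-03-15 18:00",
    ]),
    "val": [1, 2, 3, 4, 5, 6],
})

# Month level: turn each timestamp into a monthly Period and group on it.
monthly = df.groupby(["cat", df["datetime"].dt.to_period("M")])["val"].sum()

# Year level: group on the integer year of each timestamp.
yearly = df.groupby(["cat", df["datetime"].dt.year])["val"].sum()

print(monthly)
print(yearly)
```

Grouping on a Period keeps year and month together, so January 2011 and January 2012 stay in separate groups, which grouping on dt.month alone would not do.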

