Group by year/month/day in pandas
Suppose I have the following DataFrame:
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2011', periods=72, freq='H')
np.random.seed(10)
n = 10
df = pd.DataFrame(
    {
        "datetime": np.random.choice(rng, n),
        "cat": np.random.choice(['a', 'b', 'b'], n),
        "val": np.random.randint(0, 5, size=n)
    }
)
If I now do a groupby:
gb = df.groupby(['cat','datetime']).sum()
I get the sum for each cat per hour:
cat  datetime             val
a    2011-01-01 00:00:00    1
     2011-01-01 09:00:00    3
     2011-01-02 16:00:00    1
     2011-01-03 16:00:00    1
b    2011-01-01 08:00:00    4
     2011-01-01 15:00:00    3
     2011-01-01 16:00:00    3
     2011-01-02 04:00:00    4
     2011-01-02 05:00:00    1
     2011-01-02 12:00:00    4
However, I would like to get something like this:
cat  datetime    val
a    2011-01-01    4
     2011-01-02    1
     2011-01-03    1
b    2011-01-01   10
     2011-01-02    9
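I can get the desired result by adding another column named date:
df['date'] = df['datetime'].dt.date
and then doing a similar groupby:
df.groupby(['cat', 'date'])[['val']].sum()
But I am wondering whether there is a more pythonic way. I might also want to aggregate at the month or year level. What is the right way to do this?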
You can try set_index and then groupby by cat and the date:
import pandas as pd
import numpy as np

rng = pd.date_range('1/1/2011', periods=72, freq='H')
np.random.seed(10)
n = 10
df = pd.DataFrame(
    {
        "datetime": np.random.choice(rng, n),
        "cat": np.random.choice(['a', 'b', 'b'], n),
        "val": np.random.randint(0, 5, size=n)
    }
)
print(df)
  cat            datetime  val
0   a 2011-01-01 09:00:00    3
1   b 2011-01-01 15:00:00    3
2   a 2011-01-03 16:00:00    1
3   b 2011-01-02 04:00:00    4
4   b 2011-01-02 05:00:00    1
5   b 2011-01-01 08:00:00    4
6   a 2011-01-01 00:00:00    1
7   a 2011-01-02 16:00:00    1
8   b 2011-01-02 12:00:00    4
9   b 2011-01-01 16:00:00    3
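df = df.set_index('datetime')
# A callable grouper is applied to each index value (a Timestamp),
# so .date() has to be called to get the calendar date.
gb = df.groupby(['cat', lambda ts: ts.date()]).sum()
print(gb)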
                val
cat
a   2011-01-01    4
    2011-01-02    1
    2011-01-03    1
b   2011-01-01   10
    2011-01-02    9
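A related option, not part of the answer above, is to leave the index alone and pass a `pd.Grouper` for the datetime column. The sketch below rebuilds the ten rows printed above by hand (so it does not depend on the random seed) and assumes daily bins via `freq="D"`; the name `daily` is only illustrative.

```python
import pandas as pd

# The ten rows printed above, written out by hand so the example
# does not depend on NumPy's random number generator.
df = pd.DataFrame(
    {
        "cat": ["a", "b", "a", "b", "b", "b", "a", "a", "b", "b"],
        "datetime": pd.to_datetime(
            [
                "2011-01-01 09:00:00", "2011-01-01 15:00:00",
                "2011-01-03 16:00:00", "2011-01-02 04:00:00",
                "2011-01-02 05:00:00", "2011-01-01 08:00:00",
                "2011-01-01 00:00:00", "2011-01-02 16:00:00",
                "2011-01-02 12:00:00", "2011-01-01 16:00:00",
            ]
        ),
        "val": [3, 3, 1, 4, 1, 4, 1, 1, 4, 3],
    }
)

# Group by category and by calendar day of the "datetime" column.
# Selecting "val" keeps the aggregation restricted to the numeric column.
daily = df.groupby(["cat", pd.Grouper(key="datetime", freq="D")])["val"].sum()
print(daily)
```

Here `key` names the column to bin, so no extra `date` column and no `set_index` call is needed; the per-day sums for each category should agree with the table above.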
From your intermediate structure, you can use .unstack to separate the categories, do a .resample, and then .stack again to get back to the original form:
In [126]: gb = df.groupby(['cat', 'datetime']).sum()

In [127]: gb.unstack(0)
Out[127]:
                     val
cat                    a    b
datetime
2011-01-01 00:00:00  1.0  NaN
2011-01-01 08:00:00  NaN  4.0
2011-01-01 09:00:00  3.0  NaN
2011-01-01 15:00:00  NaN  3.0
2011-01-01 16:00:00  NaN  3.0
2011-01-02 04:00:00  NaN  4.0
2011-01-02 05:00:00  NaN  1.0
2011-01-02 12:00:00  NaN  4.0
2011-01-02 16:00:00  1.0  NaN
2011-01-03 16:00:00  1.0  NaN

In [128]: gb.unstack(0).resample("D").sum().stack()
Out[128]:
                 val
datetime   cat
2011-01-01 a     4.0
           b    10.0
2011-01-02 a     1.0
           b     9.0
2011-01-03 a     1.0
Edit: for other resampling frequencies (month, year, etc.), the pandas resample documentation has a good list of options.
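For the month and year levels asked about in the question, one possible sketch groups on values derived through the `.dt` accessor rather than on a resample frequency string. It assumes the same ten rows as the printed DataFrame, rebuilt by hand; `monthly` and `yearly` are illustrative names, not anything from the answers.

```python
import pandas as pd

# Same ten rows as the DataFrame printed in the first answer.
df = pd.DataFrame(
    {
        "cat": ["a", "b", "a", "b", "b", "b", "a", "a", "b", "b"],
        "datetime": pd.to_datetime(
            [
                "2011-01-01 09:00:00", "2011-01-01 15:00:00",
                "2011-01-03 16:00:00", "2011-01-02 04:00:00",
                "2011-01-02 05:00:00", "2011-01-01 08:00:00",
                "2011-01-01 00:00:00", "2011-01-02 16:00:00",
                "2011-01-02 12:00:00", "2011-01-01 16:00:00",
            ]
        ),
        "val": [3, 3, 1, 4, 1, 4, 1, 1, 4, 3],
    }
)

# Month level: convert each timestamp to a monthly Period and group on it.
monthly = df.groupby(["cat", df["datetime"].dt.to_period("M")])["val"].sum()

# Year level: group on the integer year extracted from each timestamp.
yearly = df.groupby(["cat", df["datetime"].dt.year])["val"].sum()

print(monthly)
print(yearly)
```

Since all sample timestamps fall in January 2011, both results have one row per category (6 for `a`, 19 for `b`). Grouping on a derived Series like this only produces groups that actually occur in the data, whereas `resample` also emits empty bins for periods with no rows.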