Include a row in multiple groupby groups

Question

I am grouping a time series by hour so that I can operate on each hour's data separately:

import pandas as pd
from datetime import datetime, timedelta

x = [2, 2, 4, 2, 2, 0]
idx = pd.date_range(
    start=datetime(2019, 1, 1),
    end=datetime(2019, 1, 1, 2, 30),
    freq=timedelta(minutes=30),
)

s = pd.Series(x, index=idx)
hourly = s.groupby(lambda x: x.hour)

print(s)
print("summed:")
print(hourly.sum())

This produces:

2019-01-01 00:00:00    2
2019-01-01 00:30:00    2
2019-01-01 01:00:00    4
2019-01-01 01:30:00    2
2019-01-01 02:00:00    2
2019-01-01 02:30:00    0
Freq: 30T, dtype: int64
summed:
0    4
1    6
2    2
dtype: int64

as expected.

I now want the area under the time series for each hour, for which I can use numpy.trapz:

import numpy as np

def series_trapz(series):
    hours = [i.timestamp() / 3600 for i in series.index]
    return np.trapz(series, x=hours)

print("Area under curve")
print(hourly.agg(series_trapz))

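But for this to work correctly, the row on the boundary between two groups has to appear in both groups!

For example, the first group has to be: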

2019-01-01 00:00:00    2
2019-01-01 00:30:00    2
2019-01-01 01:00:00    4

and the second group has to be:

2019-01-01 01:00:00    4
2019-01-01 01:30:00    2
2019-01-01 02:00:00    2

and so on.

Is this possible at all with pandas.groupby?
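To make the boundary problem concrete, here is a minimal sketch that compares plain hourly groups with overlapping one-hour windows built by label slicing instead of groupby. The helper `area_in_hours` and the `trapezoid` alias are names invented for this example; the alias is only there on the assumption that newer NumPy releases expose the function as `np.trapezoid` while older ones only have `np.trapz`.

```python
import numpy as np
import pandas as pd

# Assumption: newer NumPy exposes np.trapezoid, older releases only np.trapz.
trapezoid = getattr(np, "trapezoid", None) or np.trapz

s = pd.Series(
    [2, 2, 4, 2, 2, 0],
    index=pd.date_range("2019-01-01 00:00", "2019-01-01 02:30", freq="30min"),
)
one_hour = pd.Timedelta(hours=1)


def area_in_hours(window):
    """Trapezoid-rule area of a time-indexed Series, time measured in hours."""
    hours = np.asarray((window.index - window.index[0]) / one_hour, dtype=float)
    return float(trapezoid(window.to_numpy(dtype=float), x=hours))


# Plain hourly groups: the segment that crosses each hour boundary is lost.
naive = s.groupby(s.index.floor("h")).apply(area_in_hours)

# Overlapping windows: .loc slicing on a DatetimeIndex is inclusive on both
# ends, so the row at start + 1h appears in this window and in the next one.
overlapping = pd.Series(
    {start: area_in_hours(s.loc[start:start + one_hour]) for start in naive.index}
)

print(naive)
print(overlapping)
```

For the sample data the plain groups give 1.0, 1.5 and 0.5, while the overlapping windows give 2.5, 2.5 and 0.5. The last window is only half an hour long because the data stops at 02:30.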

Answer 1

I think you can use Series.repeat to duplicate the rows of your series that sit on a group boundary:

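# Repeat on-the-hour rows twice, every other row once.
r = (s.index.minute == 0).astype(int) + 1
r[0] = 1  # the very first row is not a shared boundary, so keep a single copy
new_s = s.repeat(r)
print(new_s)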

2019-01-01 00:00:00    2
2019-01-01 00:30:00    2
2019-01-01 01:00:00    4
2019-01-01 01:00:00    4
2019-01-01 01:30:00    2
2019-01-01 02:00:00    2
2019-01-01 02:00:00    2
2019-01-01 02:30:00    0

Then you can use Series.groupby:

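# Recent pandas versions reject an integer fill_value for datetime data, so pass a Timestamp.
next_ts = new_s.index.to_series().shift(-1, fill_value=pd.Timestamp(0))
# The counter increments on every row that is followed by an off-the-hour
# timestamp, so the second copy of each boundary row opens a new group.
groups = (next_ts.dt.minute != 0).cumsum()
for i, group in new_s.groupby(groups):
    print(group)
    print('-' * 50)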

2019-01-01 00:00:00    2
2019-01-01 00:30:00    2
2019-01-01 01:00:00    4
Name: col1, dtype: int64
--------------------------------------------------
2019-01-01 01:00:00    4
2019-01-01 01:30:00    2
2019-01-01 02:00:00    2
Name: col1, dtype: int64
--------------------------------------------------
2019-01-01 02:00:00    2
2019-01-01 02:30:00    0
Name: col1, dtype: int64
--------------------------------------------------
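Once the boundary rows are duplicated, the area per group follows directly. The sketch below is one possible way to finish the idea, not the answerer's code: it builds the group key with `Index.duplicated` rather than the shift-based key above, and `series_trapz`, `trapezoid`, `repeats` and `group_id` are names chosen for this example. It assumes the series starts on the hour and that every hour boundary is present in the index.

```python
import numpy as np
import pandas as pd

# Assumption: newer NumPy exposes np.trapezoid, older releases only np.trapz.
trapezoid = getattr(np, "trapezoid", None) or np.trapz

s = pd.Series(
    [2, 2, 4, 2, 2, 0],
    index=pd.date_range("2019-01-01 00:00", "2019-01-01 02:30", freq="30min"),
)

# Duplicate every on-the-hour row except the first one, which has no
# preceding group to share it with.
repeats = np.where(s.index.minute == 0, 2, 1)
repeats[0] = 1
new_s = s.repeat(repeats)

# The second copy of each duplicated timestamp starts a new group.
group_id = new_s.index.duplicated(keep="first").cumsum()


def series_trapz(group):
    """Trapezoid-rule area of one group, with time measured in hours."""
    hours = (group.index - group.index[0]) / pd.Timedelta(hours=1)
    return float(trapezoid(group.to_numpy(dtype=float), x=np.asarray(hours, dtype=float)))


areas = new_s.groupby(group_id).apply(series_trapz)
print(areas)
```

For the sample data this gives 2.5, 2.5 and 0.5 for the three groups.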

Answer 2

I don't think your np.trapz logic is entirely right here, but I think you can get what you want with .rolling(..., closed="both"), so that the endpoints of each interval are always included:

In [366]: s.rolling("1H", closed="both").apply(np.trapz).iloc[::2]
Out[366]:
2019-01-01 00:00:00    0.0
2019-01-01 01:00:00    5.0
2019-01-01 02:00:00    5.0
Freq: 60T, dtype: float64
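Note that np.trapz called without `x` or `dx` assumes a spacing of 1 between samples, so the 5.0 values above are measured in 30-minute steps rather than hours. A minimal sketch of the same rolling approach with the spacing passed explicitly, assuming the series is uniformly sampled (the names `step_hours`, `rolled` and `on_the_hour` are made up for this example):

```python
import numpy as np
import pandas as pd

# Assumption: newer NumPy exposes np.trapezoid, older releases only np.trapz.
trapezoid = getattr(np, "trapezoid", None) or np.trapz

s = pd.Series(
    [2, 2, 4, 2, 2, 0],
    index=pd.date_range("2019-01-01 00:00", "2019-01-01 02:30", freq="30min"),
)

# Sample spacing in hours; only valid because the index is evenly spaced.
step_hours = (s.index[1] - s.index[0]) / pd.Timedelta(hours=1)

# closed="both" makes each window cover [t - 1h, t], including both endpoints.
rolled = s.rolling(pd.Timedelta(hours=1), closed="both").apply(
    lambda values: trapezoid(values, dx=step_hours), raw=True
)

# Keep only the windows that end exactly on the hour.
on_the_hour = rolled[rolled.index.minute == 0]
print(on_the_hour)
```

Each value is labelled with the right edge of its window, so the entry at 01:00 is the area between 00:00 and 01:00. The half hour after 02:00 is not covered by any on-the-hour window.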

Answer 3

IIUC, this can be solved manually with rolling:

hours = np.unique(s.index.floor('H'))

# the answer:
(s.add(s.shift())
  .mul(s.index.to_series()
        .diff()
        .dt.total_seconds()
        .div(3600)
      )
   .rolling('1H').sum()[hours]
)

Output:

2019-01-01 00:00:00    NaN
2019-01-01 01:00:00    5.0
2019-01-01 02:00:00    5.0
dtype: float64
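As far as I can tell, the expression above multiplies the sum of two neighbouring samples by the time step, whereas the trapezoid rule also divides that sum by 2, so the 5.0 values are twice the area in hours. Below is a sketch of a related manual approach that includes the factor 1/2 and swaps the rolling sum for a groupby on the hour in which each segment starts. It assumes every hour boundary is itself a sample, and the names `dt_hours`, `segment_area` and `segment_start` are invented for this example.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [2, 2, 4, 2, 2, 0],
    index=pd.date_range("2019-01-01 00:00", "2019-01-01 02:30", freq="30min"),
)

timestamps = s.index.to_series()

# Width of the segment ending at each sample, in hours (NaN for the first row).
dt_hours = timestamps.diff().dt.total_seconds().div(3600)

# Trapezoid area of the segment between the previous sample and this one.
segment_area = s.add(s.shift()).div(2).mul(dt_hours)

# Timestamp at which each segment starts (NaT for the first row).
segment_start = timestamps.shift()

# Drop the first row (no segment ends there) and sum the segments by the
# hour in which they start.
areas = segment_area.iloc[1:].groupby(segment_start.iloc[1:].dt.floor("h")).sum()
print(areas)
```

Because every segment lies entirely within one hour, no row has to be duplicated. For the sample data the result is 2.5, 2.5 and 0.5, labelled 00:00, 01:00 and 02:00.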

3 answers

Answer 1
1 2019-11-28 17:30:19

Answer 2
1 2019-11-28 17:37:07

Answer 3
0 2019-11-28 17:36:53
