
Keeping track of how many observations fall within a fixed time window when time delta is not constant

I have a dataframe with observations indexed by time, but the time delta between observations is not constant.

df
>>>
    TimeStamp              x1        x2
1   2015-03-01 19:05:01    0.812     18.23
2   2015-03-01 19:22:17    0.121     13.91
3   2015-03-01 19:24:34    0.822     15.10
4   2015-03-01 19:28:53    0.093     22.38
5   2015-03-01 21:49:57    0.291     22.90
6   2015-03-01 23:59:01    0.672     23.12
7   2015-03-02 02:30:01    0.421     28.56
8   2015-03-02 02:30:01    0.591     31.72
9   2015-03-02 02:31:17    0.811     21.71
10  2015-03-02 04:37:19    0.142     16.39

I want to count the number of observations that fall within a fixed time window of each sample.

If my time window is 10 minutes, I want the counts [0, 2, 1, 0, 0, 0, 2, 1, 0]: 0 samples were observed within 10 minutes after the first sample, 2 within 10 minutes after the second, 1 within 10 minutes after the third, and so on. Two observations can occur at the same time and still be different observations (as with 7 and 8).

If my time window is 1 hour, I want the counts [3, 2, 1, 0, 0, 0, 2, 1, 0], because 3 samples were observed within 1 hour after the first sample, and so on.

I have a function that does this, but it has two problems: 1) it is very slow because it iterates over the data row by row, and 2) it sometimes returns negative counts, which I find very strange because the timedelta is always >= 0.

import datetime as dt

import numpy as np
import pandas as pd

def get_count(data: pd.DataFrame, window_hours: int, window_minutes: int) -> np.ndarray:
    # we only want to iterate to the sample that is within window_hours + window_minutes from the end
    last_sample = data["TimeStamp"].iloc[-1] - dt.timedelta(days=0, hours=window_hours, minutes=window_minutes)
    count = np.empty(len(data[data["TimeStamp"] <= last_sample]), dtype=int)
    i = 0
    for index, row in data[data["TimeStamp"] <= last_sample].iterrows():
        # position of the last sample that falls inside the window of this row
        idx = np.where(data["TimeStamp"] <= (row["TimeStamp"] + dt.timedelta(days=0, hours=window_hours, minutes=window_minutes)))[0][-1]
        tmp = idx - index
        count[i] = tmp
        i += 1
    return count

Is there a way to do this in pure pandas / numpy (avoiding for loops) that is faster and also gives the desired output, which my method apparently does not?

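  • For each timestamp, build a boolean mask of the rows that fall inside its window, then call count() on them.
  • The window is flexible: the keyword arguments are passed straight through to pd.Timedelta.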
import io

import pandas as pd

df = pd.read_csv(io.StringIO("""   TimeStamp              x1        x2
1   2015-03-01 19:05:01    0.812     18.23
2   2015-03-01 19:22:17    0.121     13.91
3   2015-03-01 19:24:34    0.822     15.10
4   2015-03-01 19:28:53    0.093     22.38
5   2015-03-01 21:49:57    0.291     22.90
6   2015-03-01 23:59:01    0.672     23.12
7   2015-03-02 02:30:01    0.421     28.56
8   2015-03-02 02:30:01    0.591     31.72
9   2015-03-02 02:31:17    0.811     21.71
10  2015-03-02 04:37:19    0.142     16.39"""), sep=r"\s\s+", engine="python")

df.TimeStamp = pd.to_datetime(df.TimeStamp)

def within(dfa, **kwargs):
    return dfa.TimeStamp.apply(lambda t: dfa.loc[dfa.TimeStamp.gt(t) & 
                                                 dfa.TimeStamp.le(t+pd.Timedelta(**kwargs)),
                                                 "TimeStamp"].count())

df["10min"] = within(df, minutes=10)
df["4hour"] = within(df, hours=4)

    TimeStamp            x1     x2     10min  4hour
1   2015-03-01 19:05:01  0.812  18.23  0      4
2   2015-03-01 19:22:17  0.121  13.91  2      3
3   2015-03-01 19:24:34  0.822  15.1   1      2
4   2015-03-01 19:28:53  0.093  22.38  0      1
5   2015-03-01 21:49:57  0.291  22.9   0      1
6   2015-03-01 23:59:01  0.672  23.12  0      3
7   2015-03-02 02:30:01  0.421  28.56  1      2
8   2015-03-02 02:30:01  0.591  31.72  1      2
9   2015-03-02 02:31:17  0.811  21.71  0      1
10  2015-03-02 04:37:19  0.142  16.39  0      0
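
Two remarks on the above, offered as a sketch rather than a definitive fix. First, the negative counts in the question appear to come from `idx - index`: `idx` is a 0-based position returned by `np.where`, while `index` is the row's index label (1-based in the sample data), so a row with nothing after it in its window yields -1. Second, the `apply`-based version still evaluates one mask per row. If the timestamps are sorted in ascending order (an assumption, though it holds for the sample data), `np.searchsorted` can find every window boundary in a single vectorized call. The function name `count_following` and its `strict` flag below are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd


def count_following(timestamps: pd.Series, window: pd.Timedelta,
                    strict: bool = False) -> np.ndarray:
    """For each row, count later rows with timestamp <= own timestamp + window.

    Assumes `timestamps` is sorted in ascending order.
    """
    ts = timestamps.to_numpy()
    window_end = (timestamps + window).to_numpy()
    # Position one past the last timestamp that is <= t + window, for all t at once.
    upper = np.searchsorted(ts, window_end, side="right")
    if strict:
        # Count only timestamps strictly greater than t, so ties are excluded
        # (this mirrors the gt/le mask used in the answer above).
        lower = np.searchsorted(ts, ts, side="right")
    else:
        # Count every later row by position, so rows sharing a timestamp
        # are still counted (this mirrors the counts asked for in the question).
        lower = np.arange(len(ts)) + 1
    return upper - lower


# The ten timestamps from the question.
stamps = pd.Series(pd.to_datetime([
    "2015-03-01 19:05:01", "2015-03-01 19:22:17", "2015-03-01 19:24:34",
    "2015-03-01 19:28:53", "2015-03-01 21:49:57", "2015-03-01 23:59:01",
    "2015-03-02 02:30:01", "2015-03-02 02:30:01", "2015-03-02 02:31:17",
    "2015-03-02 04:37:19",
]))

counts_10min = count_following(stamps, pd.Timedelta(minutes=10))
counts_10min_strict = count_following(stamps, pd.Timedelta(minutes=10), strict=True)
```

With `strict=False`, rows 7 and 8 (which share a timestamp) are treated as distinct observations, so row 7 counts row 8. With `strict=True`, the result matches the `10min` column produced by `within`. Because the subtraction uses positions only, the result cannot be negative, whatever the index labels are.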
