
How Can I Detect Gaps and Consecutive Periods In A Time Series In Pandas

I have a pandas DataFrame that is indexed by date. I would like to select all consecutive gaps by period and all consecutive days by period. How can I do this?

Example of a DataFrame with no columns but a date index:

In [29]: import pandas as pd

In [30]: dates = pd.to_datetime(['2016-09-19 10:23:03', '2016-08-03 10:53:39','2016-09-05 11:11:30', '2016-09-05 11:10:46','2016-09-05 10:53:39'])

In [31]: ts = pd.DataFrame(index=dates)

As you can see, there is a gap between 2016-08-03 and 2016-09-19. How do I detect these so I can create descriptive statistics, i.e. 40 gaps with a median gap duration of "x", etc.? Also, I can see that 2016-09-05 and 2016-09-06 form a two-day range. How can I detect these and also print descriptive stats?

Ideally the result would be returned as another DataFrame in each case, since I want to use other columns in the DataFrame to group by.

Pandas has a built-in method DataFrame.diff() (available well before version 1.0.1) which you can use to accomplish this. One benefit is that you can use pandas Series functions like mean() to quickly compute summary statistics on the resulting gaps series.

from datetime import datetime, timedelta
import pandas as pd

# Construct dummy dataframe
dates = pd.to_datetime([
    '2016-08-03',
    '2016-08-04',
    '2016-08-05',
    '2016-08-17',
    '2016-09-05',
    '2016-09-06',
    '2016-09-07',
    '2016-09-19'])
df = pd.DataFrame(dates, columns=['date'])

# Take the diff of the first column (drop 1st row since it's undefined)
deltas = df['date'].diff()[1:]

# Filter diffs (here days > 1, but could be seconds, hours, etc)
gaps = deltas[deltas > timedelta(days=1)]

# Print results
print(f'{len(gaps)} gaps with average gap duration: {gaps.mean()}')
for i, g in gaps.items():
    gap_start = df['date'][i - 1]
    print(f'Start: {datetime.strftime(gap_start, "%Y-%m-%d")} | '
          f'Duration: {str(g.to_pytimedelta())}')
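The question also asks for the result as another DataFrame so it can be grouped with other columns. A minimal sketch building on the gaps and df objects above (the gaps_df name and its columns are illustrative, not part of the original answer):

# Collect each gap's start time and duration into a DataFrame (illustrative helper)
gaps_df = pd.DataFrame({
    'gap_start': df['date'].shift(1)[gaps.index],  # the row just before each large delta
    'duration': gaps,
}).reset_index(drop=True)

print(gaps_df)
print(f'Median gap duration: {gaps_df["duration"].median()}')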

Here's something to get started:

import numpy as np
import pandas as pd

# Original timestamps, plus a daily range covering the whole span
df = pd.DataFrame(np.ones(5), columns=['ones'])
df.index = pd.DatetimeIndex(['2016-09-19 10:23:03', '2016-08-03 10:53:39', '2016-09-05 11:11:30', '2016-09-05 11:10:46', '2016-09-06 10:53:39'])
daily_rng = pd.date_range('2016-08-03 00:00:00', periods=48, freq='D')
daily_rng = daily_rng.append(df.index)
daily_rng = sorted(daily_rng)
# Reindex on the combined range: filler days get 0, observed timestamps keep 1
df = df.reindex(daily_rng).fillna(0)
df = df.astype(int)
df['ones'] = df['ones'].cumsum()

The cumsum() creates a grouping variable in 'ones', partitioning your data at the timestamps you provided. If you print df (or dump it to a spreadsheet), it will make sense:

print(df.head())

                     ones
2016-08-03 00:00:00     0
2016-08-03 10:53:39     1
2016-08-04 00:00:00     1
2016-08-05 00:00:00     1
2016-08-06 00:00:00     1

print(df.tail())
                     ones
2016-09-16 00:00:00     4
2016-09-17 00:00:00     4
2016-09-18 00:00:00     4
2016-09-19 00:00:00     4
2016-09-19 10:23:03     5

Now, to complete:

df = df.reset_index()
# Named aggregation: earliest timestamp and row count per group
df = df.groupby('ones').agg(first_spotted=('index', 'min'), gaps=('index', 'count'))

which gives:

           first_spotted  gaps
ones                          
0    2016-08-03 00:00:00     1
1    2016-08-03 10:53:39    34
2    2016-09-05 11:10:46     1
3    2016-09-05 11:11:30     2
4    2016-09-06 10:53:39    14
5    2016-09-19 10:23:03     1
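Because each group covers one run of consecutive rows (either observed timestamps or the dummy daily fillers), you can compute the descriptive statistics the question asks for from this grouped frame. A rough sketch, assuming the df produced by the aggregation above (with the first_spotted and gaps columns defined there):

# Summary statistics over the groups (sketch; 'gaps' here is the row count per group)
print('number of groups:', len(df))
print('median rows per group:', df['gaps'].median())
print('longest group starts at:', df.loc[df['gaps'].idxmax(), 'first_spotted'])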
