I have a DataFrame containing a time series (sample data is included below).
I would like to create multiple subsets of that DataFrame, each containing one week's worth of data, spanning from Sunday 0am to Saturday 0am.
I can think of a way to do that with rrule from dateutil, but it seems there might be a more intuitive/direct method using pandas Periods.
However, I am quite new to it, so I am not sure where to start looking. Ideally it would be something like:
Period= Sun 0am to Sat 0am
Subsets=[]
for Period in DataFrame:
Subsets.append(DataFrame[Period])
Something like that.....
Sample data (the DataFrame is built from the dict shown below):
pd.DataFrame(dict, columns=['timestamp','open','high','low','close','volume'])
dict={'volume': {Timestamp('2005-03-06 19:00:00'): 521.0, Timestamp('2005-03-06 20:00:00'): 234.0, Timestamp('2005-03-06 20:30:00'): 164.0, Timestamp('2005-03-06 21:00:00'): 99.0, Timestamp('2005-03-06 17:30:00'): 1603.0, Timestamp('2005-03-06 21:30:00'): 389.0, Timestamp('2005-03-06 18:00:00'): 590.0, Timestamp('2005-03-06 17:00:00'): 1668.0, Timestamp('2005-03-06 19:30:00'): 79.0, Timestamp('2005-03-06 18:30:00'): 213.0}, 'low': {Timestamp('2005-03-06 19:00:00'): 1226.25, Timestamp('2005-03-06 20:00:00'): 1226.0, Timestamp('2005-03-06 20:30:00'): 1226.0, Timestamp('2005-03-06 21:00:00'): 1226.0, Timestamp('2005-03-06 17:30:00'): 1225.75, Timestamp('2005-03-06 21:30:00'): 1225.5, Timestamp('2005-03-06 18:00:00'): 1226.75, Timestamp('2005-03-06 17:00:00'): 1225.0, Timestamp('2005-03-06 19:30:00'): 1226.25, Timestamp('2005-03-06 18:30:00'): 1226.75}, 'timestamp': {Timestamp('2005-03-06 19:00:00'): 732011.79166666663, Timestamp('2005-03-06 20:00:00'): 732011.83333333337, Timestamp('2005-03-06 20:30:00'): 732011.85416666663, Timestamp('2005-03-06 21:00:00'): 732011.875, Timestamp('2005-03-06 17:30:00'): 732011.72916666663, Timestamp('2005-03-06 21:30:00'): 732011.89583333337, Timestamp('2005-03-06 18:00:00'): 732011.75, Timestamp('2005-03-06 17:00:00'): 732011.70833333337, Timestamp('2005-03-06 19:30:00'): 732011.8125, Timestamp('2005-03-06 18:30:00'): 732011.77083333337}, 'open': {Timestamp('2005-03-06 19:00:00'): 1227.0, Timestamp('2005-03-06 20:00:00'): 1226.25, Timestamp('2005-03-06 20:30:00'): 1226.5, Timestamp('2005-03-06 21:00:00'): 1226.0, Timestamp('2005-03-06 17:30:00'): 1225.75, Timestamp('2005-03-06 21:30:00'): 1225.75, Timestamp('2005-03-06 18:00:00'): 1227.0, Timestamp('2005-03-06 17:00:00'): 1225.75, Timestamp('2005-03-06 19:30:00'): 1226.25, Timestamp('2005-03-06 18:30:00'): 1227.25}, 'high': {Timestamp('2005-03-06 19:00:00'): 1227.0, Timestamp('2005-03-06 20:00:00'): 1226.5, Timestamp('2005-03-06 20:30:00'): 1226.5, Timestamp('2005-03-06 21:00:00'): 
1226.25, Timestamp('2005-03-06 17:30:00'): 1227.5, Timestamp('2005-03-06 21:30:00'): 1226.0, Timestamp('2005-03-06 18:00:00'): 1227.5, Timestamp('2005-03-06 17:00:00'): 1226.25, Timestamp('2005-03-06 19:30:00'): 1226.75, Timestamp('2005-03-06 18:30:00'): 1227.5}, 'close': {Timestamp('2005-03-06 19:00:00'): 1226.25, Timestamp('2005-03-06 20:00:00'): 1226.25, Timestamp('2005-03-06 20:30:00'): 1226.0, Timestamp('2005-03-06 21:00:00'): 1226.0, Timestamp('2005-03-06 17:30:00'): 1227.0, Timestamp('2005-03-06 21:30:00'): 1225.5, Timestamp('2005-03-06 18:00:00'): 1227.25, Timestamp('2005-03-06 17:00:00'): 1225.5, Timestamp('2005-03-06 19:30:00'): 1226.5, Timestamp('2005-03-06 18:30:00'): 1226.75}}
You can use:
import pandas as pd

#sample dataframe
start = pd.to_datetime('2016-12-28')
rng = pd.date_range(start, periods=100, freq='100min')
df = pd.DataFrame({'timestamp': rng, 'X': range(100),
                   'id': ['a'] * 30 + ['b'] * 30 + ['c'] * 40})
df = df.set_index(['timestamp'])
#print (df)
First, filter out weekends with dayofweek and boolean indexing, if necessary:
#df = df[df.index.dayofweek < 5]
#print (df)
Then use period_range with a weekly frequency:
#first date in index
first_date = df.index[0]
#last date in index
last_date = df.index[-1]
per = pd.period_range(first_date,last_date, freq='W')
print (per)
PeriodIndex(['2016-12-26/2017-01-01',
'2017-01-02/2017-01-08'], dtype='period[W-SUN]', freq='W-SUN')
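Note that the plain 'W' alias resolves to 'W-SUN', so the periods above run Monday to Sunday. For weeks that start on Sunday, as the question asks, an anchored frequency such as 'W-SAT' (weekly periods ending on Saturday) may be a better fit. A minimal sketch, assuming the same sample index as above; the name per_sat is only illustrative:

```python
import pandas as pd

# Same sample index as in the answer: 100 rows, 100 minutes apart
rng = pd.date_range('2016-12-28', periods=100, freq='100min')
df = pd.DataFrame({'X': range(100)}, index=rng)

# 'W-SAT' = weekly periods that end on Saturday, so each one starts on Sunday
per_sat = pd.period_range(df.index[0], df.index[-1], freq='W-SAT')

# Show the boundaries of every period
for p in per_sat:
    print(p.start_time.day_name(), p.start_time, '->',
          p.end_time.day_name(), p.end_time)
```

The start_time and end_time attributes give the first and last instant of each period, which can be handy for checking that the boundaries fall where expected.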
Last, create Subsets with a list comprehension, converting each period with to_timestamp and selecting the values with loc:
Subsets = [ df.loc[x.to_timestamp('D', how='s'): x.to_timestamp('D', how='e')] for x in per]
#print (Subsets)
If loc cannot be used because the end points are not present in the DatetimeIndex, use boolean indexing:
# >= and <= keep rows that fall exactly on a period boundary (e.g. Monday 00:00)
Subsets = [df[(df.index >= x.to_timestamp('D', how='s')) &
              (df.index <= x.to_timestamp('D', how='e'))] for x in per]
#print (Subsets)
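As an alternative to looping over the periods, the rows can be grouped by the week each timestamp falls in. This is only a sketch, assuming the frame has a DatetimeIndex and that 'W-SAT' is the desired anchor (weeks running Sunday through Saturday); the names weekly and subsets are illustrative:

```python
import pandas as pd

# Same sample index as in the answer: 100 rows, 100 minutes apart
rng = pd.date_range('2016-12-28', periods=100, freq='100min')
df = pd.DataFrame({'X': range(100)}, index=rng)

# Label every row with the Sunday-to-Saturday week that contains it
weekly = df.index.to_period('W-SAT')

# One sub-DataFrame per week, in chronological order
subsets = [group for _, group in df.groupby(weekly)]

for s in subsets:
    print(s.index[0], '->', s.index[-1], len(s))
```

Every row lands in exactly one group, so nothing is dropped at the week boundaries. If Saturday rows should be excluded, as the "Sunday 0am to Saturday 0am" wording suggests, the frame could be filtered first with df[df.index.dayofweek != 5] (Monday is 0, Saturday is 5).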