简体   繁体   中英

Pandas how to split dataframe by column by interval

I have a gigantic dataframe with a datetime type column called dt , the data frame is sorted based on dt already. I want to split the dataframe into several dataframes based on dt , each dataframe contains rows within 1 hr range.

Split

   dt                    text
0  20160811 11:05        a
1  20160811 11:35        b
2  20160811 12:03        c
3  20160811 12:36        d
4  20160811 12:52        e
5  20160811 14:32        f

into

   dt                    text
0  20160811 11:05        a
1  20160811 11:35        b
2  20160811 12:03        c

   dt                    text
0  20160811 12:36        d
1  20160811 12:52        e

   dt                    text 
0  20160811 14:32        f

You need groupby by difference of first value of column dt converted to hour by astype :

S = pd.to_datetime(df.dt)
for i, g in df.groupby([(S - S[0]).astype('timedelta64[h]')]):
        print (g.reset_index(drop=True))

               dt text
0  20160811 11:05    a
1  20160811 11:35    b
2  20160811 12:03    c
               dt text
0  20160811 12:36    d
1  20160811 12:52    e
               dt text
0  20160811 14:32    f

List comprehension solution:

S = pd.to_datetime(df.dt)

print ((S - S[0]).astype('timedelta64[h]'))
0    0.0
1    0.0
2    0.0
3    1.0
4    1.0
5    3.0
Name: dt, dtype: float64

L = [g.reset_index(drop=True) for i, g in df.groupby([(S - S[0]).astype('timedelta64[h]')])]

print (L[0])
               dt text
0  20160811 11:05    a
1  20160811 11:35    b
2  20160811 12:03    c

print (L[1])
               dt text
0  20160811 12:36    d
1  20160811 12:52    e

print (L[2])
               dt text
0  20160811 14:32    f

Old solution, which split by hour :

You can use groupby by dt.hour , but first need convert dt to_datetime :

for i, g in df.groupby([pd.to_datetime(df.dt).dt.hour]):
    print (g.reset_index(drop=True))

               dt text
0  20160811 11:05    a
1  20160811 11:35    b
               dt text
0  20160811 12:03    c
1  20160811 12:36    d
2  20160811 12:52    e
               dt text
0  20160811 14:32    f

List comprehension solution:

L = [g.reset_index(drop=True) for i, g in df.groupby([pd.to_datetime(df.dt).dt.hour])]

print (L[0])
               dt text
0  20160811 11:05    a
1  20160811 11:35    b

print (L[1])
               dt text
0  20160811 12:03    c
1  20160811 12:36    d
2  20160811 12:52    e

print (L[2])
               dt text
0  20160811 14:32    f

Or use list comprehension with converting column dt to datetime :

df.dt = pd.to_datetime(df.dt)
L =[g.reset_index(drop=True) for i, g in df.groupby([df['dt'].dt.hour])]

print (L[1])
                   dt text
0 2016-08-11 12:03:00    c
1 2016-08-11 12:36:00    d
2 2016-08-11 12:52:00    e

print (L[2])
                   dt text
0 2016-08-11 14:32:00    f

If need split by date s and hour s:

#changed dataframe for testing
print (df)
               dt text
0  20160811 11:05    a
1  20160812 11:35    b
2  20160813 12:03    c
3  20160811 12:36    d
4  20160811 12:52    e
5  20160811 14:32    f

serie = pd.to_datetime(df.dt)
for i, g in df.groupby([serie.dt.date, serie.dt.hour]):
    print (g.reset_index(drop=True))
               dt text
0  20160811 11:05    a
               dt text
0  20160811 12:36    d
1  20160811 12:52    e
               dt text
0  20160811 14:32    f
               dt text
0  20160812 11:35    b
               dt text
0  20160813 12:03    c    

take the difference of dates with first date and group by total_seconds

df.groupby((df.dt - df.dt[0]).dt.total_seconds() // 3600,
           as_index=False).apply(pd.DataFrame.reset_index, drop=True)

在此输入图像描述

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM