How to compute the time difference between entries?

Suppose I have the following Pandas DataFrame. I want to compute the time (in seconds) since the last observation of each ip. Notice that the data is not necessarily ordered.

import pandas as pd

data = {'ip': [123, 326, 123, 326], 'hour': [14, 12, 12, 1], 'minute': [54, 23, 41, 8], 'second': [45, 29, 19, 33]}

df = pd.DataFrame(data)

    ip  hour  minute  second
0  123    14      54      45
1  326    12      23      29
2  123    12      41      19
3  326     1       8      33

For example, I would like to add a column that, on the first entry, says that when ip 123 was captured for the second time, the equivalent in seconds of (14:54:45 - 12:41:19) had elapsed since its last appearance in the dataset.

I am trying something with groupby but with no success. Any ideas?

Thanks in advance!!!

You can convert your hour, minute, and second columns to datetime format using to_datetime, then group by ip and take the difference (diff):

# Zero-pad each part first; otherwise 1, 8, 33 joins to '1833' and is parsed as 18:03:03.
parts = df[['hour', 'minute', 'second']].astype(str).apply(lambda col: col.str.zfill(2))
df['Time'] = pd.to_datetime(parts.apply(''.join, axis=1), format='%H%M%S')

df['Yourneed'] = df.groupby('ip').Time.diff().dt.total_seconds()
df
    ip  hour  minute  second                Time  Yourneed
0  123    14      54      45 1900-01-01 14:54:45       NaN
1  326    12      23      29 1900-01-01 12:23:29       NaN
2  123    12      41      19 1900-01-01 12:41:19   -8006.0
3  326     1       8      33 1900-01-01 01:08:33  -40496.0
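
As a possible alternative that sidesteps string parsing altogether, the time of day can be turned into seconds since midnight with plain arithmetic before taking the per-ip diff. This is only a minimal sketch, assuming all rows fall on the same day; the column names t and elapsed are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ip': [123, 326, 123, 326],
    'hour': [14, 12, 12, 1],
    'minute': [54, 23, 41, 8],
    'second': [45, 29, 19, 33],
})

# Seconds since midnight, computed arithmetically (no string formatting involved).
# 't' is a hypothetical helper column, not part of the original data.
df['t'] = df['hour'] * 3600 + df['minute'] * 60 + df['second']

# Difference to the previous row of the same ip, in the original row order.
# The first row of each ip has no predecessor, so it gets NaN.
df['elapsed'] = df.groupby('ip')['t'].diff()

print(df)
```

Because the diff follows row order, the values are negative here: in this dataset the later time of each ip comes first.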

You were close with the groupby. Creating a proper datetime column was probably the missing piece:

from datetime import datetime
import pandas

def row_to_date(row):
    today = datetime.today()
    return datetime(
        today.year,
        today.month,
        today.day,
        row['hour'],
        row['minute'],
        row['second']
    )


data = {
    'ip':[123, 326, 123, 326],
    'hour': [14, 12, 12, 1],
    'minute': [54, 23, 41, 8],
    'second': [45, 29, 19, 33]
}


df = (
    pandas.DataFrame(data)
        .assign(date=lambda df: df.apply(row_to_date, axis=1))
        .groupby(by=['ip'])
        .apply(lambda g: g.diff()['date'].dt.total_seconds())
        .dropna()
        .to_frame('elapsed_seconds')
        .reset_index(level=1, drop=True)
)
df

And so I get:

     elapsed_seconds
ip                  
123          -8006.0
326         -40496.0
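
Both results above are negative because the diff follows the original row order, in which the later observation of each ip comes first. If the goal is to record the elapsed time on the later observation, as the question describes, one option is to sort by time before grouping. The following is a sketch under the assumption that all observations fall on the same day; the column names time and elapsed_seconds are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ip': [123, 326, 123, 326],
    'hour': [14, 12, 12, 1],
    'minute': [54, 23, 41, 8],
    'second': [45, 29, 19, 33],
})

# Time of day as a Timedelta (assumption: every row belongs to the same day).
df['time'] = (
    pd.to_timedelta(df['hour'], unit='h')
    + pd.to_timedelta(df['minute'], unit='min')
    + pd.to_timedelta(df['second'], unit='s')
)

# Sort chronologically, then diff within each ip. The result keeps the
# original index labels, so assigning it back aligns with the unsorted rows.
df['elapsed_seconds'] = (
    df.sort_values('time')
      .groupby('ip')['time']
      .diff()
      .dt.total_seconds()
)

print(df[['ip', 'time', 'elapsed_seconds']])
```

With this ordering, the earliest observation of each ip gets NaN and every later one gets a non-negative number of seconds since the previous observation.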
