
Time series analysis in pandas

I have a pandas DataFrame containing the visits to a website. It has two columns: the ID number and the date, in the format YYYY-mm-dd HH:mm:ss.

I would like to get a DataFrame containing the time difference between consecutive visits of each customer. I found how to get the number of visits using groupby, but I don't know how to do the rest.

Edit:

No.      IDs      date
 1      4678     2012-11-30 23:59:59
 2      4703     2012-11-30 23:59:23
 3      4678     2012-11-30 23:58:46
 4      5803     2012-11-30 23:58:19
 5      4678     2012-11-30 23:58:07

And I would like to get for each ID number something like this:

      Visit_number      duration since last visit
4678        1                    0
            2                    73s
            3                    39s

So far I have only managed to calculate the number of visits for each ID number, with array.groupby(['IDs']).size()

To calculate the visit number, you can use groupby and cumcount:

In [76]: df['Visit_Number'] = df.groupby('IDs').cumcount() + 1

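Next, for the duration, you can use diff within each group. The rows are ordered newest first, so the raw differences are negative; negating them gives positive durations: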

In [77]: df['duration'] = - df.groupby('IDs')['date'].diff()


In [78]: df
Out[78]: 
    IDs                date  Visit_Number  duration
0  4678 2012-11-30 23:59:59             1       NaT
1  4703 2012-11-30 23:59:23             1       NaT
2  4678 2012-11-30 23:58:46             2  00:01:13
3  5803 2012-11-30 23:58:19             1       NaT
4  4678 2012-11-30 23:58:07             3  00:00:39
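
For anyone who wants to reproduce the two steps above without the original file, here is a minimal self-contained sketch. It assumes the five sample rows from the question are the whole data set and that the date column has already been parsed into datetimes.

```python
import pandas as pd

# Sample data copied from the question; rows are ordered newest first.
df = pd.DataFrame({
    "IDs": [4678, 4703, 4678, 5803, 4678],
    "date": pd.to_datetime([
        "2012-11-30 23:59:59",
        "2012-11-30 23:59:23",
        "2012-11-30 23:58:46",
        "2012-11-30 23:58:19",
        "2012-11-30 23:58:07",
    ]),
})

# Number each visit within its ID, following the row order (1, 2, 3, ...).
df["Visit_Number"] = df.groupby("IDs").cumcount() + 1

# diff() subtracts the previous row of the same group; the sign is flipped
# because each row is older than the one before it.
df["duration"] = -df.groupby("IDs")["date"].diff()

print(df)
```

Note that both cumcount and diff follow the current row order, so the result depends on how the frame is sorted.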

This gives you the difference as a timedelta. To convert it to whole seconds and fill the missing (NaT) values with 0:

In [79]: df['duration'] = df['duration'].dt.total_seconds().fillna(0).astype(int)

In [80]: df
Out[80]: 
    IDs                date  Visit_Number  duration
0  4678 2012-11-30 23:59:59             1         0
1  4703 2012-11-30 23:59:23             1         0
2  4678 2012-11-30 23:58:46             2        73
3  5803 2012-11-30 23:58:19             1         0
4  4678 2012-11-30 23:58:07             3        39
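
The question asks for the result laid out per ID, with the visit number nested underneath. One possible way to get that layout is to move IDs and Visit_Number into a MultiIndex. The sketch below hard-codes the values from the output above so that it runs on its own; the name result is only illustrative.

```python
import pandas as pd

# Values taken from the output above (duration in seconds), rebuilt here
# so the snippet is self-contained.
df = pd.DataFrame({
    "IDs": [4678, 4703, 4678, 5803, 4678],
    "Visit_Number": [1, 1, 2, 1, 3],
    "duration": [0, 0, 73, 0, 39],
})

# Index by (ID, visit number) and sort so each ID's visits sit together.
result = df.set_index(["IDs", "Visit_Number"])["duration"].sort_index()

# All visits of one customer, indexed by visit number.
print(result.loc[4678])
```

Selecting a single visit then works with a tuple, for example result.loc[(4678, 2)].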

Something like the following:

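import pandas as pd

a = pd.read_csv("a.csv")
# Parse the timestamp strings into datetime values.
a["date"] = pd.to_datetime(a["date"], format="%Y-%m-%d %H:%M:%S")
# Sort chronologically so diff() gives the time since the previous visit.
# "id" is the ID column in a.csv; use "IDs" for the frame in the question.
for user_id, group in a.sort_values("date").groupby("id"):
    print(user_id, group["date"].diff())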

Outputs:

4678 4        NaT
2   00:00:39
0   00:01:13
Name: date, dtype: timedelta64[ns]
4703 1   NaT
Name: date, dtype: timedelta64[ns]
5803 3   NaT
Name: date, dtype: timedelta64[ns]
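
The loop above prints one series per user. If the differences are needed as a column instead, the same idea can be written without a loop. This is a sketch only: it builds the sample rows in memory rather than reading a.csv, keeps the column name "id" from the loop, and since_last is a made-up column name.

```python
import pandas as pd

# Sample rows from the question, with the ID column named "id" as in the
# loop above (an assumption about the contents of a.csv).
a = pd.DataFrame({
    "id": [4678, 4703, 4678, 5803, 4678],
    "date": pd.to_datetime([
        "2012-11-30 23:59:59",
        "2012-11-30 23:59:23",
        "2012-11-30 23:58:46",
        "2012-11-30 23:58:19",
        "2012-11-30 23:58:07",
    ]),
})

# Sort oldest first, then take the per-id difference in a single call.
a = a.sort_values("date")
a["since_last"] = a.groupby("id")["date"].diff()

print(a)
```

With the rows sorted oldest first, no sign flip is needed, and the first visit of each user is the one left as NaT.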
