
Calculating the total number of consecutive days and missing days in time series data

I have a dataframe that looks like this (normally it has many users):

userid  |  activityday
222        2015-01-09 12:00
222        2015-01-10 12:00
222        2015-01-11 12:00
222        2015-01-13 12:00
222        2015-01-14 12:00
222        2015-01-15 12:00
222        2015-01-17 12:00
222        2015-01-18 12:00
222        2015-01-19 12:00
222        2015-01-20 12:00
222        2015-01-20 12:00

I want to obtain the total number of consecutive active and inactive days until a given date. For example, if the date is 2015-01-23 then:

userid | days_active_jb  | days_inactive_jb | ttl_days_active | ttl_days_inactive
222    | 3               | 2                | 10              | 2

Or, if the given date is 2015-01-15 then:

userid | days_active_jb  | days_inactive_jb | ttl_days_active | ttl_days_inactive
222    | 2               | 0                | 5               | 1

I have around 300,000 rows to process to obtain this final dataframe. What would be an efficient way to achieve this? Any ideas?

Here is what each column means:

days_active_jb : number of consecutive days on which the student has activity just before the given date.

days_inactive_jb : number of consecutive days on which the student has NO activity just before the given date.

ttl_days_active : total number of days before the given date on which the student has activity.

ttl_days_inactive : total number of days before the given date on which the student has NO activity.
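
To make the four definitions concrete, here is a minimal plain-Python sketch of one possible reading of them. The function name `summarize` and the returned dictionary are hypothetical, and it assumes that "inactive" days in `ttl_days_inactive` are only the gaps between the first and the last active day before the given date, which is what the expected values above suggest.

```python
from datetime import date, timedelta

def summarize(active_dates, cutoff):
    """Return the four counts for one user, using only days before `cutoff`."""
    # Keep distinct calendar days strictly before the cutoff.
    days = sorted({d for d in active_dates if d < cutoff})
    if not days:
        return {"days_active_jb": 0, "days_inactive_jb": 0,
                "ttl_days_active": 0, "ttl_days_inactive": 0}

    one_day = timedelta(days=1)

    # Length of the run of consecutive active days ending at the last active day.
    streak = 1
    for later, earlier in zip(reversed(days), reversed(days[:-1])):
        if later - earlier != one_day:
            break
        streak += 1

    # Calendar span from the first to the last active day, both included.
    span = (days[-1] - days[0]).days + 1
    return {"days_active_jb": streak,
            # Inactive days between the last active day and the cutoff.
            "days_inactive_jb": (cutoff - days[-1]).days - 1,
            "ttl_days_active": len(days),
            # Missing days inside the span.
            "ttl_days_inactive": span - len(days)}

# The sample user from the table above, including the duplicated 2015-01-20 row.
active = [date(2015, 1, d) for d in (9, 10, 11, 13, 14, 15, 17, 18, 19, 20, 20)]
result = summarize(active, date(2015, 1, 15))
print(result)
```

For the given date 2015-01-15 this reproduces the second expected row (2, 0, 5, 1).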

Setup:

df
Out[1714]: 
    userid         activityday
0      222 2015-01-09 12:00:00
1      222 2015-01-10 12:00:00
2      222 2015-01-11 12:00:00
3      222 2015-01-13 12:00:00
4      222 2015-01-14 12:00:00
5      222 2015-01-15 12:00:00
6      222 2015-01-17 12:00:00
7      222 2015-01-18 12:00:00
8      222 2015-01-19 12:00:00
9      222 2015-01-20 12:00:00
11     322 2015-01-09 12:00:00
12     322 2015-01-10 12:00:00
13     322 2015-01-11 12:00:00
14     322 2015-01-13 12:00:00
15     322 2015-01-14 12:00:00
16     322 2015-01-15 12:00:00
17     322 2015-01-17 12:00:00
18     322 2015-01-18 12:00:00
19     322 2015-01-19 12:00:00
20     322 2015-01-20 12:00:00
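
The frame above is only shown as output. A sketch like the following should rebuild an equivalent one, assuming the same ten active days for both users and a 12:00 timestamp on every row. The variable names `active_days` and `stamps` are made up for this example, and the resulting index runs 0 to 19 rather than skipping 10 as in the printout.

```python
import pandas as pd

# Active calendar days taken from the table above; every row is stamped 12:00.
active_days = [9, 10, 11, 13, 14, 15, 17, 18, 19, 20]
stamps = list(pd.to_datetime([f"2015-01-{day:02d} 12:00" for day in active_days]))

# Two users with identical activity, as in the printed frame.
df = pd.DataFrame({
    "userid": [222] * len(stamps) + [322] * len(stamps),
    "activityday": stamps * 2,
})
```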

Solution

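import numpy as np
import pandas as pd

# Each function receives one user's activityday Series and reads the
# module-level cut_off_days string, which is set before calling agg below.

def days_active_jb(x):
    # length of the most recent run of consecutive active days before the cut-off
    x = x[x < pd.to_datetime(cut_off_days)]
    if len(x) == 0:
        return 0
    x = [e.date() for e in x.sort_values(ascending=False)]
    prev = x.pop(0)
    i = 1
    for e in x:
        if (prev - e).days == 1:
            i += 1
            prev = e
        else:
            break
    return i

def days_inactive_jb(x):
    # whole days between the last activity before the cut-off and the cut-off
    x = x[x < pd.to_datetime(cut_off_days)]
    if len(x) == 0:
        return 0
    return (pd.to_datetime(cut_off_days) - max(x)).days

def ttl_days_active(x):
    # number of activity rows before the cut-off
    return len(x[x < pd.to_datetime(cut_off_days)])

def ttl_days_inactive(x):
    # count the missing days between the first and last activity before the cut-off
    x = x[x < pd.to_datetime(cut_off_days)]
    if len(x) == 0:
        return 0
    return len(pd.date_range(min(x), max(x))) - len(x)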

#drop duplicate userid-activityday pairs
df = df.drop_duplicates(subset=['userid','activityday'])
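
Note that drop_duplicates compares full timestamps. In the sample data every row is stamped 12:00, so this is enough, but if a user can be active at different times on the same day, those rows would survive as separate "days". A possible way around that, sketched on three made-up rows, is to truncate the timestamps to midnight first:

```python
import pandas as pd

# Hypothetical rows: two activities on 2015-01-20 at different times.
df = pd.DataFrame({
    "userid": [222, 222, 222],
    "activityday": pd.to_datetime(
        ["2015-01-20 09:30", "2015-01-20 18:45", "2015-01-21 12:00"]
    ),
})

# Truncate each timestamp to midnight so rows from the same day compare equal.
df["activityday"] = df["activityday"].dt.normalize()
df = df.drop_duplicates(subset=["userid", "activityday"])
```

After this, the two rows for 2015-01-20 collapse into one.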

cut_off_days = '2015-01-23'
df.sort_values(by=['userid','activityday'],ascending=False).\
              groupby('userid')['activityday'].\
              agg([days_active_jb,
                   days_inactive_jb,
                   ttl_days_active,
                   ttl_days_inactive]).\
              astype(np.int64)

Out[1856]: 
        days_active_jb  days_inactive_jb  ttl_days_active  ttl_days_inactive
userid                                                                      
222                  4                 2               10                  2
322                  4                 2               10                  2


cut_off_days = '2015-01-15'
df.sort_values(by=['userid','activityday'],ascending=False).\
              groupby('userid')['activityday'].\
              agg([days_active_jb,
                   days_inactive_jb,
                   ttl_days_active,
                   ttl_days_inactive]).\
              astype(np.int64)

Out[1863]: 
        days_active_jb  days_inactive_jb  ttl_days_active  ttl_days_inactive
userid                                                                      
222                  2                 0                5                  1
322                  2                 0                5                  1
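
The aggregation functions above loop in Python once per user. With around 300,000 rows that may well be fast enough, but if it is not, the same four numbers can probably be obtained with grouped, vectorized operations. The following is only a sketch under a few assumptions: the cut-off is a plain date (midnight), the function name `summarize_activity` and the helper columns `day` and `run` are made up here, and users with no activity before the cut-off simply get no row.

```python
import pandas as pd

def summarize_activity(df, cutoff):
    """Per-user streak and gap counts for activity strictly before `cutoff`."""
    cutoff = pd.Timestamp(cutoff)
    days = (
        df.loc[df["activityday"] < cutoff, ["userid", "activityday"]]
        .assign(day=lambda d: d["activityday"].dt.normalize())
        .drop_duplicates(subset=["userid", "day"])
        .sort_values(["userid", "day"])
        .reset_index(drop=True)
    )

    # A new run starts wherever the gap to the user's previous active day
    # is not exactly one day (the first row of each user has a NaT gap).
    gap = days.groupby("userid")["day"].diff()
    new_run = gap.ne(pd.Timedelta(days=1)).astype(int)
    days["run"] = new_run.groupby(days["userid"]).cumsum()

    # Rows that belong to each user's most recent run.
    in_last_run = days["run"].eq(days.groupby("userid")["run"].transform("max"))
    streak = in_last_run.astype(int).groupby(days["userid"]).sum()

    by_user = days.groupby("userid")["day"]
    first, last, total = by_user.min(), by_user.max(), by_user.count()

    return pd.DataFrame({
        "days_active_jb": streak,
        # Inactive days between the last active day and the cut-off.
        "days_inactive_jb": ((cutoff.normalize() - last).dt.days - 1).clip(lower=0),
        "ttl_days_active": total,
        # Missing days between the first and last active day.
        "ttl_days_inactive": (last - first).dt.days + 1 - total,
    })
```

Calling `summarize_activity(df, '2015-01-23')` on the setup frame should give the same table as the first output above, without needing the global cut_off_days variable.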

'''
This code handles several user ids in the same file.
The data must be in exactly the format shown in the question.
'''
import datetime

# Build a list of [uid, activedate, time] rows from the file, skipping blank lines.
with open('c:/python34/stack.txt') as f:
    data = [line.split() for line in f if line.strip()]
data.pop(0)  # drop the first element, i.e. the header

def active_dates(active_list, uid):
    '''Return a list of [year, month, day] lists, one per activity
       row of the given user id 'uid'.'''
    return [[int(part) for part in row[1].split('-')]
            for row in active_list if row[0] == uid]

def active_days(from_, to, dates):
    # Return the number of active days strictly between the start date
    # 'from_' and the end date 'to'.
    count = 0
    for item in dates:
        d1 = datetime.date(item[0], item[1], item[2])
        if d1 > from_ and d1 < to:
            count += 1
    return count

def remove_duplicates(lst):
    # Remove duplicates caused by activity at different times on the same day.
    lst.sort()
    i = len(lst) - 1
    while i > 0:
        if lst[i] == lst[i - 1]:
            lst.pop(i)
        i -= 1
    return lst

active = remove_duplicates(active_dates(data, '222'))  # pass the uid as a string
from_ = datetime.date(2015, 1, 1)
to = datetime.date(2015, 1, 26)
activedays = active_days(from_, to, active)
total_days = to - from_
inactive_days = total_days.days - activedays
print('activedays: %s and inactive days: %s' % (activedays, inactive_days))
