I have a dataframe that looks like this (normally it has many users):
userid | activityday
222 2015-01-09 12:00
222 2015-01-10 12:00
222 2015-01-11 12:00
222 2015-01-13 12:00
222 2015-01-14 12:00
222 2015-01-15 12:00
222 2015-01-17 12:00
222 2015-01-18 12:00
222 2015-01-19 12:00
222 2015-01-20 12:00
222 2015-01-20 12:00
I want to obtain the total number of consecutive active and inactive days until a given date. For example, if the date is 2015-01-23 then:
userid | days_active_jb | days_inactive_jb | ttl_days_active | ttl_days_inactive
222 | 4 | 2 | 10 | 2
Or, if the given date is 2015-01-15 then:
userid | days_active_jb | days_inactive_jb | ttl_days_active | ttl_days_inactive
222 | 2 | 0 | 5 | 1
I have around 300,000 rows to process to obtain this final dataframe. What would be an efficient way to achieve this? Any ideas?
Here are the explanations for each column:
days_active_jb: number of consecutive days the student has activity just before the given date.
days_inactive_jb: number of consecutive days the student has NO activity just before the given date.
ttl_days_active: total number of days the student has activity before the given date.
ttl_days_inactive: total number of days the student has NO activity before the given date.
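To make the four definitions concrete, here is a minimal pure-Python sketch for a single user. The function name activity_summary is hypothetical. It also assumes that ttl_days_inactive counts only the gaps between the first and last active day before the given date, which is what the 2015-01-15 example suggests.

```python
from datetime import date, timedelta

def activity_summary(active_days, cutoff):
    """Return (days_active_jb, days_inactive_jb, ttl_days_active, ttl_days_inactive)."""
    # Keep distinct active days strictly before the cutoff date.
    before = sorted({d for d in active_days if d < cutoff})
    if not before:
        return (0, 0, 0, 0)
    first, last = before[0], before[-1]
    seen = set(before)

    # Walk backwards from the last active day while the days stay consecutive.
    streak = 0
    day = last
    while day in seen:
        streak += 1
        day -= timedelta(days=1)

    # Days strictly between the last active day and the cutoff.
    inactive_jb = (cutoff - last).days - 1
    ttl_active = len(before)
    # Assumption: only gaps between the first and last active day are counted.
    ttl_inactive = (last - first).days + 1 - ttl_active
    return (streak, inactive_jb, ttl_active, ttl_inactive)

# Sample data from the question, including the duplicated 2015-01-20 row.
days = [date(2015, 1, d) for d in (9, 10, 11, 13, 14, 15, 17, 18, 19, 20, 20)]
print(activity_summary(days, date(2015, 1, 15)))  # (2, 0, 5, 1)
```

For the 2015-01-15 case, only 9, 10, 11, 13 and 14 January fall before the date: the last run is 13 to 14 January (2 days), there is no idle day before the 15th, and 12 January is the single gap.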
Setup:
df
Out[1714]:
userid activityday
0 222 2015-01-09 12:00:00
1 222 2015-01-10 12:00:00
2 222 2015-01-11 12:00:00
3 222 2015-01-13 12:00:00
4 222 2015-01-14 12:00:00
5 222 2015-01-15 12:00:00
6 222 2015-01-17 12:00:00
7 222 2015-01-18 12:00:00
8 222 2015-01-19 12:00:00
9 222 2015-01-20 12:00:00
11 322 2015-01-09 12:00:00
12 322 2015-01-10 12:00:00
13 322 2015-01-11 12:00:00
14 322 2015-01-13 12:00:00
15 322 2015-01-14 12:00:00
16 322 2015-01-15 12:00:00
17 322 2015-01-17 12:00:00
18 322 2015-01-18 12:00:00
19 322 2015-01-19 12:00:00
20 322 2015-01-20 12:00:00
Solution
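import numpy as np
import pandas as pd

def days_active_jb(x):
    # length of the last run of consecutive active days before the cutoff
    x = x[x < pd.to_datetime(cut_off_days)]
    if len(x) == 0:
        return 0
    # latest day first, then walk backwards while days are consecutive
    x = [e.date() for e in x.sort_values(ascending=False)]
    prev = x.pop(0)
    i = 1
    for e in x:
        if (prev - e).days == 1:
            i += 1
            prev = e
        else:
            break
    return i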
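def days_inactive_jb(x):
    # only look at activity before the cutoff, otherwise later activity hides the gap
    x = x[x < pd.to_datetime(cut_off_days)]
    if len(x) == 0:
        return 0
    return (pd.to_datetime(cut_off_days) - max(x)).days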
def ttl_days_active(x):
    # number of activity rows before the cutoff (duplicates are dropped beforehand)
    return len(x[x < pd.to_datetime(cut_off_days)])

def ttl_days_inactive(x):
    # count the missing days between the first and last activity before the cutoff
    x = x[x < pd.to_datetime(cut_off_days)]
    if len(x) == 0:
        return 0
    return len(pd.date_range(min(x), max(x))) - len(x)
# drop duplicate userid-activityday pairs
df = df.drop_duplicates(subset=['userid', 'activityday'])
cut_off_days = '2015-01-23'
df.sort_values(by=['userid', 'activityday'], ascending=False).\
    groupby('userid')['activityday'].\
    agg([days_active_jb,
         days_inactive_jb,
         ttl_days_active,
         ttl_days_inactive]).\
    astype(np.int64)
Out[1856]:
days_active_jb days_inactive_jb ttl_days_active ttl_days_inactive
userid
222 4 2 10 2
322 4 2 10 2
cut_off_days = '2015-01-15'
df.sort_values(by=['userid', 'activityday'], ascending=False).\
    groupby('userid')['activityday'].\
    agg([days_active_jb,
         days_inactive_jb,
         ttl_days_active,
         ttl_days_inactive]).\
    astype(np.int64)
Out[1863]:
days_active_jb days_inactive_jb ttl_days_active ttl_days_inactive
userid
222 2 0 5 1
322 2 0 5 1
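Since the question mentions roughly 300,000 rows, here is a hedged sketch of the same four columns computed without a Python-level loop per user. It is not part of the answer above. The function name summarize_activity is hypothetical, and it assumes activityday is a timezone-naive datetime64 column. It also takes the cutoff as an argument instead of reading a global. Users with no activity before the cutoff are simply absent from the result.

```python
import pandas as pd

def summarize_activity(df, cutoff):
    """Per-user activity summary before `cutoff` (hypothetical helper)."""
    cutoff = pd.Timestamp(cutoff).normalize()
    # One row per (user, calendar day) strictly before the cutoff.
    days = df[["userid"]].assign(day=df["activityday"].dt.normalize())
    days = (days[days["day"] < cutoff]
            .drop_duplicates()
            .sort_values(["userid", "day"]))

    # A new run starts whenever the gap to the previous day is not exactly
    # one day; the first row of each user has a NaT gap and also starts a run.
    gap = days.groupby("userid")["day"].diff()
    run = (gap != pd.Timedelta(days=1)).cumsum()
    days = days.assign(run=run, run_size=run.map(run.value_counts()))

    g = days.groupby("userid")
    first, last, n_active = g["day"].min(), g["day"].max(), g["day"].size()
    return pd.DataFrame({
        # Rows are sorted by day, so the last row belongs to the latest run.
        "days_active_jb": g["run_size"].last(),
        "days_inactive_jb": (cutoff - last).dt.days - 1,
        "ttl_days_active": n_active,
        "ttl_days_inactive": (last - first).dt.days + 1 - n_active,
    })

# Two users with the same activity pattern as in the setup above.
stamps = pd.to_datetime([f"2015-01-{d:02d} 12:00"
                         for d in (9, 10, 11, 13, 14, 15, 17, 18, 19, 20)])
df = pd.DataFrame({"userid": [222] * 10 + [322] * 10,
                   "activityday": list(stamps) * 2})
print(summarize_activity(df, "2015-01-15"))
```

Normalizing the timestamps to midnight means the result does not depend on the time of day of each activity, at the cost of one extra column.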
'''
This code works for different user ids in the same file.
The data must be strictly in the format you provided.
'''
import datetime

'''
The following list comprehension builds a list of lists
[uid, activedate, time] from the file, skipping blank lines.
'''
with open('c:/python34/stack.txt') as f:
    data = [item.split() for item in f if item.strip()]
data.pop(0)  # pop the first element, i.e. the header
def active_dates(active_list, uid):
    '''Return a list of [year, month, day] lists for the active dates
    of the given user id 'uid'. The time field is ignored.'''
    return [[int(part) for part in item[1].split('-')]
            for item in active_list if item[0] == uid]
def active_days(from_, to, dates):
    # returns the number of active days d with from_ <= d < to,
    # so the window has exactly (to - from_).days days
    count = 0
    for item in dates:
        d1 = datetime.date(item[0], item[1], item[2])
        if from_ <= d1 < to:
            count += 1
    return count
def remove_duplicates(lst):
    # removes the duplicates if active at different times on the same day
    lst.sort()
    i = len(lst) - 1
    while i > 0:
        if lst[i] == lst[i - 1]:
            lst.pop(i)
        i -= 1
    return lst
active = remove_duplicates(active_dates(data, '222'))  # pass uid variable as string
from_ = datetime.date(2015, 1, 1)
to = datetime.date(2015, 1, 26)
activedays = active_days(from_, to, active)
total_days = to - from_
inactive_days = total_days.days - activedays
print('activedays: %s and inactive days: %s' % (activedays, inactive_days))
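As a variation on the parsing step above, the lines could be parsed with datetime.strptime and collected into one set of dates per user, which removes same-day duplicates without a separate pass. This is only a sketch: the function name parse_activity is hypothetical, and it assumes each data line has the form "userid YYYY-MM-DD HH:MM" as in the question. It takes a list of strings, so it can be tried without the file.

```python
from collections import defaultdict
from datetime import date, datetime

def parse_activity(lines):
    """Map each userid (as a string) to the set of dates it was active on."""
    active = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # blank or malformed line
        try:
            stamp = datetime.strptime(parts[1] + " " + parts[2], "%Y-%m-%d %H:%M")
        except ValueError:
            continue  # e.g. the 'userid | activityday' header line
        active[parts[0]].add(stamp.date())
    return active

sample = [
    "userid | activityday",
    "222 2015-01-09 12:00",
    "222 2015-01-10 12:00",
    "222 2015-01-10 18:30",  # same day, different time
    "",
    "322 2015-01-09 12:00",
]
result = parse_activity(sample)
print(sorted(result["222"]))
```

Each set in the result can be passed to any per-user counting logic, since its elements are already datetime.date objects.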