I have a quite large dataframe (3 million rows) that looks like this:
df = pd.DataFrame({'user_id' : ['100','101','102','103','104'],
'service_id' : ['73', '73', '46', '12', '12'],
'date_of_service' : ['2015-06-10 17:00:00',
'2014-09-27 17:00:00',
'2015-01-12 17:00:00',
'2012-08-22 17:00:00',
'2013-03-07 17:00:00']})
df
I have a column for user id. The users are providing services, each services having an id (service_id). Each service has a date (date_of_service).
I would like to create a column that computes, for each row, the number of services that the user has done in the past (including the present one), ie I would like the following result dataframe:
df = pd.DataFrame({'user_id' : ['100','101','102','103','104'],
'service_id' : ['73', '73', '46', '12', '12'],
'date_of_service' : ['2015-06-10 17:00:00',
'2014-09-27 17:00:00',
'2015-01-12 17:00:00',
'2012-08-22 17:00:00',
'2013-03-07 17:00:00'],
'number_of_past_services' : [2, 1, 1, 1, 2]})
df
What I have done
I used a groupby and a count:
df['count_services'] = df.ix[:, 1:].groupby('user_id').transform('count')
The problem is that, here, I count all occurrences over the whole dataset. What I want is to have past occurrences!
I have tried to filter using sql-like operations, for example:
len(df[df.date_of_service < df['date_of_service'][0]][df.user_id == df.user_id[0]])
which gives the good result for the first row. However, this computation takes about 1 second for only one row!
I would like to know how to scale this so that I can create a column efficiently.
IIUC you can do it this way:
In [69]: df['number_of_past_services'] = df.sort_values('date_of_service') \
.assign(x=1) \
.groupby('service_id')['x'].cumsum()
In [70]: df
Out[70]:
date_of_service service_id user_id number_of_past_services
0 2015-06-10 17:00:00 73 100 2
1 2014-09-27 17:00:00 73 101 1
2 2015-01-12 17:00:00 46 102 1
3 2012-08-22 17:00:00 12 103 1
4 2013-03-07 17:00:00 12 104 2
If I understood it correctly, you could :
date_of_service
past_occurences.get(user_id, 0)
holds the information you need past_occurences[user_id] = past_occurences.get(user_id, 0) + 1
will update the dictionary. Of all steps, the sort will be the slowest. The rest should be reasonably fast.
PS: You could also use a defaultdict
. Here's an example .
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.