简体   繁体   中英

Count all past occurences of an element in a large dataset

I have a quite large dataframe (3 million rows) that looks like this:

df = pd.DataFrame({'user_id' : ['100','101','102','103','104'],
           'service_id' : ['73', '73', '46', '12', '12'],
           'date_of_service' : ['2015-06-10 17:00:00', 
                               '2014-09-27 17:00:00', 
                               '2015-01-12 17:00:00', 
                               '2012-08-22 17:00:00', 
                               '2013-03-07 17:00:00']})
df

I have a column for user id. The users are providing services, each services having an id (service_id). Each service has a date (date_of_service).

I would like to create a column that computes, for each row, the number of services that the user has done in the past (including the present one), ie I would like the following result dataframe:

df = pd.DataFrame({'user_id' : ['100','101','102','103','104'],
           'service_id' : ['73', '73', '46', '12', '12'],
           'date_of_service' : ['2015-06-10 17:00:00', 
                               '2014-09-27 17:00:00', 
                               '2015-01-12 17:00:00', 
                               '2012-08-22 17:00:00', 
                               '2013-03-07 17:00:00'],
              'number_of_past_services' : [2, 1, 1, 1, 2]})
df

What I have done

I used a groupby and a count:

df['count_services'] = df.ix[:, 1:].groupby('user_id').transform('count')

The problem is that, here, I count all occurrences over the whole dataset. What I want is to have past occurrences!

I have tried to filter using sql-like operations, for example:

len(df[df.date_of_service < df['date_of_service'][0]][df.user_id == df.user_id[0]])

which gives the good result for the first row. However, this computation takes about 1 second for only one row!

I would like to know how to scale this so that I can create a column efficiently.

IIUC you can do it this way:

In [69]: df['number_of_past_services'] = df.sort_values('date_of_service') \
                                           .assign(x=1) \
                                           .groupby('service_id')['x'].cumsum()

In [70]: df
Out[70]:
      date_of_service service_id user_id  number_of_past_services
0 2015-06-10 17:00:00         73     100                        2
1 2014-09-27 17:00:00         73     101                        1
2 2015-01-12 17:00:00         46     102                        1
3 2012-08-22 17:00:00         12     103                        1
4 2013-03-07 17:00:00         12     104                        2

If I understood it correctly, you could :

  • sort your data by date_of_service
  • create an empty dictionary of past occurences
  • iterate over all the sorted rows:
    • past_occurences.get(user_id, 0) holds the information you need
    • past_occurences[user_id] = past_occurences.get(user_id, 0) + 1 will update the dictionary.

Of all steps, the sort will be the slowest. The rest should be reasonably fast.

PS: You could also use a defaultdict . Here's an example .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM