
Selecting rows in a given range from grouped object

I have a DataFrame which looks similar to this one:

+------------+---------------------+---------+
|    action  |  ts                 |   uid   |
+------------+---------------------+---------+
| action1    | 2013-01-01 00:00:00 | 543534  |  
| action2    | 2013-01-01 00:00:00 | 543544  |
| action1    | 2013-01-01 00:00:02 | 543542  |
| action2    | 2013-01-01 00:00:03 | 543541  |
|   ....     |       ....          |   ...   |
+------------+---------------------+---------+
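A toy frame with the illustrative rows from the table above (not real data) can be built like this:

import pandas as pd

df = pd.DataFrame({
    'action': ['action1', 'action2', 'action1', 'action2'],
    'ts': pd.to_datetime(['2013-01-01 00:00:00', '2013-01-01 00:00:00',
                          '2013-01-01 00:00:02', '2013-01-01 00:00:03']),
    'uid': [543534, 543544, 543542, 543541],
})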

I want to count the number of actions of each type performed by each user in a given time range, so the expected output is something like this:

    uid  action1  action2
 543534       10        1
 543544        0        2
    ...

I was thinking of solving the problem by first applying .groupby('uid'), then iterating through the grouped object, selecting the rows where ts is in the given range, and finally concatenating the pieces into the resulting DataFrame and sorting it.

So, something like this:

df = ...
start_date = ...
end_date = ...
result = {}

grouped = df.groupby('uid')
grouped_dict = dict(list(grouped))  # materialise every group as its own DataFrame

for uid, group in grouped_dict.items():
    # keep only the rows whose timestamp falls into the range, then count them
    in_range = group[(group.ts > start_date) & (group.ts < end_date)]
    result[uid] = len(in_range)

I haven't run this code, but I suspect that even if it works it is extremely inefficient. Even converting the grouped object to a dictionary takes a lot of time. What would be a more efficient approach in this case?

You can group by both uid and action:

import pandas as pd

start_date = pd.to_datetime('2013-01-01 00:00:00')
end_date = pd.to_datetime('2013-01-01 00:00:07')
print(df)
# filter to the time range, count per (uid, action) pair, pivot actions into columns
print(df[(df.ts > start_date) & (df.ts < end_date)].groupby(['uid', 'action'])['ts'].count().unstack('action').fillna(0))

Output:

    action                  ts  uid
0  action1 2013-01-01 00:00:00    1
1  action2 2013-01-01 00:00:00    2
2  action1 2013-01-01 00:00:02    2
3  action2 2013-01-01 00:00:03    1
4  action2 2013-01-01 00:00:04    2
5  action2 2013-01-01 00:00:05    1
6  action1 2013-01-01 00:00:06    1
action  action1  action2
uid                     
1             1        2
2             1        1
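If only the counts are needed, pd.crosstab gives the same wide table in one call; this is just a minimal sketch, assuming the same df, start_date and end_date as above:

# rows inside the time window
in_range = df[(df.ts > start_date) & (df.ts < end_date)]
# frequency of each action per uid; missing combinations come out as 0, so no fillna is needed
print(pd.crosstab(in_range['uid'], in_range['action']))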

Looking at the interface of pandas.DataFrame, I'd select the data like this:

# Select the interesting date range
bydate = df[(df['ts'] > start_date) & (df['ts'] < end_date)]
# Now this will group by uid, *then* by action
grouped = bydate.groupby(['uid', 'action'])

Now, let's print the number of each action per uid:

for indices, data in grouped:
    print("Uid {}, Action '{}': {}".format(indices[0], indices[1], len(data)))
