I have a DataFrame which looks similar to this one:
+------------+---------------------+---------+
| action     | ts                  | uid     |
+------------+---------------------+---------+
| action1    | 2013-01-01 00:00:00 | 543534  |
| action2    | 2013-01-01 00:00:00 | 543544  |
| action1    | 2013-01-01 00:00:02 | 543542  |
| action2    | 2013-01-01 00:00:03 | 543541  |
| ....       | ....                | ...     |
+------------+---------------------+---------+
I want to count the number of actions of each type performed by each user in a given time range, so the expected output is something like this:
uid action1 action2
543534 10 1
543534 0 2
...
I was thinking of solving the problem by first applying .groupby('uid'), then iterating through the grouped object, selecting the rows where ts is in the given range, then concatenating the dataframes into a resulting dataframe and sorting it.
So, something like this:
df = ...
start_date = ...
end_date = ...
result = {}
grouped = df.groupby('uid')
grouped_dict = dict(list(grouped))
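for uid, group in grouped_dict.items():
    # Keep only this user's rows whose ts falls inside the range, then count them
    in_range = group[(group.ts > start_date) & (group.ts < end_date)]
    result[uid] = len(in_range)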
I haven't run this code, but I think that even if it works it's extremely inefficient. Even converting the grouped object to a dictionary takes a lot of time. What would be a more efficient approach in this case?
You can group by both uid and action:
import pandas as pd

start_date = pd.to_datetime('2013-01-01 00:00:00')
end_date = pd.to_datetime('2013-01-01 00:00:07')
print(df)
# Filter by time range, count rows per (uid, action), then pivot actions into columns
print(df[(df.ts > start_date) & (df.ts < end_date)].groupby(['uid', 'action'])['ts'].count().unstack('action').fillna(0))
Output:
    action                  ts  uid
0  action1 2013-01-01 00:00:00    1
1  action2 2013-01-01 00:00:00    2
2  action1 2013-01-01 00:00:02    2
3  action2 2013-01-01 00:00:03    1
4  action2 2013-01-01 00:00:04    2
5  action2 2013-01-01 00:00:05    1
6  action1 2013-01-01 00:00:06    1
action  action1  action2
uid
1             1        2
2             1        1
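For anyone who wants to try this without their own data, here is a self-contained sketch that rebuilds the sample frame from the output above and wraps the same filter-then-group idea in a function. The helper name count_actions is made up for illustration, and it uses size() with unstack(fill_value=0) rather than count() with fillna(0), which should give the same counts here.

```python
import pandas as pd

# Sample data copied from the output shown above.
df = pd.DataFrame({
    "action": ["action1", "action2", "action1", "action2",
               "action2", "action2", "action1"],
    "ts": pd.to_datetime([
        "2013-01-01 00:00:00", "2013-01-01 00:00:00", "2013-01-01 00:00:02",
        "2013-01-01 00:00:03", "2013-01-01 00:00:04", "2013-01-01 00:00:05",
        "2013-01-01 00:00:06",
    ]),
    "uid": [1, 2, 2, 1, 2, 1, 1],
})


def count_actions(frame, start, end):
    """Hypothetical helper: count rows per (uid, action) with start < ts < end."""
    in_range = frame[(frame["ts"] > start) & (frame["ts"] < end)]
    # size() counts the rows in each group; unstack() pivots the actions into
    # columns, and fill_value=0 covers (uid, action) pairs that never occur.
    return in_range.groupby(["uid", "action"]).size().unstack("action", fill_value=0)


counts = count_actions(
    df,
    pd.Timestamp("2013-01-01 00:00:00"),
    pd.Timestamp("2013-01-01 00:00:07"),
)
print(counts)
```

Note that both comparisons are strict, so the two rows stamped exactly 00:00:00 are excluded; switch to >= or <= if the boundaries should count.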
Looking at the interface of pandas.DataFrame, I'd select the data like this:
# Select the interesting date range
bydate = df[(df['ts'] > start_date) & (df['ts'] < end_date)]
# Now this will group by uid, *then* by action
grouped = bydate.groupby(['uid', 'action'])
Now, let's just print the number of actions per uid:
for indices, data in grouped:
    print("Uid {}, Action '{}': {}".format(indices[0], indices[1], len(data)))
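If the goal is only the counts, the Python-level loop can probably be skipped: calling size() on the grouped object should return the same numbers in one step, as a Series indexed by (uid, action). A small sketch on made-up data (the frame below is an assumption, not the asker's data) that checks the two agree:

```python
import pandas as pd

# Made-up sample data, only to compare the two ways of counting.
df = pd.DataFrame({
    "uid": [1, 1, 2, 2, 1],
    "action": ["action1", "action2", "action1", "action2", "action2"],
})
grouped = df.groupby(["uid", "action"])

# Counts collected by iterating over the groups, as in the loop above.
loop_counts = {key: len(data) for key, data in grouped}

# The same counts computed in a single call; the result is a Series
# indexed by (uid, action).
size_counts = grouped.size()
print(size_counts)
```

On a large frame the size() call avoids materialising every group as its own DataFrame, which is where most of the time in the loop goes.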