I have a dataset with id, event and metric columns:
import pandas as pd

df = pd.DataFrame([['a', 'x', 1],
                   ['a', 'x', 2],
                   ['b', 'x', 3],
                   ['b', 'x', 3],
                   ['a', 'z', 4],
                   ['a', 'z', 5],
                   ['b', 'y', 5]], columns=['id', 'event', 'metric'])
id event metric
0 a x 1
1 a x 2
2 b x 3
3 b x 3
4 a z 4
5 a z 5
6 b y 5
I need to find the last event in each id, get the row with this event plus all the rows above it with this id. The resulting dataframe should be the concatenation of such slices, with the following columns:
Desired output:
index new_id
0 0 ax
1 1 ax
2 2 bx
3 3 bx
4 0 az
5 1 az
6 4 az
7 5 az
8 2 by
9 3 by
10 6 by
I produced the desired output with the following code:
df['id_event'] = df.id + df.event
id_events = df.id_event.unique()
df_all = pd.DataFrame()
for i, id_event in enumerate(id_events):
    id = id_event[:1]
    event = id_event[1:]
    last_row_id = df[df.event == event].iloc[-1].name
    temp = df.iloc[:last_row_id + 1][df.id == id]
    temp['new_id'] = id_event
    df_all = pd.concat([df_all, temp.reset_index()], axis=0, sort=False)
df_all.reset_index()[['index', 'new_id']]
The problem is that I have around 20M rows, so it takes around 20 hours to get the result. I'm trying to solve this in an efficient way, e.g. without loops.
The main performance problem is that you're calling pd.concat() on every iteration, which is expensive: each call copies all the data accumulated so far, so the loop does quadratic work. Accumulate the pieces in a list and concatenate once at the end instead. Try this:
results = []  # added
for i, id_event in enumerate(id_events):
    id = id_event[:1]
    event = id_event[1:]
    last_row_id = df[df.event == event].iloc[-1].name
    temp = df.iloc[:last_row_id + 1][df.id == id]
    temp['new_id'] = id_event
    results.append(temp.reset_index())  # changed
df_all = pd.concat(results, sort=False)  # changed
This should save a lot of time; let us know how much.
Also note that df[df.event==event].iloc[-1].name can be written more simply as df[df.event==event].index[-1].
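If fixing the concat still isn't fast enough, the Python loop itself can be replaced with a single merge. Below is a sketch of that idea, not tested at your scale: it takes the unique (id, event) pairs in order of first appearance, finds each event's last row position, matches every row against every pair sharing its id, and keeps rows at or before the cutoff. Note that the intermediate merge can use a lot of memory when an id has many distinct events.

```python
import pandas as pd

df = pd.DataFrame([['a', 'x', 1],
                   ['a', 'x', 2],
                   ['b', 'x', 3],
                   ['b', 'x', 3],
                   ['a', 'z', 4],
                   ['a', 'z', 5],
                   ['b', 'y', 5]], columns=['id', 'event', 'metric'])

# last row label of each event, anywhere in the frame
last_idx = df.reset_index().groupby('event')['index'].last()

# unique (id, event) pairs in order of first appearance,
# each with its cutoff row and its position in the output
pairs = df[['id', 'event']].drop_duplicates()
pairs['cutoff'] = pairs['event'].map(last_idx)
pairs['order'] = range(len(pairs))

# match every row against every pair with the same id,
# then keep only rows at or before that pair's cutoff
merged = df.reset_index().merge(pairs, on='id', suffixes=('', '_pair'))
out = merged[merged['index'] <= merged['cutoff']].copy()
out['new_id'] = out['id'] + out['event_pair']
out = out.sort_values(['order', 'index'])[['index', 'new_id']].reset_index(drop=True)
```

On the sample data this reproduces the desired output exactly. Since everything is done with vectorized pandas operations, it should scale far better than the row-slicing loop, as long as the merge fits in memory.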