Slice and concatenate dataframe efficiently in Pandas

Question

I have a dataset with id, event and metric columns:

import pandas as pd

df = pd.DataFrame([['a','x',1],
                   ['a','x',2],
                   ['b','x',3],
                   ['b','x',3],
                   ['a','z',4],
                   ['a','z',5],
                   ['b','y',5]], columns=['id','event','metric'])


  id event  metric
0  a     x       1
1  a     x       2
2  b     x       3
3  b     x       3
4  a     z       4
5  a     z       5
6  b     y       5

For each id and event combination, I need to find the last row with that event and take that row plus all the rows above it that have the same id. The resulting dataframe should be the concatenation of these slices, with the following columns:

index in the original df
new id formed as 'id we used to get the filtered slice' + 'the last event in the filtered slice'

Desired output:

  index new_id
0   0   ax
1   1   ax
2   2   bx
3   3   bx
4   0   az
5   1   az
6   4   az
7   5   az
8   2   by
9   3   by
10  6   by

I produced the desired output with the following code:

df['id_event'] = df.id + df.event
id_events = df.id_event.unique()
df_all = pd.DataFrame()


for i,id_event in (enumerate(id_events)):
    id = id_event[:1]
    event = id_event[1:]
    last_row_id = df[df.event==event].iloc[-1].name
    temp = df.iloc[: last_row_id +1][df.id==id]
    temp['new_id'] = id_event

    df_all = pd.concat([df_all, temp.reset_index()], axis=0, sort=False)


df_all.reset_index()[['index', 'new_id']]

The problem is that I have around 20M rows, so it takes around 20 hours to get the result. I'm looking for a more efficient way to do this, e.g. without loops.

Answer 1

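The main performance problem is that you're calling pd.concat() on every iteration. Each call copies everything accumulated so far into a new dataframe, so the total cost grows roughly quadratically with the number of rows. Collect the slices in a list and concatenate once at the end instead: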

results = [] # added
for i,id_event in (enumerate(id_events)):
    id = id_event[:1]
    event = id_event[1:]
    last_row_id = df[df.event==event].iloc[-1].name
    temp = df.iloc[: last_row_id +1][df.id==id]
    temp['new_id'] = id_event
    results.append(temp.reset_index()) # changed

df_all = pd.concat(results, sort=False) # changed

This should save a lot of time. Let us know how much.
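
For reference, here is a self-contained sketch of the same collect-then-concatenate pattern that can be run against the sample data from the question. The function name build_slices and the boolean-mask slicing are my own choices, not part of the original code. It assumes the dataframe has the default RangeIndex, so that "rows above" means rows with a smaller index value, and it takes the pairs from the id and event columns directly rather than splitting the concatenated string.

```python
import pandas as pd

df = pd.DataFrame(
    [["a", "x", 1], ["a", "x", 2], ["b", "x", 3], ["b", "x", 3],
     ["a", "z", 4], ["a", "z", 5], ["b", "y", 5]],
    columns=["id", "event", "metric"],
)


def build_slices(df):
    """Build one slice per (id, event) pair, then concatenate once.

    Hypothetical helper; assumes a default, increasing RangeIndex.
    """
    # Distinct (id, event) pairs in order of first appearance.
    pairs = df[["id", "event"]].drop_duplicates()
    results = []
    for id_value, event in pairs.itertuples(index=False):
        # Index label of the last row carrying this event.
        event_mask = (df["event"] == event).to_numpy()
        last_row = df.index[event_mask][-1]
        # Rows up to and including that row that belong to this id.
        keep = (df.index <= last_row) & (df["id"] == id_value).to_numpy()
        results.append(
            pd.DataFrame({"index": df.index[keep], "new_id": id_value + event})
        )
    # A single concat at the end instead of one per iteration.
    return pd.concat(results, ignore_index=True)


out = build_slices(df)
print(out)
```

On the sample data this yields the same index and new_id columns as the desired output in the question.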

Also note that df[df.event==event].iloc[-1].name can be written more simply as df[df.event==event].index[-1].
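
Since the question asks for a version without loops, here is one possible sketch using merge. This is not the approach from the answer above, only an illustration under a few assumptions: the dataframe has the default RangeIndex, and helper names such as last_row and pair_order are made up for the example. It is untested at the 20M-row scale. The intermediate merge holds (rows per id) x (events per id) rows, so memory use may become the limiting factor.

```python
import pandas as pd

df = pd.DataFrame(
    [["a", "x", 1], ["a", "x", 2], ["b", "x", 3], ["b", "x", 3],
     ["a", "z", 4], ["a", "z", 5], ["b", "y", 5]],
    columns=["id", "event", "metric"],
)

# Expose the original index as a regular column named "index".
rows = df.reset_index()[["index", "id", "event"]]

# Last row of each event (hypothetical column name: last_row).
last_by_event = (
    rows.groupby("event", sort=False)["index"].max()
    .rename("last_row")
    .reset_index()
)

# Distinct (id, event) pairs, numbered in order of first appearance.
pairs = rows[["id", "event"]].drop_duplicates().reset_index(drop=True)
pairs["pair_order"] = pairs.index
pairs = pairs.merge(last_by_event, on="event")

# Attach every row of the matching id to each pair, then keep only
# the rows at or above the last row of that pair's event.
candidates = pairs.merge(rows[["index", "id"]], on="id")
out = (
    candidates.loc[candidates["index"] <= candidates["last_row"]]
    .sort_values(["pair_order", "index"])
    .assign(new_id=lambda d: d["id"] + d["event"])
    [["index", "new_id"]]
    .reset_index(drop=True)
)
print(out)
```

The explicit sort on pair_order and index is there so the row order does not depend on how merge happens to order its output.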
