Slice and concatenate dataframe efficiently in Pandas

I have a dataset with id, event and metric columns:

import pandas as pd

df = pd.DataFrame([['a', 'x', 1],
                   ['a', 'x', 2],
                   ['b', 'x', 3],
                   ['b', 'x', 3],
                   ['a', 'z', 4],
                   ['a', 'z', 5],
                   ['b', 'y', 5]], columns=['id', 'event', 'metric'])


  id event  metric
0  a     x       1
1  a     x       2
2  b     x       3
3  b     x       3
4  a     z       4
5  a     z       5
6  b     y       5

For each id and each of its events, I need to find the last row with that event and take that row plus all the rows above it that have the same id. The resulting dataframe should be the concatenation of these slices, with the following columns:

  1. index in the original df
  2. new id formed as 'id we used to get the filtered slice' + 'the last event in the filtered slice'

Desired output:

  index new_id
0   0   ax
1   1   ax
2   2   bx
3   3   bx
4   0   az
5   1   az
6   4   az
7   5   az
8   2   by
9   3   by
10  6   by

I produced the desired output with the following code:

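df['id_event'] = df.id + df.event
id_events = df.id_event.unique()
df_all = pd.DataFrame()

for id_event in id_events:
    # splitting like this assumes single-character ids, as in the sample
    id_ = id_event[:1]
    event = id_event[1:]
    # label of the last row with this event (default RangeIndex assumed)
    last_row_id = df[df.event == event].iloc[-1].name
    # filter the slice with its own mask; .copy() avoids SettingWithCopyWarning
    head = df.iloc[:last_row_id + 1]
    temp = head[head.id == id_].copy()
    temp['new_id'] = id_event

    df_all = pd.concat([df_all, temp.reset_index()], axis=0, sort=False)

df_all.reset_index(drop=True)[['index', 'new_id']]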

The problem is that I have around 20M rows, so it takes around 20 hours to get the result. I'm looking for a more efficient way to do this, e.g. without loops.

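The main performance problem is that you're calling pd.concat() on every iteration. Each call copies everything accumulated so far, so the total cost grows roughly quadratically with the number of slices. Collect the slices in a list and concatenate once at the end: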

results = []  # added
for id_event in id_events:
    id_ = id_event[:1]
    event = id_event[1:]
    last_row_id = df[df.event == event].iloc[-1].name
    head = df.iloc[:last_row_id + 1]
    temp = head[head.id == id_].copy()
    temp['new_id'] = id_event
    results.append(temp.reset_index())  # changed

df_all = pd.concat(results, sort=False)  # changed

This should save a lot of time; let us know how much.

Also note that df[df.event==event].iloc[-1].name can be written more simply as df[df.event==event].index[-1].
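Since the question asks for a version without loops, the same result can probably also be reached with a merge. The following is only a sketch, not a tested solution for the 20M-row case. It assumes the default unnamed RangeIndex, and it mirrors the posted loop by taking the cutoff as the last row of an event across all ids. The names rows, pairs, pair_order and cutoff are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame([['a', 'x', 1],
                   ['a', 'x', 2],
                   ['b', 'x', 3],
                   ['b', 'x', 3],
                   ['a', 'z', 4],
                   ['a', 'z', 5],
                   ['b', 'y', 5]], columns=['id', 'event', 'metric'])

# Original row labels as an ordinary column named 'index'
# (assumption: default unnamed RangeIndex).
rows = df.reset_index()[['index', 'id', 'event']]

# One row per distinct (id, event) pair, in order of first appearance.
pairs = rows[['id', 'event']].drop_duplicates().reset_index(drop=True)
pairs['pair_order'] = range(len(pairs))
pairs['new_id'] = pairs['id'] + pairs['event']

# Cutoff per event: label of the last row carrying that event, across
# all ids, which is what the loop computes with df[df.event == event].
pairs['cutoff'] = pairs['event'].map(rows.groupby('event')['index'].max())

# Pair every (id, event) with all rows of the same id, then keep only
# the rows at or above the cutoff.
merged = pairs.merge(rows[['index', 'id']], on='id')
merged = merged[merged['index'] <= merged['cutoff']]

# Restore the loop's ordering explicitly instead of relying on merge order.
merged = merged.sort_values(['pair_order', 'index'], kind='stable')

result = merged[['index', 'new_id']].reset_index(drop=True)
print(result)
```

Note that the merge materializes one row per combination of (id, event) pair and row of the same id before filtering, so memory use grows with the number of events per id times the number of rows per id. On very large data it may be necessary to process a batch of ids at a time.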
