How could I improve the runtime of this code?

Question

I want to share some information between two dataframes. My code works but takes a long time. Do you know how I could improve the runtime? I am trying to do the following:

I have a dataframe df1 (it has 160 columns, but only the ones shown are important here):

         a_idx  b_idx  c_idx  d_idx  e_idx  f_idx Evt_ID
    0    0      1      3      4      2      6     346642
    1    1      2      3      4      5      5     917426
    2    0      1      3      4      2      2     123543
                        ...

And a dataframe df2 (it has 10 columns, but only these are important here):

    Name    Evt_ID
0   Jet1    346642
1   Jet2    346642
2   Jet3    346642
3   Jet4    346642
4   Jet5    346642
5   Jet6    346642
6   Jet7    346642
7   Lepton  346642
8   Jet1    917426
9   Jet2    917426
      ...

Now I want a new column in df2 called "y" holding the category of each row. The category is looked up in df1. The possible categories are category_list = ["a", "b", "c", "d", "e", "f"], plus "unknown" for rows that match none of them. For example, the first row of df1 has the values category = [0, 1, 3, 4, 2, 6], which means that df2 should look like this:

(explanation: the fifth number in category is 2 --> Jet(2+1) = Jet3 gets the fifth category in category_list: "e")

    Name    Evt_ID    y
0   Jet1    346642    a
1   Jet2    346642    b
2   Jet3    346642    e
3   Jet4    346642    c
4   Jet5    346642    d
5   Jet6    346642    unknown
6   Jet7    346642    f
7   Lepton  346642    unknown
     ...

My way to achieve this is the following:

from tqdm import tqdm

df2["y"] = "unknown"
category_list = ["a", "b", "c", "d", "e", "f"]

for event_id in tqdm(df1.Evt_ID):
    # Jet indices of this event, one per category (a..f)
    category = df1.loc[df1.Evt_ID == event_id, ["a_idx", "b_idx",
                                                "c_idx", "d_idx",
                                                "e_idx", "f_idx"]].values.squeeze()

    for i, jet_index in enumerate(category):
        # Jet index k refers to the df2 row named "Jet<k+1>"
        df2.loc[(df2.Evt_ID == event_id) & (df2.Name == "Jet" + str(jet_index + 1)), "y"] = category_list[i]

This code takes 30 or 60 minutes to run, depending on the Jupyter notebook it is running in. Why would the notebook itself affect the runtime? But more importantly: how can I improve the runtime?

Answer 1

The following snippet should run way faster thanks to its vectorized structure.

There are two tricks here. The first is to use df.melt, which efficiently turns the columns a, b, ..., f into rows. The second is to join the resulting DataFrame with df2. That way, every row of df2 without a match gets NaN in y, which can then be replaced with "unknown" using df.fillna.

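cols = ["a_idx", "b_idx", "c_idx", "d_idx", "e_idx", "f_idx"]
# Keep only the needed columns and shorten "a_idx" -> "a", etc.
df = df1[cols + ["Evt_ID"]].rename(columns={c: c[0] for c in cols})

# One row per (event, category); the jet index ends up in "value"
df = df.melt(id_vars="Evt_ID", var_name="y")
# Jet index k corresponds to the name "Jet<k+1>" in df2
df["value"] = "Jet" + (df["value"] + 1).astype(str)

# Look up the category of each df2 row by (Evt_ID, Name)
df = df2.join(df.set_index(["Evt_ID", "value"]), on=["Evt_ID", "Name"])
df = df.fillna("unknown")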
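
To see the first trick in isolation, here is a minimal sketch of what melt does on a toy frame with only two category columns. The names toy and long_form are illustrative and not part of the snippet above.

```python
import pandas as pd

# Toy wide frame: one row per event, one column per category (illustrative data).
toy = pd.DataFrame({"a": [0, 1], "b": [1, 2], "Evt_ID": [346642, 917426]})

# melt keeps Evt_ID as an identifier and stacks the remaining columns:
# the old column name goes into "y", the old cell content into "value".
long_form = toy.melt(id_vars="Evt_ID", var_name="y")

print(long_form)
```

Each (event, category) pair becomes its own row, which is what makes the later lookup by (Evt_ID, Name) possible without a Python loop.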

In the end, df looks like:

     Name  Evt_ID        y
0    Jet1  346642        a
1    Jet2  346642        b
2    Jet3  346642        e
3    Jet4  346642        c
4    Jet5  346642        d
5    Jet6  346642  unknown
6    Jet7  346642        f
7  Lepton  346642  unknown
8    Jet1  917426  unknown
9    Jet2  917426        a

This result was obtained with the following sample data:

import pandas as pd


df1 = pd.DataFrame(
    [
        [0, 1, 3, 4, 2, 6, 346642],
        [1, 2, 3, 4, 5, 5, 917426],
        [0, 1, 3, 4, 2, 2, 123543],
    ],
    columns=["a_idx", "b_idx", "c_idx", "d_idx", "e_idx", "f_idx", "Evt_ID"],
)

df2 = pd.DataFrame(
    [
        ["Jet1", 346642],
        ["Jet2", 346642],
        ["Jet3", 346642],
        ["Jet4", 346642],
        ["Jet5", 346642],
        ["Jet6", 346642],
        ["Jet7", 346642],
        ["Lepton", 346642],
        ["Jet1", 917426],
        ["Jet2", 917426],
    ],
    columns=["Name", "Evt_ID"],
)
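
Two caveats may be worth checking on the real data. When two index columns of the same event point at the same jet (as e_idx and f_idx do for event 917426), the join returns one row per match, so that df2 row would appear twice. And calling fillna on the whole frame also fills NaN in the other columns of df2. The following is only a sketch of one way to handle both, under stated assumptions: the function name label_jets and the pt column are hypothetical, and the rule "the later category wins" is an assumption chosen to mirror how the original loop overwrites earlier assignments.

```python
import pandas as pd


def label_jets(df1, df2):
    """Return a copy of df2 with a category column "y" (hypothetical helper)."""
    cols = ["a_idx", "b_idx", "c_idx", "d_idx", "e_idx", "f_idx"]
    long_form = (
        df1[cols + ["Evt_ID"]]
        .rename(columns={c: c[0] for c in cols})
        .melt(id_vars="Evt_ID", var_name="y", value_name="jet")
    )
    long_form["Name"] = "Jet" + (long_form["jet"] + 1).astype(str)

    # Assumption: if several categories point at the same jet, keep the last
    # one (melt stacks a, b, ..., f in order, so "last" is the later category).
    long_form = long_form.drop_duplicates(subset=["Evt_ID", "Name"], keep="last")

    # A left merge keeps every row of df2 exactly once, in its original order.
    out = df2.merge(long_form[["Evt_ID", "Name", "y"]],
                    on=["Evt_ID", "Name"], how="left")

    # Fill only the new column, so NaN in other df2 columns is left alone.
    out["y"] = out["y"].fillna("unknown")
    return out


df1 = pd.DataFrame(
    [[0, 1, 3, 4, 2, 6, 346642], [1, 2, 3, 4, 5, 5, 917426]],
    columns=["a_idx", "b_idx", "c_idx", "d_idx", "e_idx", "f_idx", "Evt_ID"],
)
df2 = pd.DataFrame(
    {
        "Name": ["Jet3", "Jet6", "Lepton", "Jet2", "Jet6"],
        "Evt_ID": [346642, 346642, 346642, 917426, 917426],
        "pt": [1.0, None, 2.0, 3.0, 4.0],  # hypothetical extra column with a gap
    }
)

result = label_jets(df1, df2)
print(result)
```

In this sketch, Jet6 of event 917426 is claimed by both e and f and ends up as "f", while the missing pt value stays NaN. If df2 already has a column named y, it would need to be dropped or renamed before the merge.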

solution1
2 ACCEPTED 2021-05-27 08:45:21