
How could I improve the runtime of this code?

I want to share some information between two dataframes. My code works but takes a long time. Do you know how I could improve my runtime? I am trying to do the following:

I have a dataframe df1 (it has 160 columns, but only the ones shown are important here):

         a_idx  b_idx  c_idx  d_idx  e_idx  f_idx Evt_ID
    0    0      1      3      4      2      6     346642
    1    1      2      3      4      5      5     917426
    2    0      1      3      4      2      2     123543
                        ...
                        

And a dataframe df2 (it has 10 columns, but only these are important here):

    Name    Evt_ID
0   Jet1    346642
1   Jet2    346642
2   Jet3    346642
3   Jet4    346642
4   Jet5    346642
5   Jet6    346642
6   Jet7    346642
7   Lepton  346642
8   Jet1    917426
9   Jet2    917426
      ...

Now I want a new column in df2 called "y" with the category of each row. The category can be found with the help of df1. The categories are category_list = ["a", "b", "c", "d", "e", "f"], and a row can also be "unknown". For example, the first row in df1 has the values category = [0,1,3,4,2,6], which means that df2 should look like this:

(explanation: the fifth number in category is 2 --> Jet(2+1) = Jet3 gets the fifth category in category_list: "e")

    Name    Evt_ID    y
0   Jet1    346642    a
1   Jet2    346642    b
2   Jet3    346642    e
3   Jet4    346642    c
4   Jet5    346642    d
5   Jet6    346642    unknown
6   Jet7    346642    f
7   Lepton  346642    unknown
     ...

My way to achieve this is the following:

from tqdm import tqdm

df2["y"] = "unknown"
category_list = ["a", "b", "c", "d", "e", "f"]

for event_id in tqdm(df1.Evt_ID):
    # Jet indices of this event, one per category (a..f)
    category = df1.loc[df1.Evt_ID == event_id, ["a_idx","b_idx",
                                               "c_idx", "d_idx", 
                                               "e_idx", "f_idx"]].values.squeeze()
    
    # Label the matching jet row in df2 with each category in turn
    for i, jet_index in enumerate(category):
        df2.loc[(df2.Evt_ID == event_id) & (df2.Name == "Jet" + str(jet_index + 1)), "y"] = category_list[i]

This code takes 30 or 60 minutes to run, depending on the Jupyter notebook it is running in. Why would the notebook itself affect the runtime? But more importantly: how can I improve the runtime?

The following snippet should run way faster thanks to its vectorized structure.

There are two tricks here. The first is to use df.melt, which efficiently turns the columns a, b, ..., f into rows. The second is to join the resulting DataFrame with df2. That way, every row of df2 without a match gets NaN, which can then be replaced with "unknown" using df.fillna.

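cols = ["a_idx", "b_idx", "c_idx", "d_idx", "e_idx", "f_idx"]
# Keep only the needed columns and shorten "a_idx" -> "a", etc.
df = df1[cols + ["Evt_ID"]].rename(columns={c: c[0] for c in cols})

# Wide -> long: one row per (Evt_ID, category), jet index in "value"
df = df.melt(id_vars="Evt_ID", var_name="y")
# Turn the 0-based jet index into the matching Name, e.g. 2 -> "Jet3"
df["value"] = "Jet" + (df["value"] + 1).astype(str)

# Look up each (Evt_ID, Name) pair of df2; unmatched rows get NaN in "y"
df = df2.join(df.set_index(["Evt_ID", "value"]), on=["Evt_ID", "Name"])
df = df.fillna("unknown")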

In the end, df looks like:

     Name  Evt_ID        y
0    Jet1  346642        a
1    Jet2  346642        b
2    Jet3  346642        e
3    Jet4  346642        c
4    Jet5  346642        d
5    Jet6  346642  unknown
6    Jet7  346642        f
7  Lepton  346642  unknown
8    Jet1  917426  unknown
9    Jet2  917426        a

This result was obtained with the following sample data:

import pandas as pd


df1 = pd.DataFrame(
    [
        [0, 1, 3, 4, 2, 6, 346642],
        [1, 2, 3, 4, 5, 5, 917426],
        [0, 1, 3, 4, 2, 2, 123543],
    ],
    columns=["a_idx", "b_idx", "c_idx", "d_idx", "e_idx", "f_idx", "Evt_ID"],
)

df2 = pd.DataFrame(
    [
        ["Jet1", 346642],
        ["Jet2", 346642],
        ["Jet3", 346642],
        ["Jet4", 346642],
        ["Jet5", 346642],
        ["Jet6", 346642],
        ["Jet7", 346642],
        ["Lepton", 346642],
        ["Jet1", 917426],
        ["Jet2", 917426],
    ],
    columns=["Name", "Evt_ID"],
)
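
One thing to watch for with the melt-and-join idea: an event can point two categories at the same jet (event 917426 in the sample has e_idx = f_idx = 5, i.e. Jet6), and a plain join would then return that df2 row twice. Below is a minimal sketch of a variant that guards against this, under some assumptions of mine: the helper name label_jets and the column name jet_index are made up for illustration, the index columns hold integers, and a later category should win over an earlier one (which is what the last assignment in the original loop does). The extra Jet6 row for 917426 in df2 is not in the sample data above; it is added here only to show the duplicate case.

```python
import pandas as pd

CATEGORY_COLS = ["a_idx", "b_idx", "c_idx", "d_idx", "e_idx", "f_idx"]


def label_jets(df1, df2):
    """Return a copy of df2 with a "y" column looked up from df1 (hypothetical helper)."""
    # Wide -> long: one row per (Evt_ID, category) with the jet index as value.
    long = df1[CATEGORY_COLS + ["Evt_ID"]].melt(
        id_vars="Evt_ID", var_name="y", value_name="jet_index"
    )
    long["y"] = long["y"].str[0]  # "a_idx" -> "a"
    long["Name"] = "Jet" + (long["jet_index"] + 1).astype(str)

    # melt stacks the columns in the order a, b, ..., f, so keep="last"
    # lets the later category win when two categories name the same jet.
    long = long.drop_duplicates(subset=["Evt_ID", "Name"], keep="last")

    # A left merge keeps every row of df2 in its original order.
    out = df2.merge(long[["Evt_ID", "Name", "y"]], on=["Evt_ID", "Name"], how="left")
    out["y"] = out["y"].fillna("unknown")  # only fill the new column
    return out


df1 = pd.DataFrame(
    [
        [0, 1, 3, 4, 2, 6, 346642],
        [1, 2, 3, 4, 5, 5, 917426],
        [0, 1, 3, 4, 2, 2, 123543],
    ],
    columns=CATEGORY_COLS + ["Evt_ID"],
)

# Sample df2 from above plus one extra row (Jet6, 917426) for the duplicate case.
df2 = pd.DataFrame(
    {
        "Name": [f"Jet{i}" for i in range(1, 8)] + ["Lepton", "Jet1", "Jet2", "Jet6"],
        "Evt_ID": [346642] * 8 + [917426] * 3,
    }
)

result = label_jets(df1, df2)
print(result)
```

Filling only the "y" column instead of calling fillna on the whole frame also avoids overwriting genuine NaN values in the other columns of df2, which may matter if the real df2 has missing data.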


 