I want to share some information between two dataframes. My code works but takes a long time. Do you know how I could improve my runtime? I am trying to do the following:
I have a dataframe df1 (it has 160 columns, but only the ones shown are important here):
a_idx b_idx c_idx d_idx e_idx f_idx Evt_ID
0 0 1 3 4 2 6 346642
1 1 2 3 4 5 5 917426
2 0 1 3 4 2 2 123543
...
And a dataframe df2 (it has 10 columns, but only these are important here):
Name Evt_ID
0 Jet1 346642
1 Jet2 346642
2 Jet3 346642
3 Jet4 346642
4 Jet5 346642
5 Jet6 346642
6 Jet7 346642
7 Lepton 346642
8 Jet1 917426
9 Jet2 917426
...
Now I want a new column in df2 called "y" with the category of each line. The category can be found with the help of df1. The categories are category_list = ["a", "b", "c", "d", "e", "f"], and a line can also be "unknown". For example, the first line in df1 has the values category = [0,1,3,4,2,6], which means that df2 should look like this
(explanation: the fifth number in category is 2 --> Jet(2+1) = Jet3 has the fifth category in category_list: "e"):
Name Evt_ID y
0 Jet1 346642 a
1 Jet2 346642 b
2 Jet3 346642 e
3 Jet4 346642 c
4 Jet5 346642 d
5 Jet6 346642 unknown
6 Jet7 346642 f
7 Lepton 346642 unknown
...
My way to achieve this is the following:
from tqdm import tqdm

df2["y"] = "unknown"
category_list = ["a", "b", "c", "d", "e", "f"]
for event_id in tqdm(df1.Evt_ID):
    # jet indices of the six categories for this event
    category = df1.loc[df1.Evt_ID == event_id, ["a_idx", "b_idx",
                                                "c_idx", "d_idx",
                                                "e_idx", "f_idx"]].values.squeeze()
    for i, jet_index in enumerate(category):
        # jet index k refers to the row named "Jet<k+1>"
        df2.loc[(df2.Evt_ID == event_id) & (df2.Name == "Jet" + str(jet_index + 1)), "y"] = category_list[i]
This code takes 30 or 60 minutes to run, depending on the Jupyter notebook it is running in. Why would the notebook itself affect the runtime? But more importantly: how can I improve the runtime?
The following snippet should run much faster thanks to its vectorized structure.
There are two tricks here. The first is to use df.melt, which efficiently turns the columns a, b, ..., f into rows. The second is to join the resulting DataFrame with df2. That way, all missing values become NaN and can be replaced with unknown using df.fillna.
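cols = ["a_idx", "b_idx", "c_idx", "d_idx", "e_idx", "f_idx"]
# keep only the index columns and rename "a_idx" -> "a", etc.
df = df1[cols + ["Evt_ID"]].rename(columns={c: c[0] for c in cols})
# wide -> long: one row per (Evt_ID, category y, jet index value)
df = df.melt(id_vars="Evt_ID", var_name="y")
# jet index k refers to the row named "Jet<k+1>"
df["value"] = "Jet" + (df["value"] + 1).astype(str)
# look up each (Evt_ID, Name) pair of df2; unmatched rows get NaN
df = df2.join(df.set_index(["Evt_ID", "value"]), on=["Evt_ID", "Name"])
df = df.fillna("unknown")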
In the end, df looks like:
Name Evt_ID y
0 Jet1 346642 a
1 Jet2 346642 b
2 Jet3 346642 e
3 Jet4 346642 c
4 Jet5 346642 d
5 Jet6 346642 unknown
6 Jet7 346642 f
7 Lepton 346642 unknown
8 Jet1 917426 unknown
9 Jet2 917426 a
This result was obtained with the following sample data:
import pandas as pd
df1 = pd.DataFrame(
    [
        [0, 1, 3, 4, 2, 6, 346642],
        [1, 2, 3, 4, 5, 5, 917426],
        [0, 1, 3, 4, 2, 2, 123543],
    ],
    columns=["a_idx", "b_idx", "c_idx", "d_idx", "e_idx", "f_idx", "Evt_ID"],
)
df2 = pd.DataFrame(
    [
        ["Jet1", 346642],
        ["Jet2", 346642],
        ["Jet3", 346642],
        ["Jet4", 346642],
        ["Jet5", 346642],
        ["Jet6", 346642],
        ["Jet7", 346642],
        ["Lepton", 346642],
        ["Jet1", 917426],
        ["Jet2", 917426],
    ],
    columns=["Name", "Evt_ID"],
)
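One caveat: in the sample data, e_idx and f_idx both point at jet 5 for event 917426. The loop in the question lets the later category overwrite the earlier one in that case, whereas a join on duplicated keys would return the matching df2 row twice. Below is a minimal sketch of a variant that deduplicates before matching. It uses merge instead of join, which is a swapped-in technique, not what the snippet above does. The helper name label_jets and the extra Jet6 row for event 917426 are assumptions added for illustration only.

```python
import pandas as pd

IDX_COLS = ["a_idx", "b_idx", "c_idx", "d_idx", "e_idx", "f_idx"]


def label_jets(df1, df2, idx_cols=IDX_COLS):
    """Hypothetical helper: return a copy of df2 with a category column "y"."""
    # Wide -> long: one row per (Evt_ID, category, jet index).
    long = df1[idx_cols + ["Evt_ID"]].melt(
        id_vars="Evt_ID", var_name="y", value_name="jet_idx"
    )
    long["y"] = long["y"].str.replace("_idx", "", regex=False)
    # Jet index k refers to the row named "Jet<k+1>".
    long["Name"] = "Jet" + (long["jet_idx"] + 1).astype(str)
    # melt stacks the columns in order a..f, so keep="last" mimics the loop,
    # where a later category overwrites an earlier one for the same jet.
    long = long.drop_duplicates(subset=["Evt_ID", "Name"], keep="last")
    # A left merge keeps every row of df2; unmatched rows get NaN in "y".
    out = df2.merge(long[["Evt_ID", "Name", "y"]], on=["Evt_ID", "Name"], how="left")
    out["y"] = out["y"].fillna("unknown")
    return out


df1 = pd.DataFrame(
    [
        [0, 1, 3, 4, 2, 6, 346642],
        [1, 2, 3, 4, 5, 5, 917426],
        [0, 1, 3, 4, 2, 2, 123543],
    ],
    columns=IDX_COLS + ["Evt_ID"],
)
names = ["Jet1", "Jet2", "Jet3", "Jet4", "Jet5", "Jet6", "Jet7", "Lepton"]
df2 = pd.DataFrame(
    [[n, 346642] for n in names]
    # "Jet6" for event 917426 is an added row that exercises the duplicate case
    + [["Jet1", 917426], ["Jet2", 917426], ["Jet6", 917426]],
    columns=["Name", "Evt_ID"],
)

result = label_jets(df1, df2)
print(result)
```

Filling only the "y" column, rather than calling fillna on the whole frame, leaves NaN values in the other columns of df2 untouched, which may matter when df2 has more columns than shown here.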