For loop pandas and numpy: Performance
I have coded the following for loop. The main idea is that, for each occurrence of 'D' in the column 'A_D', it looks for all the rows that satisfy some specific conditions. When all the conditions are verified, a value is added to a list.
a = []
for i in df.index:
    if df['A_D'][i] == 'D':
        if df['TROUND_ID'][i] == ' ':
            vb = df[(df['O_D'] == df['O_D'][i])
                    & (df['A_D'] == 'A')
                    & (df['Terminal'] == df['Terminal'][i])
                    & (df['Operator'] == df['Operator'][i])]
            number = df['number_ac'][i]
            try:  ## if all the conditions above are verified, a value is added to the list
                x = df.START[i] - pd.Timedelta(int(number), unit='m')
                value = vb.loc[(vb.START - x).abs().idxmin()].FlightID
            except:  ## if they are not verified, a placeholder string is added instead
                value = 'No_link_found'
        else:
            value = 'Has_link'
    else:
        value = 'IsArrival'
    a.append(value)
My main problem is that df has millions of rows, so this for loop is far too time-consuming. Is there a vectorized solution that does not need a for loop?
An initial set of improvements: use apply rather than a loop; create a second dataframe at the start containing only the rows where df["A_D"] == "A"; and vectorise the value x.
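arr = df[df["A_D"] == "A"]
# if the next line is slow, apply it only to those rows where x is needed
# int() cannot convert a whole Series, so build the offsets with pd.to_timedelta
# (this assumes number_ac has no missing values)
df["x"] = df.START - pd.to_timedelta(df["number_ac"].astype(int), unit='m')

def link_func(row):
    if row["A_D"] != "D":
        return "IsArrival"
    if row["TROUND_ID"] != " ":
        return "Has_link"
    # each comparison needs its own parentheses because & binds tighter than ==
    vb = arr[(arr["O_D"] == row["O_D"])
             & (arr["Terminal"] == row["Terminal"])
             & (arr["Operator"] == row["Operator"])]
    try:
        return vb.loc[(vb.START - row["x"]).abs().idxmin()].FlightID
    except Exception:  # no matching arrival rows, so there is nothing to link to
        return "No_link_found"

df["a"] = df.apply(link_func, axis=1)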
Using apply is apparently more efficient than the explicit loop, but it does not automatically vectorise the calculation: link_func still runs once per row. Finding a value in arr based on each row of df is inherently time-consuming, however efficiently it is implemented. Consider whether the two parts of the original dataframe (where df["A_D"] == "A" and df["A_D"] == "D", respectively) can be reshaped into a wide format somehow.
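One way to pair each departure with an arrival without a per-row lookup is pd.merge_asof with direction="nearest", grouped by the three matching columns. The following is only a sketch under some assumptions, not a tested drop-in for the question's data: link_departures is a hypothetical helper name, the index of df is assumed to be unique, START is assumed to be a datetime column without missing values, and number_ac is assumed to convert cleanly to integers. When two arrivals are equally close, merge_asof may pick a different one than idxmin does.

```python
import pandas as pd

def link_departures(df: pd.DataFrame) -> pd.Series:
    """Label every row of df without a Python-level loop (hypothetical helper)."""
    keys = ["O_D", "Terminal", "Operator"]

    # Start with the default label, then overwrite the departure rows.
    out = pd.Series("IsArrival", index=df.index, dtype=object)
    is_dep = df["A_D"] == "D"
    out[is_dep & (df["TROUND_ID"] != " ")] = "Has_link"

    # Departures that still need a link, with their target time x.
    todo = df.loc[is_dep & (df["TROUND_ID"] == " "), keys + ["START", "number_ac"]].copy()
    todo["x"] = todo["START"] - pd.to_timedelta(todo["number_ac"].astype(int), unit="m")
    todo["row_id"] = todo.index  # remember where each result belongs

    arrivals = df.loc[df["A_D"] == "A", keys + ["START", "FlightID"]]
    arrivals = arrivals.rename(columns={"START": "arr_start"})

    # merge_asof needs both frames sorted by their "on" column.
    matched = pd.merge_asof(
        todo.sort_values("x"),
        arrivals.sort_values("arr_start"),
        left_on="x",
        right_on="arr_start",
        by=keys,
        direction="nearest",
    )

    # Departures with no arrival in their group come back with a missing FlightID.
    flight = matched["FlightID"].astype(object)
    flight = flight.where(flight.notna(), "No_link_found")
    out.loc[matched["row_id"].to_numpy()] = flight.to_numpy()
    return out
```

With this helper, df["a"] = link_departures(df) would fill the same column as the apply version. All of the matching happens in a single sorted merge, so the cost is dominated by the two sorts rather than by one filter of arr per departure.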
EDIT: You might be able to speed up the querying of arr by storing query strings in df, like this:
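# building the query strings assumes O_D, Terminal and Operator are string columns
df["query_string"] = ('O_D == "' + df["O_D"]
                      + '" & Terminal == "' + df["Terminal"]
                      + '" & Operator == "' + df["Operator"] + '"')

def link_func(row):
    vb = arr.query(row["query_string"])
    try:
        return vb.loc[(vb.START - row["x"]).abs().idxmin()].FlightID
    except Exception:  # no matching arrival rows
        return "No_link_found"

# apply returns a new Series; assigning to row["a"] inside the function would be lost
mask = (df["A_D"] == "D") & (df["TROUND_ID"] == " ")
df.loc[mask, "a"] = df[mask].apply(link_func, axis=1)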
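Another option for speeding up the lookup in arr, again only as a sketch: split the arrival rows once with groupby and keep the pieces in a dictionary, so that each departure searches only its own small group instead of filtering all of arr. The names KEYS, build_arrival_groups and nearest_flight are hypothetical, and the code assumes a unique index, no missing START values, and the x column computed earlier in this answer.

```python
import pandas as pd

KEYS = ["O_D", "Terminal", "Operator"]  # the matching columns from the question

def build_arrival_groups(df):
    """Split the arrival rows once, keyed by an (O_D, Terminal, Operator) tuple."""
    arr = df[df["A_D"] == "A"]
    return {key: group for key, group in arr.groupby(KEYS)}

def nearest_flight(row, groups):
    """Return the FlightID of the arrival closest in time to row['x']."""
    vb = groups.get((row["O_D"], row["Terminal"], row["Operator"]))
    if vb is None:  # no arrival shares this row's key
        return "No_link_found"
    return vb.loc[(vb["START"] - row["x"]).abs().idxmin(), "FlightID"]
```

It could be used in the same way as link_func: build groups = build_arrival_groups(df) once, then call df[mask].apply(nearest_flight, axis=1, groups=groups) on the departures without a link. This is still a per-row apply, so it reduces the cost of each lookup rather than removing the loop.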