
For loop pandas and numpy: Performance

I have coded the following for loop. The main idea is that, for each occurrence of 'D' in the column 'A_D', it looks for all the rows where some specific conditions hold. When all the conditions are met, a value is added to a list.

import pandas as pd

a = []
for i in df.index:
    if df['A_D'][i] == 'D':
        if df['TROUND_ID'][i] == '        ':
            # arrivals sharing this departure's O_D, Terminal and Operator
            vb = df[(df['O_D'] == df['O_D'][i])
                    & (df['A_D'] == 'A')
                    & (df['Terminal'] == df['Terminal'][i])
                    & (df['Operator'] == df['Operator'][i])]

            number = df['number_ac'][i]
            try:  # if a matching arrival exists, take the one whose START is closest to x
                x = df.START[i] - pd.Timedelta(int(number), unit='m')
                value = vb.loc[(vb.START - x).abs().idxmin()].FlightID
            except Exception:  # otherwise a placeholder string is added to the list
                value = 'No_link_found'
        else:
            value = 'Has_link'
    else:
        value = 'IsArrival'
    a.append(value)  # one value per row, so this belongs inside the loop

My main problem is that df has millions of rows, so this for loop is far too time consuming. Is there a vectorized solution that does not need a for loop?

An initial set of improvements: use apply rather than a loop; create a second dataframe at the start containing only the rows where df["A_D"] == "A"; and vectorise the value x.

arr = df[df["A_D"] == "A"]
# if the next line is slow, apply it only to those rows where x is needed
# (astype(int) assumes number_ac has no missing values)
df["x"] = df.START - pd.to_timedelta(df["number_ac"].astype(int), unit="m")

def link_func(row):
    if row["A_D"] != "D":
        return "IsArrival"
    if row["TROUND_ID"] != "        ":
        return "Has_link"
    # each comparison needs its own parentheses because & binds tighter than ==
    vb = arr[(arr["O_D"] == row["O_D"])
             & (arr["Terminal"] == row["Terminal"])
             & (arr["Operator"] == row["Operator"])]
    try:
        return vb.loc[(vb.START - row["x"]).abs().idxmin()].FlightID
    except (ValueError, KeyError):  # no usable candidate arrivals
        return "No_link_found"

df["a"] = df.apply(link_func, axis=1)
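The two branches of link_func that need no lookup ("IsArrival" and "Has_link") could probably be assigned for the whole column at once, leaving apply only for the rows that really need a search. A minimal sketch with numpy.select, assuming the blank TROUND_ID is exactly eight spaces as it appears in the question; the sample data and the "needs_lookup" placeholder label are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data using the column names from the question.
BLANK = " " * 8  # assumed: a missing TROUND_ID is stored as eight spaces
df = pd.DataFrame({
    "A_D": ["A", "D", "D"],
    "TROUND_ID": [BLANK, "T1      ", BLANK],
})

is_departure = df["A_D"] == "D"
has_link = df["TROUND_ID"] != BLANK

# Conditions are evaluated in order; the first one that is True wins.
df["a"] = np.select(
    [~is_departure, has_link],
    ["IsArrival", "Has_link"],
    default="needs_lookup",  # placeholder label, not part of the original code
)

# Only these rows still need the nearest-arrival search.
needs_lookup = df["a"] == "needs_lookup"
print(df)
```

The per-row function then only has to run on df[needs_lookup], which may be a small fraction of the millions of rows.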

Using apply is reportedly more efficient than an explicit loop, but it does not vectorise the calculation: link_func still runs once per row. Looking up a value in arr for each row of df is inherently time consuming, however efficiently it is implemented. Consider whether the two parts of the original dataframe (where df["A_D"] == "A" and df["A_D"] == "D", respectively) can be reshaped into a wide format somehow.
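One way to pair the departure rows with the arrival rows without a per-row search might be pandas.merge_asof, which matches each left row to the nearest right row on a sorted key, optionally within exact-match groups. This is a different technique from the apply approach above, sketched here under some assumptions: the sample data is invented, the blank TROUND_ID is taken to be eight spaces, START and number_ac contain no missing values, and ties between equally close arrivals may resolve differently from idxmin.

```python
import pandas as pd

BLANK = " " * 8  # assumed: a missing TROUND_ID is stored as eight spaces
keys = ["O_D", "Terminal", "Operator"]

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "FlightID": ["F1", "F2", "F3", "F4", "F5"],
    "A_D": ["A", "A", "D", "D", "D"],
    "TROUND_ID": [BLANK, BLANK, BLANK, BLANK, "T9      "],
    "O_D": ["LHR", "LHR", "LHR", "CDG", "LHR"],
    "Terminal": ["T1"] * 5,
    "Operator": ["XX"] * 5,
    "START": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 11:00", "2024-01-01 10:40",
        "2024-01-01 12:00", "2024-01-01 11:20",
    ]),
    "number_ac": [30, 30, 30, 30, 30],
})

# Target time for every row: START minus number_ac minutes (vectorised).
df["x"] = df["START"] - pd.to_timedelta(df["number_ac"].astype(int), unit="m")

needs_lookup = (df["A_D"] == "D") & (df["TROUND_ID"] == BLANK)

# merge_asof needs both sides sorted by their time key.
left = (df.loc[needs_lookup, keys + ["x"]]
          .rename_axis("row")      # remember the original row label
          .reset_index()
          .sort_values("x"))
right = (df.loc[df["A_D"] == "A", keys + ["START", "FlightID"]]
           .sort_values("START"))

# Nearest arrival in time, restricted to identical O_D/Terminal/Operator.
matched = pd.merge_asof(left, right, left_on="x", right_on="START",
                        by=keys, direction="nearest")

df["a"] = "IsArrival"
df.loc[df["A_D"] == "D", "a"] = "Has_link"
df.loc[matched["row"].to_numpy(), "a"] = (
    matched["FlightID"].fillna("No_link_found").to_numpy()
)
print(df[["FlightID", "A_D", "a"]])
```

Departures with no arrival in the same O_D/Terminal/Operator group come back with a missing FlightID, which the sketch maps to "No_link_found". Whether this is faster on the real data would need to be measured.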

EDIT: You might be able to speed up the querying of arr by storing query strings in df, like this:

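df["query_string"] = ('O_D == "' + df["O_D"]
                    + '" & Terminal == "' + df["Terminal"]
                    + '" & Operator == "' + df["Operator"] + '"')

def link_func(row):
    vb = arr.query(row["query_string"])
    try:
        # return the result; assigning to row["a"] would not reach df
        return vb.loc[(vb.START - row["x"]).abs().idxmin()].FlightID
    except (ValueError, KeyError):  # no usable candidate arrivals
        return "No_link_found"

# only departures without a TROUND_ID need the lookup;
# rows outside the mask are left untouched
mask = (df["A_D"] == "D") & (df["TROUND_ID"] == "        ")
df.loc[mask, "a"] = df[mask].apply(link_func, axis=1)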
