简体   繁体   English

Python 'for' 循环性能太慢

[英]Python 'for' loop performance too slow

I have over 500,000 rows in my dataframe and a number of similar 'for' loops which are causing my code to take over a hour to complete its computation.我的 dataframe 中有超过 500,000 行,还有许多类似的“for”循环导致我的代码需要一个多小时才能完成计算。 Is there a more efficient way of writing the following 'for' loop so that things run a lot faster:是否有更有效的方法来编写以下“for”循环,以便运行得更快:

col_26 = []
col_27 = []
col_28 = []


for ind in df.index:
    if df['A_factor'][ind] > df['B_factor'][ind]:
        col_26.append('Yes')
        col_27.append('No')
        col_28.append(df['A_value'][ind])
    elif df['A_factor'][ind] < df['B_factor'][ind]:
        col_26.append('No')
        col_27.append('Yes')
        col_28.append(df['B_value'][ind])
    else:
        col_26.append('')
        col_27.append('')
        col_28.append(float('nan'))

You might want to look into the pandas iterrows() function or using apply, you can look at this article aswell: https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06 You might want to look into the pandas iterrows() function or using apply, you can look at this article aswell: https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-更快-805030df4f06

Try column operations:尝试列操作:

data = {'A_factor': [1, 2, 3, 4, 5],
        'A_value': [10, 20, 30, 40, 50],
           'B_factor': [2, 3, 1, 2, 6],
        'B_value': [11, 22, 33, 44, 55]}
df = pd.DataFrame(data)
df['col_26'] = ''
df['col_27'] = ''
df['col_28'] = np.nan

mask = df['A_factor'] > df['B_factor']
df.loc[mask, 'col_26'] = 'Yes'
df.loc[~mask, 'col_26'] = 'No'
df.loc[mask, 'col_28'] = df[mask]['A_value']

df.loc[~mask, 'col_27'] = 'Yes'
df.loc[mask, 'col_27'] = 'No'
df.loc[~mask, 'col_28'] = df[~mask]['B_value']

Appending to lists in Python is painfully slow.附加到 Python 中的列表非常缓慢。 Initializing the lists before the iteration can speed things up.在迭代之前初始化列表可以加快速度。 For example,例如,

def f():
    x = []
    for ii in range(500000):
        x.append(str(x))

def f2():
    x = [""] * 500000
    for ii in range(500000):
        x[ii] = str(x)


timeit.timeit("f()", "from __main__ import f", number=10)
# Output: 1.6317970999989484
timeit.timeit("f2()", "from __main__ import f2", number=10)
# Output: 1.3037318000024243

Since you're already using pandas / numpy, there are ways to vectorize your operations so they don't need looping.由于您已经在使用 pandas / numpy,因此有一些方法可以矢量化您的操作,因此它们不需要循环。 For example:例如:

a_factor = df["A_factor"].to_numpy()
b_factor = df["B_factor"].to_numpy()

col_26 = np.empty(a_factor.shape, dtype='U3') # U3 => string of size 3
col_27 = np.empty(a_factor.shape, dtype='U3')
col_28 = np.empty(a_factor.shape)

a_greater = a_factor > b_factor
b_greater = a_factor < b_factor
both_equal = a_factor == b_factor

col_26[a_greater] = 'Yes'
col_26[b_greater] = 'No'

col_27[a_greater] = 'Yes'
col_27[b_greater] = 'No'

col_28[a_greater] = a_factor[a_greater]
col_28[b_greater] = b_factor[b_greater]
col_28[both_equal] = np.nan

append causes python requests for heap memory to get more memory. append导致 python 请求堆 memory 以获取更多 ZCD69B4957F06CD818D7BF3D61980E29 using append in for loop causes get memory and free it continually to get more memory.for循环中使用append导致获得 memory 并不断释放它以获得更多 memory。 so it's better to say to python how many item you need.所以最好告诉 python 你需要多少物品。

col_26 = [True]*500000
col_27 = [False]*500000
col_28 = [float('nan')]*500000

for ind in df.index:
    if df['A_factor'][ind] > df['B_factor'][ind]:
        col_28[ind] = df['A_value'][ind]
    elif df['A_factor'][ind] < df['B_factor'][ind]:
        col_26[ind] = False
        col_27[ind] = True
        col_28[ind] = df['B_value'][ind]
    else:
        col_26[ind] = ''
        col_27[ind] = ''

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM