How to improve the calculation's speed in Python?
I'm building a calculation to add a new column to my dataframe. Here is my data:
I need to create a new column "mob". The calculation of "mob" is that
My code is as below:
for i in range(1, len(loan['LoanId'])):
    if loan['LoanId'][i-1] == loan['LoanId'][i]:
        if loan['mob'][i-1] > 0:
            loan['mob'][i] = loan['mob'][i-1] + 1
        elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
            loan['mob'][i] = 1
The code runs in O(n) time. Is there any way to improve the algorithm and speed it up? I'm just a beginner at Python. I would really appreciate your help.
Since the value of the mob column for each row depends on that of the previous row, it depends on all previous rows. That means that you can't run this in parallel and you're basically stuck with O(n).
So I don't think that numpy array operations are going to be of much use here.
Failing that, there is the usual bag of tricks to speed up Python code;
I'm not sure if the first two work well with numpy/pandas. You might have to use normal Python lists for your data in those cases.
Of course, before you dive into any of these, you should consider whether your data set is large enough to warrant the effort.
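As a rough illustration of the "normal Python lists" idea, the sketch below copies the three columns into lists, runs the same loop on them, and writes the result back once. The function name `mob_with_lists` and the sample data are hypothetical. It also assumes that the `elif` in the question belongs to the `mob` check and that `mob` starts at 0.

```python
import pandas as pd

def mob_with_lists(loan):
    """Compute 'mob' on plain Python lists, then assign the column once."""
    loan_ids = loan['LoanId'].tolist()
    repay = loan['repay_lbl'].tolist()
    mob = loan['mob'].tolist()
    for i in range(1, len(loan_ids)):
        if loan_ids[i - 1] == loan_ids[i]:
            if mob[i - 1] > 0:
                mob[i] = mob[i - 1] + 1      # keep counting within the same loan
            elif repay[i] in (1, 2):
                mob[i] = 1                   # start counting at this row
    result = loan.copy()                     # leave the input untouched
    result['mob'] = mob
    return result

# Hypothetical sample data, not the asker's real table
loan = pd.DataFrame({
    'LoanId':    [1, 1, 1, 1, 2, 2, 2],
    'repay_lbl': [0, 1, 0, 2, 2, 0, 1],
    'mob':       [0, 0, 0, 0, 0, 0, 0],
})
print(mob_with_lists(loan)['mob'].tolist())  # [0, 1, 2, 3, 0, 0, 1]
```

The loop is still O(n), but each step works on list elements instead of going through pandas indexing, which is usually where most of the time goes.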
Improving Time by Changing Looping Method
Improving loop time based upon
Inspired by - Different ways to iterate over rows in a Pandas Dataframe - performance comparison
Methods
Summary
The zip method was 93x faster than the for loop (i.e. the OP's method) for 100K rows
Test Code
import pandas as pd
import numpy as np
from random import randint

def create_input(N):
    ' Creates a loan DataFrame with N rows '
    # Though random, N // 4 ensures a high likelihood that some rows repeat a LoanId
    LoanId = [randint(0, N // 4) for _ in range(N)]
    repay_lbl = [randint(0, 2) for _ in range(N)]
    data = {'LoanId': LoanId, 'repay_lbl': repay_lbl, 'mob': [0] * N}
    return pd.DataFrame(data)

def m_itertuples(loan):
    ' Iterating using itertuples, set single values using at '
    loan = loan.copy()  # copy since the timing code calls the function multiple times,
                        # so we don't want to modify the input; not necessary in general
    prev_loanID, prev_mob = None, None
    for row in loan.itertuples():  # iterate over rows with itertuples()
        if prev_loanID is not None:
            if prev_loanID == row.LoanId:
                if prev_mob > 0:
                    loan.at[row.Index, 'mob'] = prev_mob + 1
                elif row.repay_lbl == 1 or row.repay_lbl == 2:
                    loan.at[row.Index, 'mob'] = 1
        # Query the DataFrame for the latest values
        prev_loanID, prev_mob = loan.at[row.Index, 'LoanId'], loan.at[row.Index, 'mob']
    return loan

def m_for_loop(loan):
    ' For loop over the data frame '
    loan = loan.copy()  # copy since the timing code calls the function multiple times,
                        # so we don't want to modify the input; not necessary in general
    # The OP's original loop (chained indexing), kept unchanged for the comparison
    for i in range(1, len(loan['LoanId'])):
        if loan['LoanId'][i-1] == loan['LoanId'][i]:
            if loan['mob'][i-1] > 0:
                loan['mob'][i] = loan['mob'][i-1] + 1
            elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
                loan['mob'][i] = 1
    return loan

def m_iterrows(loan):
    ' Iterating using iterrows, set single values using at '
    loan = loan.copy()  # copy since the timing code calls the function multiple times,
                        # so we don't want to modify the input; not necessary in general
    prev_loanID, prev_mob = None, None
    for index, row in loan.iterrows():  # iterate over rows with iterrows()
        if prev_loanID is not None:
            if prev_loanID == row['LoanId']:
                if prev_mob > 0:
                    loan.at[index, 'mob'] = prev_mob + 1
                elif row['repay_lbl'] == 1 or row['repay_lbl'] == 2:
                    loan.at[index, 'mob'] = 1
        # Query the DataFrame for the latest values
        prev_loanID, prev_mob = loan.at[index, 'LoanId'], loan.at[index, 'mob']
    return loan

def m_zip(loan):
    ' Iterating using zip, set single values using at '
    loan = loan.copy()  # copy since the timing code calls the function multiple times,
                        # so we don't want to modify the input; not necessary in general
    prev_loanID, prev_mob = None, None
    for index, (loanID, mob, repay_lbl) in enumerate(zip(loan['LoanId'], loan['mob'], loan['repay_lbl'])):
        if prev_loanID is not None:
            if prev_loanID == loanID:
                if prev_mob > 0:
                    mob = loan.at[index, 'mob'] = prev_mob + 1
                elif repay_lbl == 1 or repay_lbl == 2:
                    mob = loan.at[index, 'mob'] = 1
        # Update to the latest values
        prev_loanID, prev_mob = loanID, mob
    return loan
Note: The iterator-based code queries the DataFrame for updated values rather than reading them from the iterator, due to the following warning:
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
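A small sketch of what that warning means in practice, using a hypothetical two-row DataFrame with mixed column types (so that `iterrows()` hands back copies): writing to the yielded row leaves the DataFrame unchanged, while writing through `.at` updates it.

```python
import pandas as pd

# Hypothetical data with mixed dtypes (string and integer columns)
df = pd.DataFrame({'LoanId': ['a', 'a'], 'mob': [0, 0]})

for index, row in df.iterrows():
    row['mob'] = 99                  # modifies only the row copy
after_row_write = df['mob'].tolist()

for index, row in df.iterrows():
    df.at[index, 'mob'] = 1          # writes into the DataFrame itself
after_at_write = df['mob'].tolist()

print(after_row_write)  # [0, 0]
print(after_at_write)   # [1, 1]
```

This is why the functions above set values with `loan.at[...]` and then read the previous row's values back from the DataFrame.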
Also compared DataFrames using assert df1.equals(df2) to verify that the different methods produced identical results
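For reference, a minimal sketch of how such a check behaves, with hypothetical frames `a`, `b` and `c` standing in for the outputs of different methods. `DataFrame.equals` compares values and also requires matching column dtypes, so a method that silently turns `mob` into floats would fail the check.

```python
import pandas as pd

# Hypothetical outputs of two methods that agree
a = pd.DataFrame({'LoanId': [1, 1, 2], 'mob': [0, 1, 0]})
b = pd.DataFrame({'LoanId': [1, 1, 2], 'mob': [0, 1, 0]})

# Same values, but 'mob' stored as float64 instead of an integer dtype
c = b.astype({'mob': 'float64'})

assert a.equals(b)        # identical values and dtypes
assert not a.equals(c)    # dtype mismatch is reported as "not equal"
```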
Timing Code
import benchit  # third-party benchmarking package

inputs = [create_input(i) for i in 10**np.arange(6)]  # 1 to 10^5 rows
funcs = [m_for_loop, m_iterrows, m_itertuples, m_zip]
t = benchit.timings(funcs, inputs)
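If benchit is not available, a similar comparison can be roughly sketched with the standard library's `timeit`. The helper `best_time` and the two placeholder workloads below are hypothetical. In practice the functions under test (m_for_loop, m_zip and so on) and the DataFrames from create_input would be passed in instead.

```python
import timeit

def best_time(func, arg, repeat=3, number=1):
    """Return the best per-call time in seconds over `repeat` runs."""
    runs = timeit.repeat(lambda: func(arg), repeat=repeat, number=number)
    return min(runs) / number

# Placeholder workloads standing in for the real methods
def slow_sum(values):
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total

def fast_sum(values):
    return sum(values)

for n in (10, 1000, 100000):
    data = list(range(n))
    timings = {f.__name__: best_time(f, data) for f in (slow_sum, fast_sum)}
    print(n, timings)
```

Taking the minimum over a few repeats reduces noise from other processes. The absolute numbers will differ from machine to machine.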
Results
Run time in seconds
Len      m_for_loop   m_iterrows   m_itertuples   m_zip
1        0.000217     0.000493     0.000781       0.000327
10       0.001070     0.002002     0.001008       0.000353
100      0.007100     0.016501     0.003062       0.000498
1000     0.056940     0.162423     0.021396       0.001057
10000    0.565809     1.625043     0.210858       0.006938
100000   5.890920     16.658842    2.179602       0.062953