How to improve the calculation's speed in Python?
I am building a calculation to add a new column to my dataframe. Here is my data:
I need to create a new column "mob". The calculation of "mob" is
My code is as follows:
for i in range(1, len(loan['LoanId'])):
    if loan['LoanId'][i-1] == loan['LoanId'][i]:
        if loan['mob'][i-1] > 0:
            loan['mob'][i] = loan['mob'][i-1] + 1
        elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
            loan['mob'][i] = 1
This code takes O(n) time. Is there any way to improve the algorithm and speed it up? I am just a beginner in Python. Thank you very much for your help.
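Improving time by changing the looping method
Based on improving looping time
Inspired by: Different ways to iterate over rows in a Pandas Dataframe - performance comparison
Methods
for loop (the OP's method), iterrows, itertuples, zip
Summary
For 100K rows, the zip method is about 93 times faster than the for loop (i.e. the OP's method)
Test code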
import pandas as pd
import numpy as np
from random import randint

def create_input(N):
    ' Creates a loan DataFrame with N rows '
    # Though random, drawing IDs from 0..N//4 ensures a high likelihood
    # that some rows repeat a LoanId
    LoanId = [randint(0, N // 4) for _ in range(N)]
    repay_lbl = [randint(0, 2) for _ in range(N)]
    data = {'LoanId': LoanId, 'repay_lbl': repay_lbl, 'mob': [0] * N}
    return pd.DataFrame(data)

def m_itertuples(loan):
    ' Iterating using itertuples, set single values using at '
    loan = loan.copy()  # copy since timing calls the function multiple times,
                        # so we don't want to modify the input
                        # (not necessary in general)
    prev_loanID, prev_mob = None, None
    for row in loan.itertuples():  # iterate over rows with itertuples()
        if prev_loanID is not None:
            if prev_loanID == row.LoanId:
                if prev_mob > 0:
                    loan.at[row.Index, 'mob'] = prev_mob + 1
                elif row.repay_lbl == 1 or row.repay_lbl == 2:
                    loan.at[row.Index, 'mob'] = 1
        # Query the DataFrame for the latest values (row.mob may be stale)
        prev_loanID, prev_mob = loan.at[row.Index, 'LoanId'], loan.at[row.Index, 'mob']
    return loan

def m_for_loop(loan):
    ' For loop over the data frame (the OP method, kept unchanged as the baseline) '
    loan = loan.copy()  # copy since timing calls the function multiple times,
                        # so we don't want to modify the input
                        # (not necessary in general)
    for i in range(1, len(loan['LoanId'])):
        if loan['LoanId'][i-1] == loan['LoanId'][i]:
            if loan['mob'][i-1] > 0:
                loan['mob'][i] = loan['mob'][i-1] + 1
            elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
                loan['mob'][i] = 1
    return loan

def m_iterrows(loan):
    ' Iterating using iterrows, set single values using at '
    loan = loan.copy()  # copy since timing calls the function multiple times,
                        # so we don't want to modify the input
                        # (not necessary in general)
    prev_loanID, prev_mob = None, None
    for index, row in loan.iterrows():  # iterate over rows with iterrows()
        if prev_loanID is not None:
            if prev_loanID == row['LoanId']:
                if prev_mob > 0:
                    loan.at[index, 'mob'] = prev_mob + 1
                elif row['repay_lbl'] == 1 or row['repay_lbl'] == 2:
                    loan.at[index, 'mob'] = 1
        # Query the DataFrame for the latest values
        prev_loanID, prev_mob = loan.at[index, 'LoanId'], loan.at[index, 'mob']
    return loan

def m_zip(loan):
    ' Iterating using zip, set single values using at '
    loan = loan.copy()  # copy since timing calls the function multiple times,
                        # so we don't want to modify the input
                        # (not necessary in general)
    prev_loanID, prev_mob = None, None
    # enumerate() positions are used as labels for .at, which relies on the
    # default RangeIndex that create_input produces
    for index, (loanID, mob, repay_lbl) in enumerate(zip(loan['LoanId'], loan['mob'], loan['repay_lbl'])):
        if prev_loanID is not None:
            if prev_loanID == loanID:
                if prev_mob > 0:
                    mob = loan.at[index, 'mob'] = prev_mob + 1
                elif repay_lbl == 1 or repay_lbl == 2:
                    mob = loan.at[index, 'mob'] = 1
        # Update to the latest values
        prev_loanID, prev_mob = loanID, mob
    return loan
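Note: the iterator code queries the dataframe for updated data rather than reading it from the iterator, because of this warning:
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.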
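The effect described in that warning can be seen on a tiny example. This is only a sketch: the three-row frame and the extra `rate` column are made up for illustration, and the mixed int/float dtypes are chosen so that `iterrows()` has to build a new Series for each row.

```python
import pandas as pd

# Hypothetical toy frame; mixed dtypes mean each row is handed out as a copy
loan = pd.DataFrame({'LoanId': [1, 1, 2],
                     'rate': [0.1, 0.2, 0.3],
                     'mob': [0, 0, 0]})

# Writing to the row object only changes the per-row copy
for index, row in loan.iterrows():
    row['mob'] = 99
unchanged = loan['mob'].tolist()

# Writing through .at changes the DataFrame itself
for index, row in loan.iterrows():
    loan.at[index, 'mob'] = 99
changed = loan['mob'].tolist()

print(unchanged, changed)
```

This is why the functions above write with `loan.at[...]` and, in the iterrows and itertuples versions, read the previous `mob` back from the DataFrame instead of from the row object.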
The DataFrames were also compared using assert df1.equals(df2) to verify that the different methods produced the same result.
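A minimal sketch of such an equality check follows. The names `mob_reference` and `mob_arrays` and the seven-row input are hypothetical. `mob_arrays` is not one of the methods timed in this answer; it is just a second implementation of the same rule (looping over NumPy arrays and assigning the column once) so that there is something to compare against.

```python
import pandas as pd

def mob_reference(loan):
    """Row-by-row baseline that reads and writes single cells with .at."""
    loan = loan.copy()
    for i in range(1, len(loan)):
        if loan.at[i - 1, 'LoanId'] == loan.at[i, 'LoanId']:
            if loan.at[i - 1, 'mob'] > 0:
                loan.at[i, 'mob'] = loan.at[i - 1, 'mob'] + 1
            elif loan.at[i, 'repay_lbl'] in (1, 2):
                loan.at[i, 'mob'] = 1
    return loan

def mob_arrays(loan):
    """Hypothetical variant: same rule, applied to NumPy arrays."""
    loan = loan.copy()
    ids = loan['LoanId'].to_numpy()
    lbl = loan['repay_lbl'].to_numpy()
    mob = loan['mob'].to_numpy().copy()   # private copy that we mutate
    for i in range(1, len(ids)):
        if ids[i - 1] == ids[i]:
            if mob[i - 1] > 0:
                mob[i] = mob[i - 1] + 1
            elif lbl[i] == 1 or lbl[i] == 2:
                mob[i] = 1
    loan['mob'] = mob                     # write the whole column back once
    return loan

# Small hand-made input (assumed data, default RangeIndex)
loan = pd.DataFrame({
    'LoanId':    [7, 7, 7, 7, 8, 8, 9],
    'repay_lbl': [0, 1, 0, 2, 0, 0, 1],
    'mob':       [0, 0, 0, 0, 0, 0, 0],
})

df1, df2 = mob_reference(loan), mob_arrays(loan)
assert df1.equals(df2)   # same values, dtypes and index
print(df1['mob'].tolist())
```

`DataFrame.equals` also compares dtypes, so it would flag a method that silently turned `mob` into a float column.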
Timing code
Using benchit
import benchit

# 1 to 10^5 rows; int() passes plain Python ints to create_input
inputs = [create_input(int(n)) for n in 10**np.arange(6)]
funcs = [m_for_loop, m_iterrows, m_itertuples, m_zip]
t = benchit.timings(funcs, inputs)
Results
Run time in seconds
Len      m_for_loop  m_iterrows  m_itertuples  m_zip
1        0.000217    0.000493    0.000781      0.000327
10       0.001070    0.002002    0.001008      0.000353
100      0.007100    0.016501    0.003062      0.000498
1000     0.056940    0.162423    0.021396      0.001057
10000    0.565809    1.625043    0.210858      0.006938
100000   5.890920    16.658842   2.179602      0.062953
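The speedup figures can be recomputed from the last row of the table. The snippet below only does that arithmetic on the reported timings; it does not rerun the benchmark, and the `timings` and `speedup` names are chosen here for illustration.

```python
# Timings in seconds for 100,000 rows, copied from the results above
timings = {
    'm_for_loop': 5.890920,
    'm_iterrows': 16.658842,
    'm_itertuples': 2.179602,
    'm_zip': 0.062953,
}

# Speedup relative to the for loop (values below 1 mean slower)
baseline = timings['m_for_loop']
speedup = {name: baseline / seconds for name, seconds in timings.items()}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.1f}x")
```

On these numbers zip comes out roughly 93 times faster than the for loop, itertuples a little under 3 times faster, and iterrows slower than the for loop.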