
How to improve the calculation's speed in Python?

I am building a calculation that adds a new column to my DataFrame. Here is my data: [image not available]

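I need to create a new column "mob". The calculation for "mob" is:

  1. Check whether the row's "LoanId" is the same as the previous row's "LoanId", for example whether loan['LoanId'][0] == loan['LoanId'][1];
  2. If it is, check whether the previous row's "mob" > 0. If so, this row's "mob" is the previous row's value plus 1; if not, check whether this row's loan['repay_lbl'] is 1 or 2, and if so this row's "mob" is 1.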

My code is as follows:

for i in range(1, len(loan['LoanId'])):
    if loan['LoanId'][i-1] == loan['LoanId'][i]:
        if loan['mob'][i-1] > 0:
            loan['mob'][i] = loan['mob'][i-1] + 1
        elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
            loan['mob'][i] = 1

This code takes O(n) time. Is there any way to improve the algorithm and speed it up? I am just a beginner in Python. Thank you very much for your help.

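Since the value of the mob column in each row depends on the value in the previous row, it depends on all previous rows. This means you cannot run it in parallel, and you are basically stuck with O(n).

So I don't think numpy array operations will be of much use here.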

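Failing that, there are the usual general tricks for speeding up Python code [list not preserved].

I'm not sure whether the first two work with numpy/pandas. In those cases you might have to use plain Python lists for your data.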

Of course, before digging into any of these, you should consider whether your dataset is large enough to justify the effort.
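The "plain Python lists" idea can be sketched roughly as follows. This is only an illustration, not the answerer's code: the helper name `add_mob_with_lists` is hypothetical, and the 'mob' column is assumed to start at 0 in every row. The columns are pulled out as lists once, the same O(n) loop runs on the lists, and the result is written back with a single column assignment.

```python
import pandas as pd

def add_mob_with_lists(loan):
    """Hypothetical helper: compute 'mob' on plain lists, then write back once."""
    loan_ids = loan['LoanId'].tolist()
    repay = loan['repay_lbl'].tolist()
    mob = loan['mob'].tolist()          # assumed to start at 0 everywhere

    for i in range(1, len(loan_ids)):
        if loan_ids[i - 1] == loan_ids[i]:
            if mob[i - 1] > 0:
                mob[i] = mob[i - 1] + 1
            elif repay[i] in (1, 2):
                mob[i] = 1

    result = loan.copy()                # leave the caller's DataFrame untouched
    result['mob'] = mob                 # single column assignment
    return result

demo = pd.DataFrame({
    'LoanId':    [7, 7, 7, 7, 9, 9],
    'repay_lbl': [1, 0, 2, 0, 0, 1],
    'mob':       [0, 0, 0, 0, 0, 0],
})
print(add_mob_with_lists(demo)['mob'].tolist())  # [0, 0, 1, 2, 0, 1]
```

The loop is still sequential, so the complexity does not change; the sketch only avoids per-element pandas indexing inside the loop.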

Improving the run time by changing the looping method

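  • The loop visits all N rows without broadcasting, so the complexity is O(N)
  • Although every method is of order N, different looping methods have different constant (scaling) factors
  • These different constant factors make some methods much faster than others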
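To make the point about constant factors concrete, here is a minimal sketch that sums one column in two O(N) ways and times both with the standard library's `timeit`. The column name `a`, the size `N`, and both function names are assumptions made for this illustration; the measured numbers depend on the machine and the pandas version, so none are quoted here.

```python
import timeit
import pandas as pd

N = 2_000
df = pd.DataFrame({'a': range(N)})

def sum_by_series_index():
    # each df['a'][i] goes through pandas' column and label lookup
    total = 0
    for i in range(len(df)):
        total += df['a'][i]
    return total

def sum_by_zip():
    # zip iterates over the column values directly
    total = 0
    for (value,) in zip(df['a']):
        total += value
    return total

# Both loops are O(N) and compute the same result
assert sum_by_series_index() == sum_by_zip()

t_index = timeit.timeit(sum_by_series_index, number=5)
t_zip = timeit.timeit(sum_by_zip, number=5)
print(f"Series indexing: {t_index:.4f}s, zip: {t_zip:.4f}s")
```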

Inspired by: Different ways to iterate over rows in a Pandas Dataframe (performance comparison)

Methods

  1. For loop (the original post)
  2. iterrows
  3. itertuples
  4. zip

Summary

For 100K rows, the zip method is about 93 times faster than the for loop (i.e. the OP's method)

Test code

import pandas as pd
import numpy as np
from random import randint

def create_input(N):
    ' Creates a loan DataFrame with N rows '
    LoanId = [randint(0, N //4) for _ in range(N)]  # though random, N//4 ensures
                                                    # high likelihood some rows repeat
                                                    # LoanID
    repay_lbl = [randint(0, 2) for _ in range(N)]

    data = {'LoanId':LoanId, 'repay_lbl': repay_lbl, 'mob':[0]*N}
    return pd.DataFrame(data)

def m_itertuples(loan):
    ' Iterating using itertuples, set single values using at '
    loan = loan.copy()  # copy since the timing code calls the function multiple times
                        # and the input should not be modified;
                        # not necessary in general
    prev_loanID, prev_mob = None, None
    for row in loan.itertuples():  # iterate over rows as namedtuples
        if prev_loanID is not None:
            if prev_loanID == row.LoanId:
                if prev_mob > 0:
                    loan.at[row.Index, 'mob'] = prev_mob + 1
                elif row.repay_lbl == 1 or row.repay_lbl == 2:
                    loan.at[row.Index, 'mob'] = 1

        # Query the DataFrame for the latest values (the tuple may be stale)
        prev_loanID, prev_mob = loan.at[row.Index, 'LoanId'], loan.at[row.Index, 'mob']

    return loan
    
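def m_for_loop(loan):
    ' For loop over the data frame '
    loan = loan.copy()  # copy since the timing code calls the function multiple times
                        # and the input should not be modified;
                        # not necessary in general

    # OP's approach, kept as posted: chained indexing (loan['mob'][i] = ...) is slow
    # and may not write back to `loan` on recent pandas versions (copy-on-write)
    for i in range(1,len(loan['LoanId'])):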
        if loan['LoanId'][i-1] == loan['LoanId'][i]:
            if loan['mob'][i-1] > 0:
                loan['mob'][i] = loan['mob'][i-1] +1 
            elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
                loan['mob'][i] = 1
    return loan

def m_iterrows(loan):
    ' Iterating using iterrows, set single values using at '
    loan = loan.copy()  # copy since the timing code calls the function multiple times
                        # and the input should not be modified;
                        # not necessary in general
    prev_loanID, prev_mob = None, None
    for index, row in loan.iterrows(): # iterate over rows with iterrows()
        if prev_loanID is not None:
            if prev_loanID == row['LoanId']:
                if prev_mob > 0:
                    loan.at[index, 'mob'] = prev_mob + 1
                elif row['repay_lbl'] == 1 or row['repay_lbl'] == 2:
                    loan.at[index, 'mob'] = 1
                    
        # Query for latest values          
        prev_loanID, prev_mob = loan.at[index, 'LoanId'], loan.at[index, 'mob']
        
    return loan

def m_zip(loan):
    ' Iterating using zip, set single values using at '
    loan = loan.copy()  # copy since the timing code calls the function multiple times
                        # and the input should not be modified;
                        # not necessary in general
    prev_loanID, prev_mob = None, None
    columns = zip(loan['LoanId'], loan['mob'], loan['repay_lbl'])
    # enumerate() positions match the index labels only for the default RangeIndex
    for index, (loanID, mob, repay_lbl) in enumerate(columns):
        if prev_loanID is not None:
            if prev_loanID == loanID:
                if prev_mob > 0:
                    mob = loan.at[index, 'mob'] = prev_mob + 1
                elif repay_lbl == 1 or repay_lbl == 2:
                    mob = loan.at[index, 'mob'] = 1
        
        # Update to latest values
        prev_loanID, prev_mob = loanID, mob
        
    return loan

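Note: the iterator-based code queries the DataFrame for the updated data rather than reading it from the iterator, because of this warning in the pandas documentation:

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.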
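A small sketch of the behaviour the warning describes, using a made-up three-row frame with mixed column types (the data and variable names are assumptions for this illustration). With mixed dtypes, `iterrows` hands back a copy of each row, so assigning to `row` leaves the DataFrame unchanged, whereas `.at` writes into the DataFrame itself.

```python
import pandas as pd

# Mixed dtypes (strings and integers), so each row from iterrows() is a copy
df = pd.DataFrame({'LoanId': ['a', 'a', 'b'], 'mob': [0, 0, 0]})

for index, row in df.iterrows():
    row['mob'] = 99                  # changes only the temporary row object
after_row_write = df['mob'].tolist()

for index in df.index:
    df.at[index, 'mob'] = index + 1  # writes into the DataFrame itself
after_at_write = df['mob'].tolist()

print(after_row_write)  # [0, 0, 0]
print(after_at_write)   # [1, 2, 3]
```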

The resulting DataFrames were also compared using assert df1.equals(df2) to verify that the different methods produce the same result.
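One way such a check could be organised is sketched below. The helper `check_same_result` and the two toy functions are hypothetical stand-ins, not part of the answer; only `DataFrame.equals` comes from the text above.

```python
import pandas as pd

def check_same_result(funcs, df):
    """Hypothetical helper: run every function on a fresh copy and compare results."""
    results = [f(df.copy()) for f in funcs]
    for f, res in zip(funcs[1:], results[1:]):
        assert results[0].equals(res), f"{f.__name__} differs from {funcs[0].__name__}"
    return True

# Two toy implementations standing in for the methods being compared
def double_with_loop(df):
    for idx in df.index:
        df.at[idx, 'x'] = df.at[idx, 'x'] * 2
    return df

def double_vectorized(df):
    df['x'] = df['x'] * 2
    return df

frame = pd.DataFrame({'x': [1, 2, 3]})
print(check_same_result([double_with_loop, double_vectorized], frame))  # True
```

Note that `equals` also compares dtypes, so two methods that produce the same numbers with different column types would be reported as different.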

Timing code

Using benchit

import benchit

inputs = [create_input(int(n)) for n in 10**np.arange(6)]  # 1 to 10^5 rows
funcs = [m_for_loop, m_iterrows, m_itertuples, m_zip]

t = benchit.timings(funcs, inputs)

Results

Run time in seconds

Functions  m_for_loop  m_iterrows  m_itertuples     m_zip
Len                                                      
1            0.000217    0.000493      0.000781  0.000327
10           0.001070    0.002002      0.001008  0.000353
100          0.007100    0.016501      0.003062  0.000498
1000         0.056940    0.162423      0.021396  0.001057
10000        0.565809    1.625043      0.210858  0.006938
100000       5.890920   16.658842      2.179602  0.062953
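The speedup quoted in the summary can be recomputed from the last row of the table. The snippet below is a small arithmetic check; the dictionary and variable names are made up for this illustration and the numbers are copied from the 100,000-row line above.

```python
# Run times in seconds for 100,000 rows, copied from the table above
timings_100k = {
    'm_for_loop': 5.890920,
    'm_iterrows': 16.658842,
    'm_itertuples': 2.179602,
    'm_zip': 0.062953,
}

# How many times longer each method takes compared with m_zip
speedup_vs_zip = {name: t / timings_100k['m_zip'] for name, t in timings_100k.items()}

for name, factor in speedup_vs_zip.items():
    print(f"{name}: {factor:.1f}x the run time of m_zip")
```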

[Figure: method timings plot]
