How to improve the calculation's speed in Python?

I'm building a calculation to add a new column to my DataFrame. Here is my data: (image not available)

I need to create a new column "mob". The value of "mob" for a row is calculated as follows:

  1. Check whether the "LoanId" of the row is the same as that of the previous row, for example whether loan['LoanId'][0] == loan['LoanId'][1].
  2. If so, check whether the "mob" of the previous row is > 0. If it is, the "mob" of this row is the previous row's value plus 1. If it is not, check whether loan['repay_lbl'] of this row is 1 or 2; if so, the "mob" of this row is 1.

My code is as below:

for i in range(1, len(loan['LoanId'])):
    if loan['LoanId'][i-1] == loan['LoanId'][i]:
        if loan['mob'][i-1] > 0:
            loan['mob'][i] = loan['mob'][i-1] + 1
        elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
            loan['mob'][i] = 1

The code runs in O(n). Is there any way to improve the algorithm and speed it up? I'm just a beginner in Python. I would really appreciate your help.

Since the value of the mob column for each row depends on that of the previous row, it depends on all previous rows. That means that you can't run this in parallel and you're basically stuck with O(n).

So I don't think that numpy array operations are going to be of much use here.

Failing that, there is the usual bag of tricks to speed up Python code.

I'm not sure if the first two work well with numpy/pandas. You might have to use normal Python lists for your data in those cases.
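As a rough illustration of the "normal Python lists" idea, the sketch below pulls the three columns out as lists, runs the question's rule over them, and writes the result back in a single assignment. The function name mob_with_lists and the toy data are made up for this example; it assumes the column names from the question.

```python
import pandas as pd

def mob_with_lists(loan):
    """Apply the question's rule on plain lists, then write 'mob' back once."""
    loan_ids = loan['LoanId'].tolist()
    repay = loan['repay_lbl'].tolist()
    mob = loan['mob'].tolist()

    for i in range(1, len(loan_ids)):
        if loan_ids[i - 1] == loan_ids[i]:
            if mob[i - 1] > 0:
                mob[i] = mob[i - 1] + 1
            elif repay[i] in (1, 2):
                mob[i] = 1

    result = loan.copy()   # leave the caller's DataFrame untouched
    result['mob'] = mob    # single column assignment instead of per-row writes
    return result

# Toy data, invented for illustration only
loan = pd.DataFrame({
    'LoanId':    [7, 7, 7, 7, 8, 8],
    'repay_lbl': [1, 0, 2, 0, 0, 1],
    'mob':       [0, 0, 0, 0, 0, 0],
})
result = mob_with_lists(loan)
print(result['mob'].tolist())  # [0, 0, 1, 2, 0, 1]
```

The loop is still O(n), but each step touches only list elements rather than going through pandas indexing.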

Of course, before you dive into any of these, you should consider whether your data set is large enough to warrant the effort.

Improving Time by Changing Looping Method

The improvement in loop time rests on these observations:

  • The code loops through all N rows without broadcasting, so the complexity is O(N)
  • While all methods are O(N), different looping methods have different constant factors
  • These constant factors make some methods much faster than others
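The point about constant factors can be checked with a small experiment. The sketch below uses the standard-library timeit module and a made-up DataFrame to time two ways of reading the same column: per-element indexing as in the original post, and iterating with zip. Both are O(N); only the per-element cost differs. The function names and the size n are assumptions for illustration, and the measured times will vary by machine and pandas version.

```python
import timeit
import pandas as pd

n = 10_000
loan = pd.DataFrame({'LoanId': range(n), 'mob': [1] * n})

def indexed_access():
    # One pandas lookup per element, as in the original post's for loop
    total = 0
    for i in range(n):
        total += loan['mob'][i]
    return total

def zipped_access():
    # Iterate over the column values directly
    total = 0
    for _, mob in zip(loan['LoanId'], loan['mob']):
        total += mob
    return total

t_indexed = timeit.timeit(indexed_access, number=3)
t_zipped = timeit.timeit(zipped_access, number=3)
print(f"indexed: {t_indexed:.4f} s, zip: {t_zipped:.4f} s")
```

Both functions return the same sum, so any gap in the printed times comes from the access pattern alone.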

Inspired by: "Different ways to iterate over rows in a Pandas Dataframe - performance comparison"

Methods

  1. For loop (original post)
  2. iterrows
  3. itertuples
  4. zip

Summary

For 100K rows, the zip method was 93x faster than the for loop (i.e. the OP's method).

Test Code

import pandas as pd
import numpy as np
from random import randint

def create_input(N):
    ' Creates a loan DataFrame with N rows '
    LoanId = [randint(0, N //4) for _ in range(N)]  # though random, N//4 ensures
                                                    # high likelihood some rows repeat
                                                    # LoanID
    repay_lbl = [randint(0, 2) for _ in range(N)]

    data = {'LoanId':LoanId, 'repay_lbl': repay_lbl, 'mob':[0]*N}
    return pd.DataFrame(data)

def m_itertuples(loan):
    ' Iterating using itertuples, set single values using at '
    loan = loan.copy()  # copy since timing calls the function multiple times
                        # so we don't want to modify the input
                        # not necessary in general
    prev_loanID, prev_mob = None, None
    for row in loan.itertuples():  # iterate over rows with itertuples()
        if prev_loanID is not None:
            if prev_loanID == row.LoanId:
                if prev_mob > 0:
                    loan.at[row.Index, 'mob'] = prev_mob + 1
                elif row.repay_lbl == 1 or row.repay_lbl == 2:
                    loan.at[row.Index, 'mob'] = 1

        # Query the DataFrame (not the row tuple) for the latest values
        prev_loanID, prev_mob = loan.at[row.Index, 'LoanId'], loan.at[row.Index, 'mob']

    return loan

    
def m_for_loop(loan):
    ' For loop over the data frame (the original post method) '
    loan = loan.copy()  # copy since timing calls the function multiple times
                        # so we don't want to modify the input
                        # not necessary in general

    # Kept as in the original post; the chained assignment loan['mob'][i] = ...
    # can raise SettingWithCopyWarning, and loan.at[i, 'mob'] is the safer form
    for i in range(1, len(loan['LoanId'])):
        if loan['LoanId'][i-1] == loan['LoanId'][i]:
            if loan['mob'][i-1] > 0:
                loan['mob'][i] = loan['mob'][i-1] + 1
            elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
                loan['mob'][i] = 1
    return loan

def m_iterrows(loan):
    ' Iterating using iterrows, set single values using at '
    loan = loan.copy()  # copy since timing calls the function multiple times
                        # so we don't want to modify the input
                        # not necessary in general
    prev_loanID, prev_mob = None, None
    for index, row in loan.iterrows():  # iterate over rows with iterrows()
        if prev_loanID is not None:
            if prev_loanID == row['LoanId']:
                if prev_mob > 0:
                    loan.at[index, 'mob'] = prev_mob + 1
                elif row['repay_lbl'] == 1 or row['repay_lbl'] == 2:
                    loan.at[index, 'mob'] = 1

        # Query the DataFrame (not the row Series) for the latest values
        prev_loanID, prev_mob = loan.at[index, 'LoanId'], loan.at[index, 'mob']

    return loan

def m_zip(loan):
    ' Iterating using zip, set single values using at '
    loan = loan.copy()  # copy since timing calls the function multiple times
                        # so we don't want to modify the input
                        # not necessary in general
    prev_loanID, prev_mob = None, None
    # enumerate() positions are used as labels in .at, so this assumes
    # the default RangeIndex (as produced by create_input)
    for index, (loanID, mob, repay_lbl) in enumerate(zip(loan['LoanId'], loan['mob'], loan['repay_lbl'])):
        if prev_loanID is not None:
            if prev_loanID == loanID:
                if prev_mob > 0:
                    mob = loan.at[index, 'mob'] = prev_mob + 1
                elif repay_lbl == 1 or repay_lbl == 2:
                    mob = loan.at[index, 'mob'] = 1

        # Update to latest values
        prev_loanID, prev_mob = loanID, mob

    return loan

Note: The iterator code queries the DataFrame for updated data rather than reading it from the iterator, due to this warning in the pandas documentation:

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
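A minimal sketch of what the warning means in practice, using a made-up three-row DataFrame. The extra float column 'rate' is an assumption added only to make the frame mixed-dtype, in which case iterrows() builds each row as a separate Series.

```python
import pandas as pd

loan = pd.DataFrame({
    'LoanId': [7, 7, 8],
    'rate':   [0.1, 0.2, 0.3],  # hypothetical float column, makes the frame mixed-dtype
    'mob':    [0, 0, 0],
})

# Writing to the row object yielded by iterrows() changes only that copy
for index, row in loan.iterrows():
    row['mob'] = 99

after_row_write = loan['mob'].tolist()
print(after_row_write)  # [0, 0, 0]

# Writing through the DataFrame with .at does take effect
for index, row in loan.iterrows():
    loan.at[index, 'mob'] = index + 1

after_at_write = loan['mob'].tolist()
print(after_at_write)  # [1, 2, 3]
```

This is why the functions above read prev_mob back through loan.at instead of trusting the value carried by the iterator.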

The resulting DataFrames were also compared using assert df1.equals(df2) to verify that the different methods produce identical results.
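The check can be sketched as follows. To keep the example self-contained it defines two compact variants of the rule (mob_at and mob_zip, hypothetical names that are not the benchmarked functions) and compares their output on seeded random data. The small LoanId range is an assumption chosen so that consecutive repeats are common.

```python
import pandas as pd
from random import randint, seed

def mob_at(loan):
    """Positional for loop using .at for every read and write."""
    loan = loan.copy()
    for i in range(1, len(loan)):
        if loan.at[i - 1, 'LoanId'] == loan.at[i, 'LoanId']:
            if loan.at[i - 1, 'mob'] > 0:
                loan.at[i, 'mob'] = loan.at[i - 1, 'mob'] + 1
            elif loan.at[i, 'repay_lbl'] in (1, 2):
                loan.at[i, 'mob'] = 1
    return loan

def mob_zip(loan):
    """zip-based loop that tracks the previous row in local variables."""
    loan = loan.copy()
    prev_id, prev_mob = None, 0
    columns = zip(loan['LoanId'], loan['mob'], loan['repay_lbl'])
    for i, (loan_id, mob, repay) in enumerate(columns):
        if prev_id == loan_id:
            if prev_mob > 0:
                mob = loan.at[i, 'mob'] = prev_mob + 1
            elif repay in (1, 2):
                mob = loan.at[i, 'mob'] = 1
        prev_id, prev_mob = loan_id, mob
    return loan

seed(0)  # fixed seed so the comparison is repeatable
n = 200
loan = pd.DataFrame({
    'LoanId':    [randint(0, 3) for _ in range(n)],  # few IDs, so neighbours often match
    'repay_lbl': [randint(0, 2) for _ in range(n)],
    'mob':       [0] * n,
})

df1, df2 = mob_at(loan), mob_zip(loan)
assert df1.equals(df2)  # same values and dtypes in every column
```

Note that equals() also compares dtypes, so a method that silently upcasts 'mob' to float would fail the check.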

Timing Code

Using benchit

import benchit  # third-party package: pip install benchit

inputs = [create_input(int(n)) for n in 10**np.arange(6)]  # 1 to 10^5 rows
funcs = [m_for_loop, m_iterrows, m_itertuples, m_zip]

t = benchit.timings(funcs, inputs)

Results

Run time in seconds

Functions  m_for_loop  m_iterrows  m_itertuples     m_zip
Len                                                      
1            0.000217    0.000493      0.000781  0.000327
10           0.001070    0.002002      0.001008  0.000353
100          0.007100    0.016501      0.003062  0.000498
1000         0.056940    0.162423      0.021396  0.001057
10000        0.565809    1.625043      0.210858  0.006938
100000       5.890920   16.658842      2.179602  0.062953
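The 93x figure in the summary follows from the last row of the table. The short calculation below reproduces it; the numbers are copied from the table and the variable names are made up for this example.

```python
# Run times in seconds for 100,000 rows, copied from the results table
timings_100k = {
    'm_for_loop': 5.890920,
    'm_iterrows': 16.658842,
    'm_itertuples': 2.179602,
    'm_zip': 0.062953,
}

# Ratio of each method's run time to that of m_zip
ratio_to_zip = {name: t / timings_100k['m_zip'] for name, t in timings_100k.items()}

for name, ratio in ratio_to_zip.items():
    print(f"{name:>13}: {ratio:6.1f}x the run time of m_zip")
```

At this size the for loop takes roughly 93 to 94 times as long as zip, itertuples about 35 times, and iterrows about 265 times.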

[Figure: timings of the methods]
